Princeton-CDH / geniza

version 4.x of the Princeton Geniza Project
https://geniza.princeton.edu
Apache License 2.0

As an admin, I want data from PGP v3 links database imported into the new database so that I can manage links from the main admin site. #325

Closed rlskoeser closed 2 years ago

rlskoeser commented 3 years ago

New manage command to import relevant data from a CSV export of the links table in the PGP v3 database. Records should be imported differently based on the link type. Records with type image, iiif, cudl, or transcription can all be ignored, since we're already handling that content in a different way.
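
A minimal sketch of the link-type filtering described above, not the actual command. The column names (`object_id`, `link_type`, `link_target`) and the function name `partition_links` are assumptions for illustration; the real CSV export may use different headers.

```python
import csv
import io

# Link types the issue describes importing.
IMPORTED_TYPES = {"goitein_note", "indexcard", "jewish-traders", "india-traders"}


def partition_links(rows):
    """Split link rows into (to_import, ignored) based on link type."""
    to_import, ignored = [], []
    for row in rows:
        # "link_type" is an assumed column name
        if row["link_type"] in IMPORTED_TYPES:
            to_import.append(row)
        else:
            # image, iiif, cudl, transcription (handled elsewhere)
            ignored.append(row)
    return to_import, ignored


# Tiny in-memory stand-in for the exported links file (made-up rows).
sample = io.StringIO(
    "object_id,link_type,link_target\n"
    "451,goitein_note,a.pdf\n"
    "451,iiif,https://example.org/manifest\n"
    "502,indexcard,6094\n"
)
to_import, ignored = partition_links(csv.DictReader(sample))
```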

In all cases, we should match by PGPID; if the PGPID is not found, search and match on old PGPID, for attachments that belong to merged documents.
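
The lookup order can be sketched in plain Python as follows. The `Document` class and its `old_pgpids` field are hypothetical stand-ins for the database model, used only to show the fallback.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document record."""
    pgpid: int
    old_pgpids: list = field(default_factory=list)


def find_document(pgpid, documents):
    """Match on current PGPID first, then fall back to old PGPIDs."""
    by_id = {doc.pgpid: doc for doc in documents}
    if pgpid in by_id:
        return by_id[pgpid]
    # Merged documents keep the PGPIDs of the records merged into them.
    for doc in documents:
        if pgpid in doc.old_pgpids:
            return doc
    return None  # caller reports the link as "document not found"


docs = [Document(100, old_pgpids=[200, 201]), Document(300)]
```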

Script should create admin log entry documenting any sources & footnotes created and any footnote changed.

We only want to run the script once in production, but for developer testing and convenience we may want to make it idempotent.

goitein_note

These links correspond to the typed texts footnotes already in the database. Find the footnote based on PGPID or old PGPID; filter by source title/author if necessary to find the Goitein typed texts source for the document. (If there are multiple candidates on merged documents we'll have to figure out how to resolve that.) Update the footnote with the url to the scan: base url is https://commons.princeton.edu/media/geniza/ and link_target is the rest.

indexcard

It looks like there are no footnotes for index cards; they will have to be created by the script. We will need to make new source records, probably segmented into volumes by shelfmark prefix, similar to what we did to make the Goitein typed texts manageable. When creating the new footnote, set the view url based on the card id in the links table link target, e.g. https://geniza.princeton.edu/indexcards/index.php?a=card&id=6094 . Index cards also have thumbnails based on card id: https://geniza.princeton.edu/indexcards/index.php?a=image&id=6094 . Not sure if we want to store that anywhere or just have index-card specific logic to generate the thumbnail link if and where we want to embed it.
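
A small sketch of generating both index card urls from a card id, following the two example urls above. The function name `indexcard_urls` is hypothetical.

```python
INDEXCARD_BASE = "https://geniza.princeton.edu/indexcards/index.php"


def indexcard_urls(card_id):
    """Build the view and thumbnail urls for an index card id."""
    return {
        "view": f"{INDEXCARD_BASE}?a=card&id={card_id}",
        "thumbnail": f"{INDEXCARD_BASE}?a=image&id={card_id}",
    }


urls = indexcard_urls(6094)
```

Generating the thumbnail on demand like this would avoid storing a second url per footnote.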

jewish-traders

There is an existing source, with some footnotes: Letters of Medieval Jewish Traders

Match footnotes when possible; otherwise create new ones. All footnotes should have doc relation = translation.

Link is based on the target link in the table; base url is https://s3.amazonaws.com/goitein-lmjt/ . Thumbnail path is based on link url, e.g. https://s3.amazonaws.com/goitein-lmjt/thumbs/01.png . If we want to use thumbnails, we may need to request slightly larger versions.
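
A sketch of deriving the link and thumbnail urls. It assumes the link target is a filename such as `01.pdf` and that the thumbnail shares the same stem under `thumbs/` with a `.png` extension; that pattern is inferred from the single example above and is not confirmed. `lmjt_urls` is a hypothetical name.

```python
from pathlib import PurePosixPath

LMJT_BASE = "https://s3.amazonaws.com/goitein-lmjt/"


def lmjt_urls(link_target):
    """Return (link url, thumbnail url) for a Jewish Traders link target."""
    # Assumption: thumbnail filename is the link target's stem plus ".png"
    stem = PurePosixPath(link_target).stem
    return LMJT_BASE + link_target, f"{LMJT_BASE}thumbs/{stem}.png"


link_url, thumb_url = lmjt_urls("01.pdf")
```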

india-traders

This is "India Book" in our database; sources and footnotes do exist. Base url for links is https://s3.amazonaws.com/goitein-india-traders/ ; remaining url is in link target

On the public site, these are labeled as translation; e.g., see https://geniza.princeton.edu/pgpsearch/?a=object&id=5392 . But I don't see where the logic is for that; it's not in the links table.

All footnotes should have doc relation = translation.

rlskoeser commented 3 years ago

questions:

richmanrachel commented 3 years ago

@rlskoeser - here is the link for the existing source record for Jewish Traders: https://geniza.cdh.princeton.edu/admin/footnotes/source/170/change/?_changelist_filters=q%3Dmedieval

rlskoeser commented 3 years ago

Great, @richmanrachel thank you! I see there are only 14 footnotes; so the script will try to match links to existing footnotes and then create new ones associated with that source if there isn't a footnote.

rlskoeser commented 3 years ago

documenting answers from today's meeting (also updated issue notes)

* do we have existing source record & footnotes for the jewish-traders links?
  • source record exists, some footnotes exist; others will need to be created
* how valuable/useful are thumbnails? These sources are inconsistent about whether they provide them.
* how does PGP v3 know that the india-traders links are translations?
  • Both the india-traders and jewish-traders sources are all translations, so every footnote should be set accordingly

kmcelwee commented 2 years ago

@richmanrachel

richmanrachel commented 2 years ago

@kmcelwee

kmcelwee commented 2 years ago

@richmanrachel @rlskoeser

There are ~10 PGPIDs with multiple (perhaps duplicate) footnotes with the same source and document. These all fall under the "India Traders" category. Here's one: https://test-geniza.cdh.princeton.edu/admin/corpus/document/4717/change/?_changelist_filters=q%3D4717

Should the URL be applied to all of them? Are there any duplicate / redundant footnotes we can remove?

Here is the list of PGPIDs: 4717, 4721, 4740, 4743, 4738, 5410, 2691, 9089

richmanrachel commented 2 years ago

@kmcelwee - 4717 is definitely a duplicate and a mistake (the only difference between the two transcriptions it pulls is whether the words "recto" "verso" "margins" are in Hebrew or English). Is there a current logic for me to know which of the footnotes the English one is associated with? Or do I have to go into BitBucket?

richmanrachel commented 2 years ago

Also, @rlskoeser - is there an easy way to make the source box bigger in document detail view? The names get cut off so I have to open the edit window to see which book the source is from (image attached).

richmanrachel commented 2 years ago

@kmcelwee - I looked at each of these entries and they fall under 2-3 categories.

1) Script merge pulled together 2 transcriptions that are identical in content, but the language of the scholarly additions is different (4717, 4740, 4738) and thus the computer kept both separately.

2) Script merge pulled together 2 transcriptions that are identical in content, but different in format (4721, 4743) and thus the computer kept both separately.

3) Other:

@rlskoeser - do you have a recommendation for workflow? I can't tell which of the two footnotes to delete for categories 1 and 2.

richmanrachel commented 2 years ago

@rlskoeser and @kmcelwee - Just got rid of all the duplicates (and chose the slightly more robust version of 5410 and deleted the other).

I'll pass 2691 and 9089 to Alan.

rlskoeser commented 2 years ago

@richmanrachel I refreshed the data in the test site from production data and ran the new add links script. Here's the summary output from the script:

Imported 9,858 links; ignored 6,301; failed to import 181.
181 documents not found in database.
Created 53 new sources.
Created 8,762 new footnotes; updated 1,068.

I'm attaching output from a second run of the script with verbose output, so you'll have details on the documents that are not found: missing_link_docs.txt

For your reference, here's the links file I used for import, which I exported from PGP v3 recently. links.csv

This import raises a new problem where we have multiple urls for the same source (i.e., multiple different index cards reference the same text). I mentioned it to Gissoo today, I'm not sure if it's something we should solve in the data model or via design/interface. This document is a good example of the problem: https://test-geniza.cdh.princeton.edu/en/documents/502/scholarship/

Gissoo had some ideas, but one thing she suggested was that if we had labels for these links (locations in our footnotes), it would be a lot better. I wondered if the document links for the goitein would be meaningful as a "location" / label, if we included the portion before the PGPID? Here are some examples:

5C.1.1 NN_ Michael_ pt.1/AIU VII.E.5_1 (PGPID 451).pdf
6B.2.1 BM Transcripts/BL Or. 5542.23_1 (PGPID 468).pdf
5C.1.2 Nahray 101_ pt.1/BL Or. 5542.34_2 (PGPID 469).pd
5C.2.3 Wills_ inventories_ death/Bodl. MS Heb. a 2_9_1 (PGPID 499).pdf
6B.1.11 Mediterranean people materials/Bodl. MS Heb. a 3_24_1 (PGPID 502).pdf

The links file doesn't have labels for the index cards, but it looks like they are labeled by number on the index cards site. We could do the same here, putting something like "Card 5867" in the location.

richmanrachel commented 2 years ago

@rlskoeser - the note cards aren't linked correctly. When I looked up ID 6176, I could click on the note card but it led me to the general site https://geniza.princeton.edu/indexcards/, rather than the correct card. I tried it with a different id and the same thing happened.

rlskoeser commented 2 years ago

@richmanrachel are you clicking on the main citation link (which does currently link to the entire site), or on the "includes" link?

richmanrachel commented 2 years ago

@rlskoeser - ahhh, that does indeed work better! I was not expecting that to be the link. Is this the UI it's going to be? It's confusing.

rlskoeser commented 2 years ago

@richmanrachel yeah, I see that! We could decide not to link the index cards source to the index cards site — I thought it would be nice to provide access to the full site, but maybe it's too confusing. I wonder if labeling the footnote locations (as I asked about above) would help enough?

richmanrachel commented 2 years ago

While trying to check the History, I wanted to edit the document relation and add a note, but when I tried to save I got an error message asking me to fix the URLs that were added by the script (image attached).

rlskoeser commented 2 years ago

@richmanrachel ooh, good catch. will investigate

rlskoeser commented 2 years ago

@richmanrachel we need to urlencode the urls before setting them in the database. I'll start a to-do list for what needs to be done when we kick this back for revision (there will be other things to add to the list, I'm sure!)
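
A minimal sketch of the encoding step, using the standard library's `urllib.parse.quote` on the link target before appending it to the base url. The base url is the one given for goitein_note links in the issue description; the function name `goitein_url` is hypothetical, and this is not necessarily how the script ended up doing it.

```python
from urllib.parse import quote

GOITEIN_BASE = "https://commons.princeton.edu/media/geniza/"


def goitein_url(link_target):
    """Percent-encode the link target and append it to the base url."""
    # quote() leaves "/" unescaped by default, so the path structure survives;
    # spaces and parentheses are percent-encoded.
    return GOITEIN_BASE + quote(link_target)


url = goitein_url("5C.1.1 NN_ Michael_ pt.1/AIU VII.E.5_1 (PGPID 451).pdf")
```

Encoding only the link target, rather than the whole url, keeps the `https://` prefix intact.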

richmanrachel commented 2 years ago

@rlskoeser - I do see log entries in the Log Entry list, but not in the Document Detail History page

richmanrachel commented 2 years ago

> Imported 9,858 links; ignored 6,301; failed to import 181. 181 documents not found in database. Created 53 new sources. Created 8,762 new footnotes; updated 1,068.

  • @rlskoeser - I'm not quite sure which part of this I should worry about?

> I'm attaching output from a second run of the script with verbose output, so you'll have details on the documents that are not found: missing_link_docs.txt

  • Indeed, I couldn't find the documents I tried to search for either. I guess this will be like the mismatched TEI that we can deal with later since we won't lose any information in the switch?

> This import raises a new problem where we have multiple urls for the same source (i.e., multiple different index cards reference the same text). I mentioned it to Gissoo today, I'm not sure if it's something we should solve in the data model or via design/interface. This document is a good example of the problem: https://test-geniza.cdh.princeton.edu/en/documents/502/scholarship/

  • Ooof. Yes, this and 9774 and a few others were like this.

> Gissoo had some ideas, but one thing she suggested was that if we had labels for these links (locations in our footnotes), it would be a lot better. I wondered if the document links for the goitein would be meaningful as a "location" / label, if we included the portion before the PGPID? Here are some examples:
>
> 5C.1.1 NN_ Michael_ pt.1/AIU VII.E.5_1 (PGPID 451).pdf
> 6B.2.1 BM Transcripts/BL Or. 5542.23_1 (PGPID 468).pdf
> 5C.1.2 Nahray 101_ pt.1/BL Or. 5542.34_2 (PGPID 469).pd
> 5C.2.3 Wills_ inventories_ death/Bodl. MS Heb. a 2_9_1 (PGPID 499).pdf
> 6B.1.11 Mediterranean people materials/Bodl. MS Heb. a 3_24_1 (PGPID 502).pdf
>
> The links file doesn't have labels for the index cards, but it looks like they are labeled by number on the index cards site. We could do the same here, putting something like "Card 5867" in the location.

  • Marina likes this solution because these labels show how Goitein organized his own files. I find them a bit unfriendly, but it's certainly not worth renaming everything when these are helpful...

I think those were my main questions/thoughts/concerns!

rlskoeser commented 2 years ago

> @rlskoeser - I do see log entries in the Log Entry list, but not in the Document Detail History page

This is expected behavior — the document itself is not directly modified; only footnotes and sources. Maybe check a couple of the index card sources and footnotes?

rlskoeser commented 2 years ago

@richmanrachel more responses...

  1. Nothing to worry about with the script summary (except possibly the missing documents), just sharing so you have a sense of what the script did.
  2. I agree with you — I think the missing ids are probably similar to what we ran into with the TEI, those documents no longer exist. And 181 missing out of almost 10k is not bad! After seeing your comment, I had a thought — the script can generate a filtered csv of the missing documents for the link types we're importing, so we have an easier starting point if someone wants to review and re-associate those links at some point in the future.
  3. I think having names for the links will make a big difference, even if the names aren't always meaningful! I have a follow up question for you and Marina about that.

rlskoeser commented 2 years ago

@richmanrachel @mrustow following up with some detail questions on the "locations" (which is used as a link name on the scholarship records page).

  1. For the goitein notes (typed texts), I propose we use filename up to but not including the PGPID, so 5C.1.1 NN_ Michael_ pt.1/AIU VII.E.5_1 (PGPID 451).pdf would become something like 5C.1.1 NN_ Michael_ pt.1/AIU VII.E.5_1. Or we could use just the portion before the slash. It looks like the underscore `_` has been used to replace some characters; in the part before the first slash I think commas would make sense, but I'm not sure what the underscore should become in the shelfmark portion. Please advise which part of the filename you want me to use and if I should replace any underscores.
  2. For index cards, propose we use location/label "Card ###" — this will be consistent with what is displayed on the index cards site.
  3. India traders labels and filenames end with labels like I-1-2, I-31a, II-13-15; we're using the first part to map to India Book 1/2/3; can we use the part after that as location/label? Are these document numbers?
  4. Jewish Traders labels and filenames are purely numeric, 01.pdf, etc. Are these numbers meaningful and appropriate to use as label/location? Or are they an artifact of how it was digitized?

richmanrachel commented 2 years ago

> the script can generate a filtered csv of the missing documents for the link types we're importing, so we have an easier starting point if someone wants to review and re-associate those links at some point in the future.

  • Great idea, @rlskoeser. Let's do that!

For your numbered questions:

1) Good, we agree. I think the underscore between words is just a space, and after the shelfmark it appears to just indicate that Goitein wrote more than one text on this document (there are two linked on the current frontend). So I don't think we need to change the underscores?

2) Yes, use existing card numbers.

3) Yes, we should use this numbering system. I'm pretty sure they're not document numbers but locations within the book. Goitein sectioned most of his books using this type of extensive sorting.

4) Jewish Traders does use these as document numbers in the book, so they are appropriate. We can just add "Traders Document" or something before the number?

rlskoeser commented 2 years ago

The underscores occur before a space, so I don't think they are just spaces. But if you're ok with leaving them, it's easier not to touch them.

For Jewish Traders: please let me know what label you want before the number. Document? It will appear with the citation, so we don't need to repeat Traders necessarily.

richmanrachel commented 2 years ago

@rlskoeser - Oh strange... I don't think we need to touch them.

Yes, Document should be fine for Traders.
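
The location rules agreed on in this thread can be sketched roughly as below. All function names are hypothetical, and a few details are assumptions rather than decisions recorded here: that the goitein filename always ends in ` (PGPID nnn).pdf`, that the Jewish Traders number is kept exactly as it appears in the filename, and that India Traders labels start with a roman numeral I, II, or III followed by a hyphen.

```python
import re
from pathlib import PurePosixPath

# Assumed mapping from label prefix to India Book volume number.
INDIA_BOOKS = {"I": 1, "II": 2, "III": 3}


def goitein_location(link_target):
    """Filename up to but not including the PGPID; underscores untouched."""
    return re.sub(r"\s*\(PGPID \d+\)\.pdf$", "", link_target)


def indexcard_location(card_id):
    """Same label the index cards site displays."""
    return f"Card {card_id}"


def jewish_traders_location(link_target):
    """Document number taken from the numeric filename, e.g. 01.pdf."""
    return f"Document {PurePosixPath(link_target).stem}"


def india_traders_location(label):
    """Split a label like II-13-15 into (book number, location in book)."""
    prefix, _, rest = label.partition("-")
    return INDIA_BOOKS[prefix], rest
```

For example, `goitein_location("5C.1.1 NN_ Michael_ pt.1/AIU VII.E.5_1 (PGPID 451).pdf")` returns `5C.1.1 NN_ Michael_ pt.1/AIU VII.E.5_1`.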

rlskoeser commented 2 years ago

@richmanrachel here's the output from running the updated version of the script:

Imported 9,858 links; ignored 6,301; failed to import 181.
181 documents not found in database.
Created 53 new sources.
Created 8,762 new footnotes; updated 1,068.
Saved report of documents not found to /tmp/documents_not_found_in_links.csv.

Please confirm that footnote locations are now set according to the logic we agreed on.

Here's the csv file of documents not found, so you can check that it's appropriate for later data work/review to try to identify these. documents_not_found_in_links.csv
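
A sketch of how a filtered report like this could be produced with the standard `csv` module. The column names and the function name `write_not_found_report` are assumptions for illustration; only rows with an imported link type whose PGPID is unknown are written.

```python
import csv
import io

IMPORTED_TYPES = {"goitein_note", "indexcard", "jewish-traders", "india-traders"}
FIELDNAMES = ["object_id", "link_type", "link_target"]  # assumed column names


def write_not_found_report(rows, known_pgpids, out):
    """Write rows whose document is missing to `out`; return how many."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in rows:
        if row["link_type"] not in IMPORTED_TYPES:
            continue  # ignored link types are left out of the report
        if int(row["object_id"]) not in known_pgpids:
            writer.writerow(row)
            count += 1
    return count


# Made-up rows: one found, one missing, one ignored type.
rows = [
    {"object_id": "451", "link_type": "goitein_note", "link_target": "a.pdf"},
    {"object_id": "999", "link_type": "indexcard", "link_target": "6094"},
    {"object_id": "998", "link_type": "iiif", "link_target": "manifest"},
]
report = io.StringIO()
missing = write_not_found_report(rows, known_pgpids={451}, out=report)
```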

richmanrachel commented 2 years ago

@rlskoeser - It looks good! Do I need to worry about the ignored 6,301? Or does that just mean they're duplicates of the existing footnotes?

rlskoeser commented 2 years ago

@richmanrachel the ignored ones are the link types we're not including in this script because we're handling them elsewhere in a different way (e.g. transcriptions is one category of these) — so no, nothing to worry about, it's just there for reporting purposes.
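
For illustration, the first line of the summary could be produced by tallying one outcome per link with a `Counter`. The outcome keys (`imported`, `ignored`, `errored`) and the function name `summarize` are hypothetical.

```python
from collections import Counter


def summarize(stats):
    """Format the first line of the script summary from outcome counts."""
    return (
        f"Imported {stats['imported']:,} links; "
        f"ignored {stats['ignored']:,}; "
        f"failed to import {stats['errored']:,}."
    )


# One outcome per processed link (made-up sequence).
stats = Counter()
for outcome in ["imported", "ignored", "imported", "errored"]:
    stats[outcome] += 1

summary = summarize(stats)
```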

rlskoeser commented 2 years ago

@richmanrachel oh, please also confirm that you're able to edit the urls now without django flagging them as invalid

richmanrachel commented 2 years ago

It works! Closing :D