As a content editor, I want a one time import of the translator and editor information so I know which scholars have transcribed or translated a document. (first pass)

thatbudakguy commented 3 years ago

testing notes

The import script should now import editor and translator information from the metadata spreadsheet. Please note that as of yet (April 29) it does not handle all cases — there is more work needed, probably some combination of manual cleanup and/or adding logic for more variations.

check a variety of documents with editor information, and confirm that:

[x] a source record has been created with the author and title (if any) of the reference; year should be set if known
[ ] check a variety of source document types to see if they are recognized correctly from the citation and content: unpublished, book, article, dissertation
[x] a footnote has been created linking the source to the document
[x] footnote document relationship is set to edition
[ ] footnote includes page number, document number, and/or Goitein section mark (? numbers with Hebrew characters) in the location if specified in the editor field
[x] footnote includes notes in some cases (currently expects to handle things like: "with emendations by", "see attached", "transcription awaiting", "edited here", multiword parenthetical statements — but does not handle all cases!)
[x] documents with multiple editions (delimited by "; also ed.") get multiple footnotes
[x] should handle documents with multiple authors, and preserve author order
[ ] multiple references to the same source should be aggregated (you can use the source list and footnote count to check how well this is working; probably won't work in all cases)
[x] if the text includes "and trans." or "and transl." the document relation type should include translation
[ ] if there is a url, that url should be set on the source (but now I wonder if that url should go on the footnote... for a source with multiple urls, if that occurs, url will only be set from the first occurrence)
[x] language should be set on the source if specified (likely does not handle all cases)
[ ] editor information and footnote are displayed on the public document detail page (document 'view on site')
[x] PGP editions (indicated by no title) should result in sources with source type unpublished, and should have author name(s), no title, and year if specified; multiple PGP editions by the same author in the same year (or with no year) should result in multiple footnotes associated with the same source
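
A minimal sketch of two of the checks above: splitting an editor field into one citation per edition, and deciding the document relation. It is not the import script itself. The delimiter pattern, the function names, and the sample strings are assumptions for illustration.

```python
import re

# Assumption: editions are separated by the literal text "; also ed."
EDITION_DELIMITER = re.compile(r";\s*also\s+ed\.\s*", re.IGNORECASE)
# "and trans." / "and transl." signals that the source is also a translation.
ALSO_TRANSLATION = re.compile(r"\band\s+transl?\.", re.IGNORECASE)


def split_editions(editor_field):
    """Split an editor field into one citation string per edition."""
    # Drop the leading "Ed." marker, if present, before splitting.
    text = re.sub(r"^\s*ed\.\s*", "", editor_field, flags=re.IGNORECASE)
    return [part.strip() for part in EDITION_DELIMITER.split(text) if part.strip()]


def doc_relations(citation):
    """Return the document relation types implied by one citation."""
    relations = ["edition"]
    if ALSO_TRANSLATION.search(citation):
        relations.append("translation")
    return relations
```

Each string returned by `split_editions` would then become its own footnote, which is what the "multiple footnotes" check is looking for.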

check a variety of documents with translator information, and confirm that:

[x] source record and footnote are created similar to editor field
[x] document relation includes both edition and translation

Is your feature request related to a problem? Please describe.

The primary motivation for this is so that researchers can find documents that are translated/edited by "trustworthy" sources.

Another major motivation noted by Alan is to recognize the ongoing labor of the geniza team who edit the transcriptions; sometimes data workers mark themselves as editors but sometimes not. There is a risk that some researchers may search for Goitein's transcriptions under the assumption that his work will be closest to "the truth", without realizing that those transcriptions have themselves later been improved upon by generations of PGP work. We want to be able to surface the work of the PGP team as editors.

Describe the solution you'd like

The data import script should create Person records for each person listed as translator or editor for a Document, and add them as translator or editor for that Document as appropriate.

Additional context

One possible future improvement suggested by Alan was looking at the edit history for the Bitbucket transcriptions to automatically harvest additional data about who has edited the transcriptions, so that we have a fuller record of editorship for some Documents.

dev notes

editor:
- [x] add method to parse editor information
- [x] create or find creator record for edition author
- [x] create or find source record for edition
- [x] create footnote linking source to document with doc relation type set to Edition
- [x] if edition includes "and trans." also set doc relation type translation
- [x] rename footnote page_range to location and revise help text
- [x] put document number and page numbers in footnote location
- [x] put information about emendations in footnote notes
- [ ] put information about partial transcription in footnote notes
- ~pull content from transcription data file if present~ (moved to issue #146 )
translator:
- [x] add method to parse translator information
- [x] create or find creator record for translation author
- [x] create or find source record for translation
- [x] create footnote linking source to document with doc relation type set to translation
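
The "create or find" steps above can be sketched with plain in-memory objects. This is only an illustration of the lookup logic, not the project's actual models: `Source`, `Footnote`, `SourceRegistry`, and their fields are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Footnote:
    pgpid: int            # document the footnote is attached to
    doc_relation: list    # e.g. ["edition"] or ["edition", "translation"]
    location: str = ""    # page number, document number, etc.
    notes: str = ""


@dataclass
class Source:
    authors: tuple        # author names, in citation order
    title: str = ""       # empty for unpublished PGP editions
    year: Optional[int] = None
    source_type: str = "book"
    footnotes: list = field(default_factory=list)


class SourceRegistry:
    """Hypothetical stand-in for database find-or-create lookups."""

    def __init__(self):
        self._sources = {}

    def find_or_create(self, authors, title="", year=None):
        # Author order is preserved; title is matched case-insensitively.
        key = (tuple(authors), title.casefold(), year)
        if key not in self._sources:
            # No title means a PGP edition (unpublished); otherwise use
            # the fallback type "book".
            source_type = "book" if title else "unpublished"
            self._sources[key] = Source(tuple(authors), title, year, source_type)
        return self._sources[key]

    def add_footnote(self, source, pgpid, doc_relation, location="", notes=""):
        footnote = Footnote(pgpid, list(doc_relation), location, notes)
        source.footnotes.append(footnote)
        return footnote
```

With this shape, two PGP editions by the same author in the same year resolve to one source with two footnotes, matching the aggregation behavior described in the testing notes.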

revisions after first-round testing:

[x] remove edition from list view
[x] move volume before year in list view
[x] handle more variations with volume notation
[ ] handle two Weiss authors; per Alan: "Meir Weiss is only relevant for 29620. Some early 20th century guy. The rest are G./Gershon"
[x] case-insensitive match on title when looking for existing sources
[x] improve year matching to avoid picking up historic dates in titles or partial shelfmarks
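
One way the year matching could avoid historic dates and shelfmark fragments, sketched under stated assumptions: the plausible publication range (1850 to 2029), the rule that digits touching a period belong to a shelfmark, and the function name are all guesses for illustration, not the script's actual logic.

```python
import re

# Quoted titles are removed first so that dates inside a title are ignored.
QUOTED_TITLE = re.compile(r'"[^"]*"|“[^”]*”')
# Assumption: publication years fall between 1850 and 2029. The lookarounds
# skip numbers embedded in shelfmarks such as "2.75.30" or "13J20.1920".
YEAR = re.compile(r"(?<![\d.])(18[5-9]\d|19\d{2}|20[0-2]\d)(?!\d|\.\d)")


def publication_year(citation):
    """Return the first plausible publication year in a citation, or None."""
    unquoted = QUOTED_TITLE.sub("", citation)
    match = YEAR.search(unquoted)
    return int(match.group(1)) if match else None
```

A year followed by a sentence-ending period (as in "Alan Elbaum, 2020.") is still accepted, because the lookahead only rejects a period that is followed by another digit.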

richmanrachel commented 3 years ago

I had thought that we created a name ontology for current/previous contributors to the PGP. Unfortunately it doesn't yet exist (and I may need to make it this week). For now, I'll link the new site page that has everyone's names so we can match initials to full names: https://genizalab.princeton.edu/people/current-team

richmanrachel commented 3 years ago

Thanks @Stephanie for realizing I did remember correctly AND for finding the documentation! https://docs.google.com/spreadsheets/d/1MYTjlC7k3E0JY4HR7BZgdu5iKwW03wQbzvZQhhRllks/edit#gid=0

rlskoeser commented 3 years ago

@mrustow @richmanrachel I'm starting to work on parsing the editor information and have a number of questions. Thought I would go ahead and share the things I've figured out so far, even though I will likely have more.

There are a lot of questions here! LMK if we should break these out or create asana tasks to deal with any of them.

Is it ok to ignore entries with these values? If not, what should be done with them?
- awaiting transcription
- Transcription listed in FGP, awaiting digitization on PGP
- Source of transcription not noted in original PGP database.
- "yes"
Some entries do not start with "Ed. ", is there any significance to that? Examples:
- Kiraz, G. A. (2018). A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30. [Genizah Research Unit, Fragment of the Month, August 2018]. https://doi.org/10.17863/CAM.34049
- Ṣabīḥ ʿAodeh, "Eleventh Century Arabic Letters of Jewish Merchants from the Cairo Geniza" (PhD diss., Tel Aviv University, 1992), doc. 54
- Gil, Kingdom, vol. 4, #798, awaiting digitization on PGP
- Y. David, Divorce among the Jews, p. 161
- Y. David, The divorce among the Jews, 201-204
I remember Marina saying a while back that it is easy to determine the type of source based on the citation. I can identify dissertations easily by the text "diss", and they have a distinct and consistent citation I can parse; wondering if you can advise on easy ways to identify the other source types. e.g., is there a list of keywords I can check for unpublished materials? (unpublished, typed texts, ...?). Is there an easy way to identify articles? Does the data currently reference any blog/websites?
Speaking of dissertations: already found limitations with our current Source model for tracking scholarship records! Where should I record the degree granting institution? Should we add a publisher field?
There are a number of odd-looking entries that I'm not sure what to do with. I'm hoping that the demerge work will address some of them. Here's a sample, please advise:
- Partially ed. Goitein, typed texts. Join by Oded Zinger. Awaits full transcription.
- Notebook assembled by Dotan Arad, see Arad (2017) Welfare and Charity in a Sixteenth-Century Jewish Community in Egypt: A Study of Genizah Documents, Al-Masāq, 29:3, 258–72, no. 19.
- Letter ed. Alan Elbaum, 10/2020. Legal document awaiting transcription.
- Partially ed. Goitein, typed texts. Join by Oded Zinger. Awaits full transcription.
- ENA 2808.41a ed. Alan Elbaum, 2020.
- see Zinger. Ginzei Qedem 13
- A. L. Udovitch or Mark Cohen
What should we do with the entries that have google docs links?
A number of entries include language like "and trans." — does that indicate the source provides both an edition and a translation?
Is the doc # notation on many entries similar to a page number, i.e. providing where to find the transcription of this document? Examples:
- Doc. H-7, pp. 291-293.
- Ṣabīḥ ʿAodeh, "Eleventh Century Arabic Letters of Jewish Merchants from the Cairo Geniza" (PhD diss., Tel Aviv University, 1992), doc. 54
Another source data model question: do you ever include page range on article citations, or do we only need pages for the footnote (to identify where in the source the document is edited/transcribed/discussed)?

mrustow commented 3 years ago

1. Is it ok to ignore entries with these values? If not, what should be done with them?
- awaiting transcription - OK to ignore
- Transcription listed in FGP, awaiting digitization on PGP - OK to ignore
- Source of transcription not noted in original PGP database. - not sure yet, digging into these now
- "yes" - OK to ignore
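
For reference, a small sketch of how the "OK to ignore" values could be filtered out before parsing. The normalization (trimming quotes and a trailing period, ignoring case) and the name `should_skip` are assumptions. The undecided "Source of transcription not noted" value is deliberately not skipped.

```python
# Values confirmed as safe to ignore, stored in normalized (casefolded) form.
IGNORED_VALUES = {
    "awaiting transcription",
    "transcription listed in fgp, awaiting digitization on pgp",
    "yes",
}


def should_skip(editor_field):
    """Return True if the editor field holds only an ignorable placeholder."""
    normalized = editor_field.strip().strip('"').rstrip(".").casefold()
    return normalized in IGNORED_VALUES
```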

mrustow commented 3 years ago

"Source of transcription not noted in original PGP database": this can be a task for Asana (good for Abigail). I'll write it up now. If all goes well with said task, this notation will disappear.

mrustow commented 3 years ago

can you write me an Asana task to finish the rest of these? I can't access the CDH project on Asana for some reason (I think I might need to merge two accounts? not sure)

mrustow commented 3 years ago

Some entries do not start with "Ed. ", is there any significance to that?

They appear to have been notations about texts that were edited but not in PGP. Theoretically this information shouldn't have appeared in this field. I fixed some, and reached out to Alan to fix the rest and move the info to the notes2 field for later harvesting.

mrustow commented 3 years ago

Unpublished materials:

Goitein, typed texts
Name only, no title. Names include:
- Mark R. Cohen or MRC
- A. L. Udovitch or ALU
- Alan Elbaum
- Arnold Franklin or AEF

Maybe others?

Articles: these should all have quotation marks around the titles.

No blogs/websites that I know of.

mrustow commented 3 years ago

Speaking of dissertations: already found limitations with our current Source model for tracking scholarship records! Where should I record the degree granting institution?

Do we need to record it at all? We're posting most if not all dissertations we cite to the PGL bibliography page, so maybe not necessary

Should we add a publisher field?

My first instinct: Nah. Make people look it up themselves. My second thought: depends what the field is for. Do we want to provide people with ready-made citations, or help them track down the scholarship? If the latter, no publisher needed. If the former, publisher needed for maybe 75% of academic presses (and that proportion is decreasing).

mrustow commented 3 years ago

There are a number of odd-looking entries that I'm not sure what to do with. I'm hoping that the demerge work will address some of them. Here's a sample, please advise:

Partially ed. Goitein, typed texts. Join by Oded Zinger. Awaits full transcription.
Notebook assembled by Dotan Arad, see Arad (2017) Welfare and Charity in a Sixteenth-Century Jewish Community in Egypt: A Study of Genizah Documents, Al-Masāq, 29:3, 258–72, no. 19.
Letter ed. Alan Elbaum, 10/2020. Legal document awaiting transcription.
Partially ed. Goitein, typed texts. Join by Oded Zinger. Awaits full transcription.

These were idiosyncratic uses or formulations of the "Edited by" field. I've written to Alan about them. The joins info belongs in the description field. And oh look, that Arad article should be in quotation marks hmmmm

ENA 2808.41a ed. Alan Elbaum, 2020.

Ditto. Presumably Alan did this to differentiate the document under that shelfmark that he edited from the one he didn't, but shouldn't he have done that by making separate entries?

see Zinger. Ginzei Qedem 13

Possibly a reference to an article (no title though, ouch) in which an edition appeared, but not an edition that is actually in PGP (idiosyncratic).

A. L. Udovitch or Mark Cohen

This sounds like me. Checked and fixed (it no longer says that).

mrustow commented 3 years ago

What should we do with the entries that have google docs links?

That is an excellent question. Those are editions that are finished but sitting in the pipeline and not yet displayed on PGP. I can't remember where we stand on the transcription bottleneck question.

mrustow commented 3 years ago

A number of entries include language like "and trans." Does that indicate the source provides both an edition and a translation?

Yes.

mrustow commented 3 years ago

Is the doc # notation on many entries similar to a page number, i.e. providing where to find the transcription of this document? Examples:

Doc. H-7, pp. 291-293.

Ṣabīḥ ʿAodeh, "Eleventh Century Arabic Letters of Jewish Merchants from the Cairo Geniza" (PhD diss., Tel Aviv University, 1992), doc. 54

Yes. I adopted the rule that when a set of published editions assigns document numbers, we should use them instead of page numbers, because people tend to remember doc numbers better than page numbers.
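
A rough sketch of pulling the footnote location (document number and/or page numbers) out of a citation, using the examples quoted in this thread. The pattern and the function name are assumptions. It does not handle bare page ranges without "p." (such as "201-204") or Goitein section marks with Hebrew characters.

```python
import re

# Location markers seen in the examples: "doc.", "no.", "p."/"pp.", and "#".
LOCATION_PART = re.compile(
    r"(?:\bdocs?\.|\bnos?\.|\bpp?\.|#)\s*[\w-]+(?:\s*[-–]\s*\d+)?",
    re.IGNORECASE,
)


def footnote_location(citation):
    """Join every location marker found in the citation, in original order."""
    parts = LOCATION_PART.findall(citation)
    return ", ".join(part.strip() for part in parts)
```

Keeping the document number ahead of the page range, as it appears in the citation, is in line with the preference for document numbers described above.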

rlskoeser commented 3 years ago

@mrustow thank you for all these amazingly helpful answers.

Small follow-up detail question: should the source type for Goitein's India Book 6 (unpublished) be Book or Unpublished?

mrustow commented 3 years ago

Unpublished. Thanks for catching that!

rlskoeser commented 3 years ago

I've started looking at the translator field and have some questions about some of the things I'm finding.

39 entries include a google docs / google drive link. Do you have plans for making these available?
A handful of entries specify the language the text is translated into ("Trans. into Hebrew by Zinger", "Trans. into English, Cohen.", "Trans. Werner Diem (into German)".). Two questions:
- If the language is not specified, can we assume English?
- Is the language of the translation the same as the language of the source text it comes from?
Some records specify partial translation; another is listed as an annotated translation. Should I preserve that information in a note on the footnote documenting the translation?
several entries include "Translation awaiting digitization on PGP." or some variation of that language; what do we do about that?
One odd entry: Rustow, PGP NEED TO DO : [hādha a]l-sijill al-manshūr fī al-aʿmā[l]

Let me know if lists or examples of any of these would be helpful.
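
For illustration, the three "into [language]" variants quoted above can all be picked up with one pattern. This is a sketch only: the function name is hypothetical, and it returns `None` rather than assuming English, since whether English is a safe default is one of the open questions.

```python
import re

# Matches "into Hebrew", "into English", "(into German)", etc.
INTO_LANGUAGE = re.compile(r"\binto\s+([A-Z][a-z]+)")


def translation_language(translator_field):
    """Return the language named after "into", or None if unspecified."""
    match = INTO_LANGUAGE.search(translator_field)
    return match.group(1) if match else None
```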

rlskoeser commented 3 years ago

@mrustow @richmanrachel I posted my preliminary list of questions about the translator field but forgot to tag anyone! 🤦‍♀️


rlskoeser commented 3 years ago

Information and decisions from discussion at 2021-04-15 meeting:

- edition metadata in the spreadsheet supersedes information in the TEI, which is likely out of date, except for records with "Source of original transcription not noted in PGP database", which may be documented in TEI
- best way for database to know if a transcription is available in PGP is for the text to be linked to the source record; for now, we'll add a url field and should store google links in this field; we'll also provisionally integrate the transcription text from the TEI (may be able to adapt from prototype work)
  - this may help us identify mismatches between TEI documents and PGPIDs, which we'll need to sort out anyway for the HTR project, so better to identify sooner anyway
- everything in the translator column is also an edition (the separate column is a historical artifact based on the way data was added)
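
A minimal sketch of separating links from the citation text so they can be stored in a url field. The function name and return shape are assumptions; the sample citation is the Kiraz entry quoted earlier in this thread.

```python
import re

URL = re.compile(r"https?://\S+")


def split_urls(text):
    """Return (citation without urls, list of urls in order of appearance)."""
    # Trim punctuation that commonly trails a link at the end of a sentence.
    urls = [url.rstrip(".,;)") for url in URL.findall(text)]
    remainder = URL.sub("", text).strip()
    return remainder, urls


citation = (
    "Kiraz, G. A. (2018). A Young Syriac Pupil in the Cairo Genizah: "
    "Or.1081 2.75.30. [Genizah Research Unit, Fragment of the Month, "
    "August 2018]. https://doi.org/10.17863/CAM.34049"
)
remainder, urls = split_urls(citation)
```

Returning every link, rather than only the first, leaves room to decide later whether a url belongs on the source or on the footnote.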

We may want to think about a filter or feed to make it easy to find transcriptions with/without digital edition, similar to the needs review feature; perhaps it could serve as a way to help manage the queue for transcription digitization work.

rlskoeser commented 3 years ago

Increasing from 5 points to 8 due to complexity

rlskoeser commented 3 years ago

@richmanrachel @mrustow this one is ready for first round of testing! There is still work to be done on this, but would be great to get your feedback on how it's working so far. I put a testing check list in the issue description, but it may not cover everything.

The things I know that aren't handled yet:

partial transcriptions
translation into [language] (currently we only have language on source; should we also add to footnote? can translation language be different from the source?)
identifying content that should go into notes is probably not working in all cases

I also have a document with records that need cleanup or things I have questions about how to handle: https://docs.google.com/spreadsheets/d/1v8uaX9a4cxXVW5oFUHOr9lVP4JFU_ej8voJtf_d3CkQ/edit#gid=0

Please go ahead and note other oddities you notice when you test, because I'm sure I'm not aware of all of them.

richmanrachel commented 3 years ago

I'm not sure how well I can test this until we clean up some of the data. For example, the case sensitive search is creating problems like this:

richmanrachel commented 3 years ago

I love that clicking on the number of footnotes lands me on this page:

richmanrachel commented 3 years ago

Could we possibly switch the placement of the Volumes column and the Year? We have lots of multi-volume works, and so it's currently a bit confusing:

richmanrachel commented 3 years ago

check a variety of source document types to see if they are recognized correctly from the citation and content: unpublished, book, article, dissertation

Overall this seems to be working, but there are more types being put into "Books" (probably because of irregular data entry).

rlskoeser commented 3 years ago

Thanks for the testing and feedback so far @richmanrachel — agree you can't fully test it yet, but anything you can help identify now to help clean it is valuable.

Could we possibly switch the placement of the Volumes column and the Year? We have lots of multi-volume works, and so it's currently a bit confusing:

Absolutely. What about edition? I wasn't sure what information (if any) in current citations would go there, so I'm currently not populating it. Should we keep it on the list view?

Yes, "Books" is currently the fallback type if no other type is recognized. Do you have examples of specific types I'm not identifying correctly? Are they types we included?

Good call on the case-sensitive mismatching... I wonder if we might be able to use OpenRefine title clustering to help with this...

thatbudakguy commented 3 years ago

clustering would also get other examples like Arabic Legal and Administrative Documents/Arabic Administrative and Legal Documents, I think
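
To make the clustering idea concrete, here is a rough approximation of a fingerprint-style key in the spirit of OpenRefine's key collision method: lowercase, strip accents and punctuation, then sort the unique tokens. It is a sketch, not OpenRefine's exact algorithm, and the function names are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(title):
    """Build an order-insensitive, case-insensitive key for a title."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and replace punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", stripped.casefold())
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster_titles(titles):
    """Group distinct titles that share the same fingerprint."""
    clusters = defaultdict(list)
    for title in titles:
        clusters[fingerprint(title)].append(title)
    return [group for group in clusters.values() if len(set(group)) > 1]
```

Both variants of the "Arabic Legal and Administrative Documents" title reduce to the same key, so they would land in one cluster for review.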

richmanrachel commented 3 years ago

@rlskoeser

What about edition? I wasn't sure what information (if any) in current citations would go there, so I'm currently not populating it. Should we keep it on the list view?

I don't think we have many books with editions yet... @mrustow agrees that we can drop Edition as a category.

Do you have examples of specific types I'm not identifying correctly? Are they types we included?

Yes, Weiss "Halfon" is a dissertation; Alan hasn't written any books (that I'm aware of!); Goitein "Tarbiz" refers to a journal article, etc. Should I set this as a data cleaning task?

richmanrachel commented 3 years ago

footnote includes page number, document number, and/or Goitein section mark (? numbers with Hebrew characters) in the location if specified in the editor field

Unclear if all of it is uploading properly. For example, PGP ID 6121: The description and footnote information don't completely match.

richmanrachel commented 3 years ago

multiple references to the same source should be aggregated (you can use the source list and footnote count to check how well this is working; probably won't work in all cases)

Definitely a lot of repetition of sources still. Over 54 cases of

richmanrachel commented 3 years ago

editor information and footnote are displayed on the public document detail page (document 'view on site')

It shows up but doesn't contain the edition/translation/discussion information

[Will resume testing later. Haven't checked the URL one yet and will continue from there]

rlskoeser commented 3 years ago

Yes, Weiss "Halfon" is a dissertation; Alan hasn't written any books (that I'm aware of!); Goitein "Tarbiz" refers to a journal article, etc. Should I set this as a data cleaning task?

@richmanrachel yes, I think data cleaning makes sense for these. I'm looking for key term "diss" to recognize dissertations, as that seemed to be used elsewhere and not occur in any non-dissertation records. For articles, I'm looking for quotes (single or double) around the title — I see some of the Tarbiz references don't have that format.
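
The detection rules described here and earlier in the thread can be sketched roughly as follows. This is an approximation for discussion, not the import script: the function name, the order of the checks, and the exact keywords are assumptions, and the "name only, no title" rule for unpublished PGP editions is left out because it depends on title parsing.

```python
import re

# A title wrapped in double or single quotes (straight or curly). The
# lookarounds keep apostrophes inside words from counting as quotes.
QUOTED_TITLE = re.compile(r"""["“][^"”]+["”]|(?<!\w)['‘][^'’]+['’](?!\w)""")


def source_type(citation):
    """Guess the source type from keywords and formatting in a citation."""
    lowered = citation.casefold()
    # Checked first: dissertation citations also have quoted titles.
    if "diss" in lowered:
        return "dissertation"
    # Assumed keywords for unpublished material.
    if "typed texts" in lowered or "unpublished" in lowered:
        return "unpublished"
    if QUOTED_TITLE.search(citation):
        return "article"
    # Fallback when nothing else is recognized.
    return "book"
```

Citations such as the Tarbiz references that lack quotation marks would fall through to "book" here, which is why they need data cleanup rather than more parsing logic.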

If we're doing another round of data cleaning, I'd like to request that partial transcriptions be reformatted so that the partial information can be picked up as a note, similar to other notes — occurring after the citation, but we could discuss exact format and I can test to make sure I can pick it up before they're changed. (There are only 34 of these)

rlskoeser commented 3 years ago

It shows up but doesn't contain the edition/translation/discussion information

Good point, @richmanrachel ! This is my interpretation of @gissoo's design — I was thinking the full details of edition/translation/discussion information would go on the scholarship records page, which we haven't implemented yet.

Technically, should we only show editor information on this page for a transcription we have on the PGP site? (I didn't worry about that because we don't have that information yet, but would be good to know if I'm thinking about it correctly).

rlskoeser commented 3 years ago

@richmanrachel @thatbudakguy I created quick CSV exports of sources and footnotes currently in the test database.

geniza-source-import.csv geniza-footnote-import.csv

richmanrachel commented 3 years ago

if there is a url, that url should be set on the source (but now I wonder if that url should go on the footnote... for a source with multiple urls, if that occurs, url will only be set from the first occurrence)

I agree we need to do something differently here, because right now the link just totally disappears.

richmanrachel commented 3 years ago

@rlskoeser or @thatbudakguy - what do you mean by "source record and footnote are created similar to editor field"?

rlskoeser commented 3 years ago

@richmanrachel 😆 oh no, I have no idea! wow, that is unclear

richmanrachel commented 3 years ago

@rlskoeser - haha, no worries. I'm also not sure how to test "document relation includes both edition and translation"

But that finishes my first pass through... 3 or 4 hours later?!

rlskoeser commented 3 years ago

@richmanrachel yeah, this is a rough one — I've spent a good chunk of the day trying to refine it and making a document with some of the cleanup that needs to be done!

Oh! I see now — the last two items pertain to the translator field, but the formatting in the note is definitely NOT making that clear (revised to try to make it better). With that context, do the instructions make more sense? Information from the translator column should be imported similar to editor, and entries from the translator column should have both edition and translation document relation set.

richmanrachel commented 3 years ago

@rlskoeser - Great, that does work!

rlskoeser commented 3 years ago

Have made some improvements from initial testing and feedback, and ran a fresh data import on the test site. Attaching csv exports with the source and footnote data from this latest run.

geniza-footnote-import.csv geniza-source-import.csv

rlskoeser commented 3 years ago

Closing this per conversation with @richmanrachel as a first pass. Will create new issues to track follow up work as needed, depending on possible change to approach using this bibliographic dataset for published content up to 2016 https://www.repository.cam.ac.uk/handle/1810/256117

rlskoeser commented 3 years ago

🤦‍♀️ said I was closing this but did not actually close. I'll go ahead and close now, and we can create the new issues when we figure out the appropriate next steps.

Princeton-CDH / geniza

As a content editor, I want a one time import of the translator and editor information so I know which scholars have transcribed or translated a document. (first pass) #108
