Princeton-CDH / geniza

version 4.x of the Princeton Geniza Project
https://geniza.princeton.edu
Apache License 2.0

As a content editor, I want a one time import of the translator and editor information so I know which scholars have transcribed or translated a document. (first pass) #108

Closed thatbudakguy closed 3 years ago

thatbudakguy commented 3 years ago

testing notes

The import script should now import editor and translator information from the metadata spreadsheet. Please note that as of yet (April 29) it does not handle all cases — there is more work needed, probably some combination of manual cleanup and/or adding logic for more variations.

check a variety of documents with editor information, and confirm that:

check a variety of documents with translator information, and confirm that:


Is your feature request related to a problem? Please describe.

The primary motivation for this is so that researchers can find documents that are translated/edited by "trustworthy" sources.

Another major motivation noted by Alan is to recognize the ongoing labor of the geniza team who edit the transcriptions; sometimes data workers mark themselves as editors but sometimes not. There is a risk that some researchers may search for Goitein's transcriptions under the assumption that his work will be closest to "the truth", without realizing that those transcriptions have themselves later been improved upon by generations of PGP work. We want to be able to surface the work of the PGP team as editors.

Describe the solution you'd like

The data import script should create Person records for each person listed as translator or editor for a Document, and add them as translator or editor for that Document as appropriate.

Additional context

One possible future improvement suggested by Alan was looking at the edit history for the Bitbucket transcriptions to automatically harvest additional data about who has edited the transcriptions, so that we have a fuller record of editorship for some Documents.

dev notes

revisions after first-round testing:

richmanrachel commented 3 years ago

I had thought that we created a name ontology for current/previous contributors to the PGP. Unfortunately it doesn't yet exist (and I may need to make it this week). For now, I'll link the new site page that has everyone's names so we can match initials to full names: https://genizalab.princeton.edu/people/current-team

richmanrachel commented 3 years ago

Thanks @Stephanie for realizing I did remember correctly AND for finding the documentation! https://docs.google.com/spreadsheets/d/1MYTjlC7k3E0JY4HR7BZgdu5iKwW03wQbzvZQhhRllks/edit#gid=0

rlskoeser commented 3 years ago

@mrustow @richmanrachel I'm starting to work on parsing the editor information and have a number of questions. Thought I would go ahead and share the things I've figured out so far, even though I will likely have more.

There are a lot of questions here! LMK if we should break these out or create asana tasks to deal with any of them.

  1. Is it ok to ignore entries with these values? If not, what should be done with them?
    • awaiting transcription
    • Transcription listed in FGP, awaiting digitization on PGP
    • Source of transcription not noted in original PGP database.
    • "yes"
  2. Some entries do not start with "Ed. ", is there any significance to that? Examples:
    • Kiraz, G. A. (2018). A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30. [Genizah Research Unit, Fragment of the Month, August 2018]. https://doi.org/10.17863/CAM.34049
    • Ṣabīḥ ʿAodeh, "Eleventh Century Arabic Letters of Jewish Merchants from the Cairo Geniza" (PhD diss., Tel Aviv University, 1992), doc. 54
    • Gil, Kingdom, vol. 4, #798, awaiting digitization on PGP
    • Y. David, Divorce among the Jews, p. 161
    • Y. David, The divorce among the Jews, 201-204
  3. I remember Marina saying a while back that it is easy to determine the type of source based on the citation. I can identify dissertations easily by the text "diss", and they have a distinct and consistent citation I can parse; wondering if you can advise on easy ways to identify the other source types. e.g., is there a list of keywords I can check for unpublished materials? (unpublished, typed texts, ...?). Is there an easy way to identify articles? Does the data currently reference any blog/websites?
  4. Speaking of dissertations: already found limitations with our current Source model for tracking scholarship records! Where should I record the degree granting institution? Should we add a publisher field?
  5. There are a number of odd-looking entries that I'm not sure what to do with. I'm hoping that the demerge work will address some of them. Here's a sample, please advise:
    • Partially ed. Goitein, typed texts. Join by Oded Zinger. Awaits full transcription.
    • Notebook assembled by Dotan Arad, see Arad (2017) Welfare and Charity in a Sixteenth-Century Jewish Community in Egypt: A Study of Genizah Documents, Al-Masāq, 29:3, 258–72, no. 19.
    • Letter ed. Alan Elbaum, 10/2020. Legal document awaiting transcription.
    • Partially ed. Goitein, typed texts. Join by Oded Zinger. Awaits full transcription.
    • ENA 2808.41a ed. Alan Elbaum, 2020.
    • see Zinger. Ginzei Qedem 13
    • A. L. Udovitch or Mark Cohen
  6. What should we do with the entries that have google docs links?
  7. A number of entries include language like "and trans." — does that indicate the source provides both an edition and a translation?
  8. Is the doc # notation on many entries similar to a page number, i.e. providing where to find the transcription of this document? Examples:
    • Doc. H-7, pp. 291-293.
    • Ṣabīḥ ʿAodeh, "Eleventh Century Arabic Letters of Jewish Merchants from the Cairo Geniza" (PhD diss., Tel Aviv University, 1992), doc. 54
  9. Another source data model question: do you ever include page range on article citations, or do we only need pages for the footnote (to identify where in the source the document is edited/transcribed/discussed)?

mrustow commented 3 years ago

1. Is it ok to ignore entries with these values? If not, what should be done with them?

  • awaiting transcription - OK to ignore
  • Transcription listed in FGP, awaiting digitization on PGP - OK to ignore
  • Source of transcription not noted in original PGP database. - not sure yet, digging into these now
  • "yes" - OK to ignore
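For reference, a minimal sketch of how the import could skip the placeholder values confirmed above. The names `IGNORABLE_EDITOR_VALUES` and `is_ignorable` are hypothetical and not taken from the import script, and treating an empty cell as ignorable is an assumption. The "Source of transcription not noted" value is deliberately left out because no decision has been made on it yet.

```python
# Placeholder values confirmed in this thread as OK to ignore.
# Hypothetical names; not the import script's own.
IGNORABLE_EDITOR_VALUES = {
    "awaiting transcription",
    "transcription listed in fgp, awaiting digitization on pgp",
    "yes",
}


def is_ignorable(editor_value: str) -> bool:
    """Return True for empty or placeholder-only editor entries."""
    # Normalize whitespace, surrounding double quotes, and case before comparing.
    normalized = editor_value.strip().strip('"').strip().lower()
    # Assumption: an empty cell is treated the same as a placeholder.
    return not normalized or normalized in IGNORABLE_EDITOR_VALUES
```

Anything that is not ignorable would then go on to citation parsing or be flagged for cleanup.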

mrustow commented 3 years ago

"Source of transcription not noted in original PGP database": this can be a task for Asana (good for Abigail). I'll write it up now. If all goes well with said task, this notation will disappear.

mrustow commented 3 years ago

can you write me an Asana task to finish the rest of these? I can't access the CDH project on Asana for some reason (I think I might need to merge two accounts? not sure)

mrustow commented 3 years ago

Some entries do not start with "Ed. ", is there any significance to that?

They appear to have been notations about texts that were edited but not in PGP. Theoretically this information shouldn't have appeared in this field. I fixed some, and reached out to Alan to fix the rest and move the info to the notes2 field for later harvesting.

mrustow commented 3 years ago

Unpublished materials:

  1. Goitein, typed texts
  2. Name only, no title. Names include:
    • Mark R. Cohen or MRC
    • A. L. Udovitch or ALU
    • Alan Elbaum
    • Arnold Franklin or AEF

Maybe others?

Articles: these should all have quotation marks around the titles.

No blogs/websites that I know of.

mrustow commented 3 years ago

Speaking of dissertations: already found limitations with our current Source model for tracking scholarship records! Where should I record the degree granting institution?

Do we need to record it at all? We're posting most if not all dissertations we cite to the PGL bibliography page, so maybe not necessary

Should we add a publisher field?

My first instinct: Nah. Make people look it up themselves. My second thought: depends what the field is for. Do we want to provide people with ready-made citations, or help them track down the scholarship? If the latter, no publisher needed. If the former, publisher needed for maybe 75% of academic presses (and that proportion is decreasing).

mrustow commented 3 years ago

There are a number of odd-looking entries that I'm not sure what to do with. I'm hoping that the demerge work will address some of them. Here's a sample, please advise:

These were idiosyncratic uses or formulations of the "Edited by" field. I've written to Alan about them. The joins info belongs in the description field. And oh look, that Arad article should be in quotation marks hmmmm

mrustow commented 3 years ago

What should we do with the entries that have google docs links?

That is an excellent question. Those are editions that are finished but sitting in the pipeline and not yet displayed on PGP. I can't remember where we stand on the transcription bottleneck question.

mrustow commented 3 years ago

A number of entries include language like "and trans." — does that indicate the source provides both an edition and a translation?

Yes.

mrustow commented 3 years ago

Is the doc # notation on many entries similar to a page number, i.e. providing where to find the transcription of this document? Examples:

  • Doc. H-7, pp. 291-293.
  • Ṣabīḥ ʿAodeh, "Eleventh Century Arabic Letters of Jewish Merchants from the Cairo Geniza" (PhD diss., Tel Aviv University, 1992), doc. 54

Yes. I adopted the rule that when a set of published editions assigns document numbers, we should use them instead of page numbers, because people tend to remember doc numbers better than page numbers.
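A rough sketch of how that rule might translate into parsing the footnote location: look for a document number first, and only fall back to pages when none is given. The patterns and the `footnote_location` helper are hypothetical, and they cover only the formats quoted in this thread; the real data has more variations.

```python
import re

# Hypothetical patterns covering only the examples quoted in this thread.
DOC_NUMBER = re.compile(r"\bdoc\.\s*(?P<doc>[\w-]+)", re.IGNORECASE)
PAGES = re.compile(r"\bpp?\.\s*\d+(?:\s*[-–]\s*\d+)?")


def footnote_location(citation: str) -> str:
    """Prefer a document number; fall back to pages; else an empty string."""
    doc = DOC_NUMBER.search(citation)
    if doc:
        return f"doc. {doc.group('doc')}"
    pages = PAGES.search(citation)
    if pages:
        # Keep the page reference exactly as written in the citation.
        return pages.group(0)
    return ""
```

With the examples above, `Doc. H-7, pp. 291-293.` yields `doc. H-7`, and a citation ending in `p. 161` yields `p. 161`.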

rlskoeser commented 3 years ago

@mrustow thank you for all these amazingly helpful answers.

Small follow-up detail question: should the source type for Goitein's India Book 6 (unpublished) be Book or Unpublished?

mrustow commented 3 years ago

Unpublished. Thanks for catching that!

rlskoeser commented 3 years ago

I've started looking at the translator field and have some questions about some of the things I'm finding.

Let me know if lists or examples of any of these would be helpful.

rlskoeser commented 3 years ago

@mrustow @richmanrachel I posted my preliminary list of questions about the translator field but forgot to tag anyone! 🤦‍♀️

I've started looking at the translator field and have some questions about some of the things I'm finding.

  • 39 entries include a google docs / google drive link. Do you have plans for how to make these available?
  • A handful of entries specify the language the text is translated into ("Trans. into Hebrew by Zinger", "Trans. into English, Cohen.", "Trans. Werner Diem (into German)"). Two questions:

    • If the language is not specified, can we assume English?
    • Is the language of the translation the same as the language of the source text it comes from?
  • Some records specify partial translation; another is listed as an annotated translation. Should I preserve that information in a note on the footnote documenting the translation?
  • several entries include "Translation awaiting digitization on PGP." or some variation of that language; what do we do about that?
  • One odd entry: Rustow, PGP NEED TO DO : [hādha a]l-sijill al-manshūr fī al-aʿmā[l]

Let me know if lists or examples of any of these would be helpful.
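To make the language question concrete, here is a small sketch of how the target language could be picked out of the three formats quoted above. The `translation_language` helper and its pattern are hypothetical. It returns `None` when no language is stated, because whether to assume English in that case is still an open question.

```python
import re
from typing import Optional

# Hypothetical pattern: "into" followed by a capitalized language name.
LANGUAGE = re.compile(r"\binto\s+(?P<language>[A-Z][a-z]+)")


def translation_language(translator_value: str) -> Optional[str]:
    """Return the stated target language, or None if none is given."""
    match = LANGUAGE.search(translator_value)
    return match.group("language") if match else None
```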

rlskoeser commented 3 years ago

Information and decisions from discussion at 2021-04-15 meeting:

We may want to think about a filter or feed to make it easy to find transcriptions with/without digital edition, similar to the needs review feature; perhaps it could serve as a way to help manage the queue for transcription digitization work.

rlskoeser commented 3 years ago

Increasing from 5 points to 8 due to complexity

rlskoeser commented 3 years ago

@richmanrachel @mrustow this one is ready for first round of testing! There is still work to be done on this, but would be great to get your feedback on how it's working so far. I put a testing check list in the issue description, but it may not cover everything.

The things I know that aren't handled yet:

I also have a document with records that need cleanup or things I have questions about how to handle: https://docs.google.com/spreadsheets/d/1v8uaX9a4cxXVW5oFUHOr9lVP4JFU_ej8voJtf_d3CkQ/edit#gid=0

Please go ahead and note other oddities you notice when you test, because I'm sure I'm not aware of all of them.

richmanrachel commented 3 years ago

I'm not sure how well I can test this until we clean up some of the data. For example, the case sensitive search is creating problems like this: image

richmanrachel commented 3 years ago

I love that clicking on the number of footnotes lands me on this page: image

richmanrachel commented 3 years ago

Could we possibly switch the placement of the Volumes column and the Year? We have lots of multi-volume works, and so it's currently a bit confusing.

richmanrachel commented 3 years ago

check a variety of source document types to see if they are recognized correctly from the citation and content: unpublished, book, article, dissertation

  • Overall this seems to be working, but there are more types being put into "Books" (probably because of irregular data entry).

rlskoeser commented 3 years ago

Thanks for the testing and feedback so far @richmanrachel — agree you can't fully test it yet, but anything you can help identify now to help clean it is valuable.

Could we possibly switch the placement of the Volumes column and the Year? We have lots of multi-volume works, and so it's currently a bit confusing:

Absolutely. What about edition? I wasn't sure what information (if any) in current citations would go there, so I'm currently not populating it. Should we keep it on the list view?

Yes, "Books" is currently the fallback type if no other type is recognized. Do you have examples of specific types I'm not identifying correctly? Are they types we included?

Good call on the case-sensitive mismatching... I wonder if we might be able to use OpenRefine title clustering to help with this...

thatbudakguy commented 3 years ago

clustering would also get other examples like Arabic Legal and Administrative Documents/Arabic Administrative and Legal Documents, I think
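As a quick illustration of why clustering would catch both kinds of mismatch, here is a sketch of a key function loosely modeled on OpenRefine's fingerprint keying (lowercase, strip punctuation and accents, then sort and de-duplicate the words). It is a simplified approximation written for this comment, not OpenRefine's actual implementation, and `fingerprint` and `cluster_titles` are hypothetical names.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(title: str) -> str:
    """Build a case-, punctuation-, and word-order-insensitive key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and remove ASCII punctuation.
    cleaned = without_marks.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    # Sort and de-duplicate the words so word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster_titles(titles):
    """Group titles sharing a fingerprint; keep groups with real variants."""
    clusters = defaultdict(list)
    for title in titles:
        clusters[fingerprint(title)].append(title)
    return [group for group in clusters.values() if len(set(group)) > 1]
```

Both "Arabic Legal and Administrative Documents" and "Arabic Administrative and Legal Documents" reduce to the same key, as do titles that differ only in capitalization. Each cluster would still need a human to pick the canonical form.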

richmanrachel commented 3 years ago

@rlskoeser

What about edition? I wasn't sure what information (if any) in current citations would go there, so I'm currently not populating it. Should we keep it on the list view?

  • I don't think we have many books with editions yet... @mrustow agrees that we can drop Edition as a category.

    Do you have examples of specific types I'm not identifying correctly? Are they types we included?

  • Yes, Weiss "Halfon" is a dissertation; Alan hasn't written any books (that I'm aware of!); Goitein "Tarbiz" refers to a journal article, etc. Should I set this as a data cleaning task?

richmanrachel commented 3 years ago

footnote includes page number, document number, and/or Goitein section mark (? numbers with Hebrew characters) in the location if specified in the editor field

  • Unclear if all of it is uploading properly. For example, for PGP ID 6121 the description and footnote information don't completely match.

richmanrachel commented 3 years ago

multiple references to the same source should be aggregated (you can use the source list and footnote count to check how well this is working; probably won't work in all cases)

  • Definitely a lot of repetition of sources still. Over 54 cases of image

richmanrachel commented 3 years ago

editor information and footnote are displayed on the public document detail page (document 'view on site')

  • It shows up but doesn't contain the edition/translation/discussion information

[Will resume testing later. Haven't checked the URL one yet and will continue from there]

rlskoeser commented 3 years ago

  • Yes, Weiss "Halfon" is a dissertation; Alan hasn't written any books (that I'm aware of!); Goitein "Tarbiz" refers to a journal article, etc. Should I set this as a data cleaning task?

@richmanrachel yes, I think data cleaning makes sense for these. I'm looking for key term "diss" to recognize dissertations, as that seemed to be used elsewhere and not occur in any non-dissertation records. For articles, I'm looking for quotes (single or double) around the title — I see some of the Tarbiz references don't have that format.
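Roughly, the type detection described in this thread works like the sketch below. This is a simplified illustration rather than the actual import code: `guess_source_type` is a hypothetical name, the check order is an assumption, and the "typed texts" / "unpublished" keywords come from the earlier answers here. "Book" is the fallback when nothing else matches, which is why irregular entries end up there.

```python
import re

# A title wrapped in straight or curly quotes, single or double.
QUOTED_TITLE = re.compile("[\"'“‘][^\"'”’]{3,}[\"'”’]")


def guess_source_type(citation: str) -> str:
    """Guess a source type from keywords and formatting in a citation."""
    lowered = citation.lower()
    # "diss" has only been seen in dissertation records so far.
    if "diss" in lowered:
        return "Dissertation"
    if "typed texts" in lowered or "unpublished" in lowered:
        return "Unpublished"
    # Articles are expected to have quotation marks around the title.
    if QUOTED_TITLE.search(citation):
        return "Article"
    # Fallback when no other type is recognized.
    return "Book"
```

Note that the dissertation check has to run before the article check, since dissertation titles are quoted too. The single-quote match can also misfire on a citation containing two apostrophes, so results would need spot checking.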

If we're doing another round of data cleaning, I'd like to request that partial transcriptions be reformatted so that the partial information can be picked up as a note, similar to other notes — occurring after the citation, but we could discuss exact format and I can test to make sure I can pick it up before they're changed. (There are only 34 of these)

rlskoeser commented 3 years ago

  • It shows up but doesn't contain the edition/translation/discussion information

Good point, @richmanrachel ! This is my interpretation of @gissoo's design — I was thinking the full details of edition/translation/discussion information would go on the scholarship records page, which we haven't implemented yet.

Technically, should we only show editor information on this page for a transcription we have on the PGP site? (I didn't worry about that because we don't have that information yet, but would be good to know if I'm thinking about it correctly).

rlskoeser commented 3 years ago

@richmanrachel @thatbudakguy I created quick CSV exports of sources and footnotes currently in the test database.

geniza-source-import.csv geniza-footnote-import.csv

richmanrachel commented 3 years ago

if there is a url, that url should be set on the source (but now I wonder if that url should go on the footnote... for a source with multiple urls, if that occurs, url will only be set from the first occurrence)

  • I agree we need to do something differently here, because right now the link just totally disappears.

richmanrachel commented 3 years ago

@rlskoeser or @thatbudakguy - what do you mean by "source record and footnote are created similar to editor field"?

rlskoeser commented 3 years ago

@richmanrachel 😆 oh no, I have no idea! wow, that is unclear

richmanrachel commented 3 years ago

@rlskoeser - haha, no worries. I'm also not sure how to test "document relation includes both edition and translation"

But that finishes my first pass through... 3 or 4 hours later?!

rlskoeser commented 3 years ago

@richmanrachel yeah, this is a rough one — I've spent a good chunk of the day trying to refine it and making a document with some of the cleanup that needs to be done!

Oh! I see now — the last two items pertain to the translator field, but the formatting in the note is definitely NOT making that clear (revised to try to make it better). With that context, do the instructions make more sense? Information from the translator column should be imported similar to editor, and entries from the translator column should have both edition and translation document relation set.
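In case it helps to see the intended behavior spelled out, here is a sketch of the document relation logic as described in this thread. The `document_relations` helper, the column names, and the relation labels are hypothetical stand-ins, not the actual import code.

```python
import re

# Editor entries containing "and trans." provide both an edition and a translation.
AND_TRANS = re.compile(r"\band\s+trans\.", re.IGNORECASE)


def document_relations(column: str, value: str) -> set:
    """Return the document relations to set on the imported footnote."""
    if column == "translator":
        # Translator entries get both edition and translation.
        return {"edition", "translation"}
    if column == "editor":
        if AND_TRANS.search(value):
            return {"edition", "translation"}
        return {"edition"}
    raise ValueError(f"unexpected column: {column}")
```

So to test it, an entry from the translator column, or an editor entry with "and trans.", should show both relations on its footnote; a plain "Ed." entry should show only edition.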

richmanrachel commented 3 years ago

@rlskoeser - Great, that does work!

rlskoeser commented 3 years ago

Have made some improvements from initial testing and feedback, and ran a fresh data import on the test site. Attaching csv exports with the source and footnote data from this latest run.

geniza-footnote-import.csv geniza-source-import.csv

rlskoeser commented 3 years ago

Closing this per conversation with @richmanrachel as a first pass. Will create new issues to track follow-up work as needed, depending on whether we change approach and use this bibliographic dataset for published content up to 2016: https://www.repository.cam.ac.uk/handle/1810/256117

rlskoeser commented 3 years ago

🤦‍♀️ said I was closing this but did not actually close. I'll go ahead and close now, and we can create the new issues when we figure out the appropriate next steps.