Closed thatbudakguy closed 3 years ago
I had thought that we created a name ontology for current/previous contributors to the PGP. Unfortunately it doesn't yet exist (and I may need to make it this week). For now, I'll link the new site page that has everyone's names so we can match initials to full names: https://genizalab.princeton.edu/people/current-team
Thanks @Stephanie for realizing I did remember correctly AND for finding the documentation! https://docs.google.com/spreadsheets/d/1MYTjlC7k3E0JY4HR7BZgdu5iKwW03wQbzvZQhhRllks/edit#gid=0
@mrustow @richmanrachel I'm starting to work on parsing the editor information and have a number of questions. Thought I would go ahead and share the things I've figured out so far, even though I will likely have more.
There are a lot of questions here! LMK if we should break these out or create asana tasks to deal with any of them.
1. Is it ok to ignore entries with these values? If not, what should be done with them?
   - awaiting transcription - OK to ignore
   - Transcription listed in FGP, awaiting digitization on PGP - OK to ignore
   - Source of transcription not noted in original PGP database. - not sure yet, digging into these now
   - "yes" - OK to ignore
"Source of transcription not noted in original PGP database": this can be a task for Asana (good for Abigail). I'll write it up now. If all goes well with said task, this notation will disappear.
can you write me an Asana task to finish the rest of these? I can't access the CDH project on Asana for some reason (I think I might need to merge two accounts? not sure)
Some entries do not start with "Ed. ", is there any significance to that?
They appear to have been notations about texts that were edited but not in PGP. Theoretically this information shouldn't have appeared in this field. I fixed some, and reached out to Alan to fix the rest and move the info to the notes2 field for later harvesting.
Unpublished materials:
Maybe others?
Articles: these should all have quotation marks around the titles.
No blogs/websites that I know of.
Speaking of dissertations: already found limitations with our current Source model for tracking scholarship records! Where should I record the degree granting institution?
Do we need to record it at all? We're posting most if not all dissertations we cite to the PGL bibliography page, so maybe not necessary
Should we add a publisher field?
My first instinct: Nah. Make people look it up themselves. My second thought: depends what the field is for. Do we want to provide people with ready-made citations, or help them track down the scholarship? If the latter, no publisher needed. If the former, publisher needed for maybe 75% of academic presses (and that proportion is decreasing).
There are a number of odd-looking entries that I'm not sure what to do with. I'm hoping that the demerge work will address some of them. Here's a sample, please advise:
These were idiosyncratic uses or formulations of the "Edited by" field. I've written to Alan about them. The joins info belongs in the description field. And oh look, that Arad article should be in quotation marks hmmmm
- "ENA 2808.41a ed. Alan Elbaum, 2020.": Ditto. Presumably Alan did this to differentiate the document under that shelfmark that he edited from the one he didn't, but shouldn't he have done that by making separate entries?
- "see Zinger. Ginzei Qedem 13": Possibly a reference to an article (no title though, ouch) in which an edition appeared, but not an edition that is actually in PGP. Idiosyncratic.
- "A. L. Udovitch or Mark Cohen": This sounds like me. Will check. Checked & fixed (it no longer says that).
What should we do with the entries that have google docs links? That is an excellent question. Those are editions that are finished but sitting in the pipeline and not yet displayed on PGP. I can't remember where we stand on the transcription bottleneck question.
A number of entries include language like "and trans." — does that indicate the source provides both an edition and a translation? yes.
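A minimal sketch of how that rule could be applied when parsing the editor field, assuming the marker is always some spelling of "and trans." inside the entry. The function name `doc_relations` and the relation labels are hypothetical, not taken from the import script.

```python
from __future__ import annotations

import re

# Assumption: the marker appears as "and trans." (any capitalization, optional period).
AND_TRANS = re.compile(r"\band\s+trans\b\.?", re.IGNORECASE)


def doc_relations(editor_entry: str) -> set[str]:
    """Return the document relations implied by one editor-field entry."""
    # Every editor entry is treated as documenting an edition.
    relations = {"edition"}
    # "and trans." signals that the same source also provides a translation.
    if AND_TRANS.search(editor_entry):
        relations.add("translation")
    return relations
```

An entry without the marker yields only the edition relation, so nothing is inferred beyond what the entry states.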
Is the doc # notation on many entries similar to a page number, i.e. providing where to find the transcription of this document? Examples:
- Doc. H-7, pp. 291-293.
- Ṣabīḥ ʿAodeh, "Eleventh Century Arabic Letters of Jewish Merchants from the Cairo Geniza" (PhD diss., Tel Aviv University, 1992), doc. 54
yes. i adopted the rule that when a set of published editions assigns document numbers, we should use them instead of page numbers because people tend to remember doc numbers better than page numbers.
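A rough sketch of pulling that location information out of a citation, assuming document numbers always follow "doc." and page numbers follow "p." or "pp." as in the two examples above. The patterns and the function name `extract_location` are illustrative assumptions and will not cover every variation in the data.

```python
import re

# Assumption: document numbers look like "H-7" or "54" and follow "doc." / "Doc."
DOC_RE = re.compile(r"\bdoc\.\s*([A-Za-z]*-?\d+[A-Za-z]?)", re.IGNORECASE)
# Assumption: pages follow "p." or "pp." and may be a range such as 291-293.
PAGE_RE = re.compile(r"\b(pp?\.)\s*(\d+(?:\s*[-–]\s*\d+)?)", re.IGNORECASE)


def extract_location(citation: str) -> str:
    """Collect document number and page reference from a citation, if present."""
    parts = []
    doc = DOC_RE.search(citation)
    if doc:
        parts.append(f"doc. {doc.group(1)}")
    pages = PAGE_RE.search(citation)
    if pages:
        parts.append(f"{pages.group(1).lower()} {pages.group(2)}")
    # Document number comes first, since doc numbers are preferred over pages.
    return ", ".join(parts)
```

For the two examples above this gives "doc. H-7, pp. 291-293" and "doc. 54"; a citation with neither marker returns an empty string.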
@mrustow thank you for all these amazingly helpful answers.
Small follow-up detail question: should the source type for Goitein's India Book 6 (unpublished) be Book or Unpublished?
Unpublished. Thanks for catching that!
@mrustow @richmanrachel I posted my preliminary list of questions about the translator field but forgot to tag anyone! 🤦♀️
I've started looking at the translator field and have some questions about some of the things I'm finding.
- 39 entries include a google docs / google drive link. Do you have plans for how to make these available?
- A handful of entries specify the language the text is translated into ("Trans. into Hebrew by Zinger", "Trans. into English, Cohen.", "Trans. Werner Diem (into German)"). Two questions:
  - If the language is not specified, can we assume English?
  - Is the language of the translation the same as the language of the source text it comes from?
- Some records specify partial translation; another is listed as an annotated translation. Should I preserve that information in a note on the footnote documenting the translation?
- several entries include "Translation awaiting digitization on PGP." or some variation of that language; what do we do about that?
- One odd entry: Rustow, PGP NEED TO DO : [hādha a]l-sijill al-manshūr fī al-aʿmā[l]
Let me know if lists or examples of any of these would be helpful.
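For the entries that do name a target language, a minimal sketch of detecting it, assuming the language always follows the word "into" as in the three quoted examples. The function name `translation_language` is hypothetical. It returns `None` when no language is given and does not default to English, since that question is still open.

```python
import re
from typing import Optional

# Assumption: the language is a single capitalized word right after "into".
INTO_LANG = re.compile(r"\binto\s+([A-Z][a-z]+)")


def translation_language(entry: str) -> Optional[str]:
    """Return the stated translation language, or None if the entry has none."""
    match = INTO_LANG.search(entry)
    return match.group(1) if match else None
```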
Information and decisions from discussion at 2021-04-15 meeting:
We may want to think about a filter or feed to make it easy to find transcriptions with/without digital edition, similar to the needs review feature; perhaps it could serve as a way to help manage the queue for transcription digitization work.
Increasing from 5 points to 8 due to complexity
@richmanrachel @mrustow this one is ready for first round of testing! There is still work to be done on this, but would be great to get your feedback on how it's working so far. I put a testing check list in the issue description, but it may not cover everything.
The things I know that aren't handled yet:
I also have a document with records that need cleanup or things I have questions about how to handle: https://docs.google.com/spreadsheets/d/1v8uaX9a4cxXVW5oFUHOr9lVP4JFU_ej8voJtf_d3CkQ/edit#gid=0
Please go ahead and note other oddities you notice when you test, because I'm sure I'm not aware of all of them.
I'm not sure how well I can test this until we clean up some of the data. For example, the case sensitive search is creating problems like this:
I love that clicking on the number of footnotes lands me on this page:
Could we possibly switch the placement of the Volumes column and the Year? We have lots of multi-volume works, and so it's currently a bit confusing:
check a variety of source document types to see if they are recognized correctly from the citation and content: unpublished, book, article, dissertation
- Overall this seems to be working, but there are more types being put into "Books" (probably because of irregular data entry).
Thanks for the testing and feedback so far @richmanrachel — agree you can't fully test it yet, but anything you can help identify now to help clean it is valuable.
Could we possibly switch the placement of the Volumes column and the Year? We have lots of multi-volume works, and so it's currently a bit confusing:
Absolutely. What about edition? I wasn't sure what information (if any) in current citations would go there, so I'm currently not populating it. Should we keep it on the list view?
Yes, "Books" is currently the fallback type if no other type is recognized. Do you have examples of specific types I'm not identifying correctly? Are they types we included?
Good call on the case-sensitive mismatching... I wonder if we might be able to use OpenRefine title clustering to help with this...
clustering would also get other examples like "Arabic Legal and Administrative Documents" / "Arabic Administrative and Legal Documents", I think
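A rough sketch of what fingerprint-style clustering could look like outside OpenRefine, in the spirit of its key-collision method: lowercase, strip accents and punctuation, then sort the unique words so that case and word order no longer matter. The function names `fingerprint` and `cluster_titles` are illustrative, and this is an approximation rather than OpenRefine's exact algorithm.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(title: str) -> str:
    """Build an order- and case-insensitive key for a title."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and remove punctuation, keeping letters, digits and spaces.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Sort the unique tokens so word order does not affect the key.
    return " ".join(sorted(set(text.split())))


def cluster_titles(titles):
    """Group titles sharing a fingerprint; return groups with more than one spelling."""
    clusters = defaultdict(list)
    for title in titles:
        clusters[fingerprint(title)].append(title)
    return [group for group in clusters.values() if len(set(group)) > 1]
```

Both spellings of the title above land in the same cluster, along with any variant that differs only in capitalization. Titles that differ by a typo would still need a fuzzier method.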
@rlskoeser
What about edition? I wasn't sure what information (if any) in current citations would go there, so I'm currently not populating it. Should we keep it on the list view?
I don't think we have many books with editions yet... @mrustow agrees that we can drop Edition as a category.
Do you have examples of specific types I'm not identifying correctly? Are they types we included?
- Yes, Weiss "Halfon" is a dissertation; Alan hasn't written any books (that I'm aware of!); Goitein "Tarbiz" refers to a journal article, etc. Should I set this as a data cleaning task?
footnote includes page number, document number, and/or Goitein section mark (? numbers with Hebrew characters) in the location if specified in the editor field
- Unclear if all of it is uploading properly. For example, PGP ID 6121: The description and footnote information don't completely match.
multiple references to the same source should be aggregated (you can use the source list and footnote count to check how well this is working; probably won't work in all cases)
- Definitely a lot of repetition of sources still. Over 54 cases of
editor information and footnote are displayed on the public document detail page (document 'view on site')
- It shows up but doesn't contain the edition/translation/discussion information
[Will resume testing later. Haven't checked the URL one yet and will continue from there]
- Yes, Weiss "Halfon" is a dissertation; Alan hasn't written any books (that I'm aware of!); Goitein "Tarbiz" refers to a journal article, etc. Should I set this as a data cleaning task?
@richmanrachel yes, I think data cleaning makes sense for these. I'm looking for key term "diss" to recognize dissertations, as that seemed to be used elsewhere and not occur in any non-dissertation records. For articles, I'm looking for quotes (single or double) around the title — I see some of the Tarbiz references don't have that format.
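The heuristic described here can be sketched as follows, assuming the three rules stated in this thread: the key term "diss" marks a dissertation, a quoted title marks an article, and "Book" is the fallback. The function name `guess_source_type`, the type labels, and the exact quote characters checked are assumptions, not the import script's actual code.

```python
import re

# Assumption: titles may use straight or curly, single or double quotes.
OPEN_QUOTES = "\"'“‘"
CLOSE_QUOTES = "\"'”’"
QUOTED_TITLE = re.compile(
    f"[{OPEN_QUOTES}][^{OPEN_QUOTES}{CLOSE_QUOTES}]{{2,}}[{CLOSE_QUOTES}]"
)
# Key term "diss" as a standalone word, e.g. "PhD diss."
DISS = re.compile(r"\bdiss\b", re.IGNORECASE)


def guess_source_type(citation: str) -> str:
    """Guess a source type from the citation text alone."""
    # Check dissertations first: their titles are usually quoted too.
    if DISS.search(citation):
        return "Dissertation"
    if QUOTED_TITLE.search(citation):
        return "Article"
    # Fallback when nothing else is recognized.
    return "Book"
```

Because the article test relies only on quotation marks, an article cited without them falls through to "Book", and two apostrophes in one citation could be misread as a quoted title. Both cases are reasons to clean the data rather than add more rules.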
If we're doing another round of data cleaning, I'd like to request that partial transcriptions be reformatted so that the partial information can be picked up as a note, similar to other notes — occurring after the citation, but we could discuss exact format and I can test to make sure I can pick it up before they're changed. (There are only 34 of these)
- It shows up but doesn't contain the edition/translation/discussion information
Good point, @richmanrachel ! This is my interpretation of @gissoo's design — I was thinking the full details of edition/translation/discussion information would go on the scholarship records page, which we haven't implemented yet.
Technically, should we only show editor information on this page for a transcription we have on the PGP site? (I didn't worry about that because we don't have that information yet, but would be good to know if I'm thinking about it correctly).
@richmanrachel @thatbudakguy I created quick CSV exports of sources and footnotes currently in the test database.
if there is a url, that url should be set on the source (but now I wonder if that url should go on the footnote... for a source with multiple urls, if that occurs, url will only be set from the first occurrence)
- I agree we need to do something differently here, because right now the link just totally disappears.
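To make the current behaviour concrete, here is a minimal sketch of aggregating repeated references into one source while keeping only the first url seen. The `Source` dataclass, the `aggregate` function, and the input tuple shape are all hypothetical stand-ins, and matching on a case-folded citation is an assumption rather than what the import script does.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Source:
    citation: str
    url: Optional[str] = None
    footnotes: list = field(default_factory=list)  # ids of citing documents


def aggregate(rows):
    """rows: iterable of (document_id, citation, url_or_None) tuples."""
    sources = {}
    for doc_id, citation, url in rows:
        # Assumption: ignore case and extra whitespace when matching citations.
        key = " ".join(citation.split()).casefold()
        source = sources.setdefault(key, Source(citation=citation))
        # Only the first url encountered is kept; later ones are dropped.
        if url and source.url is None:
            source.url = url
        source.footnotes.append(doc_id)
    return list(sources.values())
```

Storing the url on each footnote instead would preserve every link, at the cost of repeating it wherever the same source is cited.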
@rlskoeser or @thatbudakguy - what do you mean by "source record and footnote are created similar to editor field"?
@richmanrachel 😆 oh no, I have no idea! wow, that is unclear
@rlskoeser - haha, no worries. I'm also not sure how to test "document relation includes both edition and translation"
But that finishes my first pass through... 3 or 4 hours later?!
@richmanrachel yeah, this is a rough one — I've spent a good chunk of the day trying to refine it and making a document with some of the cleanup that needs to be done!
Oh! I see now — the last two items pertain to the translator field, but the formatting in the note is definitely NOT making that clear (revised to try to make it better). With that context, do the instructions make more sense? Information from the translator column should be imported similar to editor, and entries from the translator column should have both edition and translation document relation set.
@rlskoeser - Great, that does work!
Have made some improvements from initial testing and feedback, and ran a fresh data import on the test site. Attaching csv exports with the source and footnote data from this latest run.
Closing this per conversation with @richmanrachel as a first pass. Will create new issues to track follow up work as needed, depending on possible change to approach using this bibliographic dataset for published content up to 2016 https://www.repository.cam.ac.uk/handle/1810/256117
🤦♀️ said I was closing this but did not actually close. I'll go ahead and close now, and we can create the new issues when we figure out the appropriate next steps.
testing notes
The import script should now import editor and translator information from the metadata spreadsheet. Please note that as of yet (April 29) it does not handle all cases — there is more work needed, probably some combination of manual cleanup and/or adding logic for more variations.
check a variety of documents with editor information, and confirm that:
check a variety of documents with translator information, and confirm that:
Is your feature request related to a problem? Please describe.
The primary motivation for this is so that researchers can find documents that are translated/edited by "trustworthy" sources.
Another major motivation noted by Alan is to recognize the ongoing labor of the geniza team who edit the transcriptions; sometimes data workers mark themselves as editors but sometimes not. There is a risk that some researchers may search for Goitein's transcriptions under the assumption that his work will be closest to "the truth", without realizing that those transcriptions have themselves later been improved upon by generations of PGP work. We want to be able to surface the work of the PGP team as editors.
Describe the solution you'd like
The data import script should create Person records for each person listed as translator or editor for a Document, and add them as translator or editor for that Document as appropriate.
Additional context
One possible future improvement suggested by Alan was looking at the edit history for the Bitbucket transcriptions to automatically harvest additional data about who has edited the transcriptions, so that we have a fuller record of editorship for some Documents.
dev notes
page_range to location, and revise help text
revisions after first-round testing: