Princeton-CDH / geniza

version 4.x of the Princeton Geniza Project
https://geniza.princeton.edu
Apache License 2.0

As a global admin, I want documents associated with language+script based on display name when importing documents from metadata spreadsheet. #106

Closed kmcelwee closed 3 years ago

kmcelwee commented 3 years ago

testing notes

Check that imported documents have the correct languages associated, based on what is in the spreadsheet. We recommend checking a variety of documents, starting with normal cases, including:

Also be sure to test the odder cases and outliers, including:

dev notes

rlskoeser commented 3 years ago

@richmanrachel I'm revising the document import to map languages from the spreadsheet to language+script based on display names, and have made the lookup case-insensitive as @kmcelwee recommended. That's resolved a larger number of the language mapping problems. Here are the few that are left, where we need help from you:

richmanrachel commented 3 years ago

@rlskoeser - a lot of this stuff turns out to be somewhat complicated:

Syriac: shows up frequently, but there is no matching display name. Which Syriac Language+Script combination should we map these to?

  • I think this may have to be done by a human with Syriac language training as I don't know which of the script styles are more common for this time/place. If it's easy for you to create a proper list of the ones that need to be checked, I'll try to find someone who can do this task.

Missing or not mapping properly: PGPID: 32605, Language: Amharic

  • I added Amharic to the language spreadsheet, but am waiting to hear back from an expert to know what to call the script.

PGPID: 30977, Language: Turkish

  • Added to spreadsheet (as it's in Latin script unlike other current Turkish entries), but we need to make a decision about whether to call this "Turkish" or "Modern Turkish".

PGPID: 29377, Language: Christian Palestinian Aramaic

  • Sent this to a friend who might help me figure out if this is a new language or counts as one of the versions of Syriac.

PGPID: 11135, Language: Sanskrit (I remember seeing some discussion of this! But don't know if it was resolved)

  • The resolution was that Sanskrit should not be a proper language for this document (it is only referenced in the description, as it's on a separate piece of paper, not the original manuscript). I'll ask Abigail to change the metadata sheet.

coptic numerals: I remember Marina said it was complicated and we should discuss, can this go on the meeting agenda?

  • Yes, I added it to the agenda.

two cases of where language is one or another; could these be edited so they will be pulled in as probable languages? PGPID: 31083, Language: Hebrew or Judaeo-Arabic

  • It's almost impossible for me to read. Asking Marina now, but probable languages and listing them both will likely make sense.

PGPID: 31232, Language: Greek or Coptic

  • Just sent out an inquiry.

rlskoeser commented 3 years ago

@richmanrachel thanks for looking into all of these! Here are the documents where the Language field includes Syriac:

| PGPID | Shelfmark - Current | Type | Tags | Description | Language (optional) |
| --- | --- | --- | --- | --- | --- |
| 31461 | CUL Or.1080 14.71 | Literary | #Arabic literary #Syriac | 19 small scraps, more or less neatly cut from a literary work. Many of them have Arabic on one side and Syriac (Garshuni script) on the other, with approximately the same line spacing and occasional use of red ink, suggesting that they belong to the same work or at least were written by the same scribe. The Arabic text on f.14v, one of the larger pieces, includes the sentence, "O Sergius, o most beloved of friends. I am abashed of. . . ." Needs further examination. | Arabic; Syriac |
| 31467 | CUL Or.1081 2.75.1 | Unknown | #Latin script | Mysterious page with various jottings in Hebrew script in a late hand, along with a few Latin characters (m, ma, mua). | Syriac |
| 31391 | CUL Or.1081 2.75.10 | List or table | nan | Small fragment of an account in Western Arabic numerals and what may be Ladino. | Syriac |
| 31472 | CUL Or.1081 2.75.12 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31473 | CUL Or.1081 2.75.13 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31475 | CUL Or.1081 2.75.16 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31477 | CUL Or.1081 2.75.19 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31468 | CUL Or.1081 2.75.2 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31478 | CUL Or.1081 2.75.20 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31479 | CUL Or.1081 2.75.21 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31480 | CUL Or.1081 2.75.23 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31481 | CUL Or.1081 2.75.24 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31482 | CUL Or.1081 2.75.26 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31483 | CUL Or.1081 2.75.27 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31484 | CUL Or.1081 2.75.28 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31469 | CUL Or.1081 2.75.3 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31411 | CUL Or.1081 2.75.30 | Literary | #Syriac #Garshuni | Fragment containing "the text of the Makherzonutho or Proclamation that a deacon chants prior to the reading of the Gospel. . . [from] the Book of Anaphora, the priest's manual rather than the Tekso deacon manual." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31486 | CUL Or.1081 2.75.31 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31487 | CUL Or.1081 2.75.35 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31470 | CUL Or.1081 2.75.6 | Literary | #Syriac | "The Or.1081 2.75 material contains schooling exercises: practices of the alphabet and ligatures, repeated phrases from liturgical hymns, and snippets of Psalm readings. The carelessness in writing is simply due to the fact that we are looking at a pupil’s hand. While our pupil(s) had yet to master the esthetics of calligraphy, they seem to have been thrown into writing longer texts as part of their schooling." See George Kiraz, "A Young Syriac Pupil in the Cairo Genizah: Or.1081 2.75.30," Fragment of the Month, August 2018. | Syriac |
| 31471 | CUL Or.1081 2.75.9 | List or table | nan | Fragment of an account in western Arabic numerals; no words. | Syriac |
| 32091 | T-S AS 204.351–56 | Literary | #Syriac | Liturgical text, Nestorian. In Syriac. See Sebastian P. Brock, “East Syrian Liturgical Fragments from the Cairo Genizah,” Oriens Christianus 68 (1984) pp. 58-79. Idem, “Some Further East Syrian Liturgical Fragments from the Cairo Genizah,” Oriens Christianus 74 (1990) pp. 44-61. Information from FGP. | Syriac |
| 29376 | T-S 16.319 | Literary | #CUDL #palimpsest #Syriac | Palimpsest consisting of the Palestinian Talmud, Peʾa 18d and 20b-c, written over a Syriac text, The Life of St Anthony by Athanasius of Alexandria. Edited in Lewis (1902: 146-149) as text XXXV. (Information from CUDL) | Hebrew; Aramaic; Syriac |
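
For anyone who wants to regenerate a list like the one above, here is a minimal sketch of filtering the metadata spreadsheet for rows whose Language field includes a given language. It assumes the spreadsheet is available as CSV with the `PGPID` and `Language (optional)` columns shown above, and that multiple languages are separated by semicolons (as in "Arabic; Syriac"). The function name `rows_with_language` is made up for illustration and is not taken from the project code.

```python
import csv
import io

# Column name as it appears in the spreadsheet header quoted above.
LANGUAGE_COLUMN = "Language (optional)"


def rows_with_language(rows, language):
    """Return the rows whose Language field lists `language`.

    Assumption: multiple languages in one cell are separated by ";".
    Matching ignores case and surrounding whitespace.
    """
    target = language.strip().lower()
    matches = []
    for row in rows:
        cell = row.get(LANGUAGE_COLUMN) or ""
        names = [name.strip().lower() for name in cell.split(";")]
        if target in names:
            matches.append(row)
    return matches


# Tiny inline sample using PGPIDs and language values quoted in this thread.
SAMPLE_CSV = (
    "PGPID,Language (optional)\n"
    "31461,Arabic; Syriac\n"
    "31232,Greek or Coptic\n"
    "29376,Hebrew; Aramaic; Syriac\n"
)

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
syriac_rows = rows_with_language(sample_rows, "Syriac")
```

Splitting on the separator before comparing, rather than doing a substring search, avoids accidentally matching a language whose name merely contains the search term.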
richmanrachel commented 3 years ago

@rlskoeser - I think everything but the Syriac texts without scripts is now resolved based on updates to the language spreadsheet.

rlskoeser commented 3 years ago

@richmanrachel finally circling back to this and wanted to confirm what I think we decided (based on looking back at meeting notes, searching Slack, and my memory) before I implement any changes.

richmanrachel commented 3 years ago

@rlskoeser - thank you for the followup!

Coptic numerals in the spreadsheet should be mapped to Greek/Coptic Numerals

  • Correct.

You proposed adding Unknown language + Hebrew script, which would be for the "Hebrew or Judaeo-Arabic" document, but I don't see it in the Language+Script spreadsheet; is this still the plan?

  • Sorry about that - yes, it's done now!

I think we proposed adding Syriac (Unknown script) for mapping the Syriac documents in the spreadsheet, but that hasn't been added to the Language+Script spreadsheet; is this still the plan? (I couldn't find where we discussed/decided this)

  • Alan talked to a Syriac professor and got some pointers on how to identify the script. Can we give him a couple weeks to just go through the current documents? Or should I add a Syriac (unknown script) for now?

Should the Turkish document be mapped to Modern Turkish?

  • Yes.

richmanrachel commented 3 years ago

Update: Alan would prefer that we have a Syriac (Unknown Script) category anyway for future researchers, so I'll add that now.

rlskoeser commented 3 years ago

Reran data import in QA with revised language mapping based on display name and (for a few cases) the new spreadsheet_name column I added to help with the import.

The only document that's still reporting a problem related to language is the "Greek or Coptic" one, which I know we're working on resolving. (I don't expect it to change language import logic.)

script output:

Imported 35 collections
Imported 49 languages
skipping PGPID 27264 (demerge)
... [lots more skipped] ...
ERROR language not found. PGPID: 31232, Language: Greek or Coptic
... [more skipped] ...
skipping PGPID 28089 (demerge)
Imported 29834 documents, 926 with joins; skipped 179
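
For readers following along, the lookup described in this thread can be sketched roughly as below. This is an illustration under stated assumptions, not the actual import code: the `LanguageScript` dataclass, `build_lookup`, and `map_languages` are hypothetical names, the semicolon splitting is inferred from spreadsheet values like "Hebrew; Aramaic; Syriac", and only the `spreadsheet_name` column and the error message format come from the comments above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageScript:
    """Hypothetical stand-in for a language+script record."""

    language: str
    script: str
    display_name: Optional[str] = None
    # Optional alternate name used only to match spreadsheet values.
    spreadsheet_name: Optional[str] = None


def build_lookup(records):
    """Index records by lowercased display name and spreadsheet name."""
    lookup = {}
    for record in records:
        for name in (record.display_name, record.spreadsheet_name):
            if name:
                lookup[name.strip().lower()] = record
    return lookup


def map_languages(pgpid, language_field, lookup):
    """Map a spreadsheet Language cell to records, case-insensitively.

    Returns (found_records, error_messages). Assumes ";" separates
    multiple languages within one cell.
    """
    found, errors = [], []
    for name in language_field.split(";"):
        name = name.strip()
        if not name:
            continue
        record = lookup.get(name.lower())
        if record is None:
            errors.append(
                f"ERROR language not found. PGPID: {pgpid}, Language: {name}"
            )
        else:
            found.append(record)
    return found, errors
```

With a lookup built this way, a cell such as "hebrew; SYRIAC" resolves regardless of capitalization, while a value with no matching display name or spreadsheet name (such as "Greek or Coptic") is reported instead of being silently dropped.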
richmanrachel commented 3 years ago

@rlskoeser - MR has a question: why are languages the only things that are clickable (rather than display name or script as well)?

rlskoeser commented 3 years ago

MR has a question: why are languages the only things that are clickable (rather than display name or script as well)?

By default, the Django admin makes the first column in the list view the clickable link to the edit form. We can change the column order or set other columns to be clickable as well. I wondered about using display name, but that's an optional field. We could use the auto-generated display name for records without a custom one set, but I wasn't sure if that would be confusing!
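
To make the fallback idea concrete, here is a small plain-Python sketch (not the project's model code) of a record that uses its custom display name when set and otherwise generates one. The `LanguageScript` class and the "language (script)" format for the generated name are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageScript:
    """Hypothetical stand-in for a language+script record."""

    language: str
    script: str
    display_name: Optional[str] = None  # optional custom name

    def __str__(self) -> str:
        # Prefer the custom display name; otherwise generate one.
        # The "language (script)" format is an assumption.
        return self.display_name or f"{self.language} ({self.script})"


with_custom_name = LanguageScript("Hebrew", "Hebrew", display_name="Hebrew")
without_custom_name = LanguageScript("Unknown", "Hebrew")
```

In the Django admin itself, the `list_display` option on a `ModelAdmin` sets which columns appear in the list view, and `list_display_links` sets which of those columns link to the edit form, so a value like this could be shown as a column and made the clickable one.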

richmanrachel commented 3 years ago

@rlskoeser - That's really helpful. I think the only language we have without a display name now is Unknown in Hebrew characters, for which we can create a display name.

Decision: Let's make display name the default and the only clickable one.

richmanrachel commented 3 years ago

@rlskoeser - I assume we'll need to retest once this is the case, so we shouldn't bother doing the rest of this testing now?

rlskoeser commented 3 years ago

Do you still want language & script displayed on the list view? Or only display name?

rlskoeser commented 3 years ago

Actually, language+script admin display is not part of this story! Please test whether the import is working properly and assigning the correct languages.

You can open a new issue to request adjusting the language+script list display.

richmanrachel commented 3 years ago

"Hebrew or Judaeo-Arabic" was officially changed to "Unknown: Hebrew script" but it all works!