CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases
Fall 2019 Publication Thread #27

Closed ctschroeder closed 4 years ago

ctschroeder commented 5 years ago

Timeline:

Version 3.0, version date 2019-09-30

List of materials as of 16 September.

PATHS texts (see #26 @ctschroeder )

Later in Fall

ctschroeder commented 4 years ago

I would like to add "copyist" or "scribe" to the metadata for the relevant texts in Marcion copied by Victor son of Mercurius (Onophrius, Cyrus, possibly others). See Layton's catalog https://www.dropbox.com/s/s7gdapyphgpc3mb/pLondCopt%20II%20%28Layton%29.pdf?dl=0. Everyone please let me know ASAP if you have any objections.

cluckmarq commented 4 years ago

No objection.

ctschroeder commented 4 years ago

Also the list of materials for publication is now set. I am going to check to see if someone can review my doc from Johannes. Can everyone working on docs for this round of publication be sure that the docs are appropriately tagged in GitDox as "review"? @amir-zeldes the treebank corpora don't need review (except the Cyrus and Onophrius docs which need metadata revisions). Can you or @lancealanmartin please check that all the treebank docs except Cyrus and Onophrius have the correct version_n and version_date and are marked "review"? Thanks!! Once that is done, they can be labeled "to publish" in GitDox and checked off here on the top of the thread. If you can't do it I'll get to it later this week or next week -- just let me know the scoop. Thank you so much!!!

amir-zeldes commented 4 years ago

Adding scribe sounds fine to me (sounds better than copyist to my ears, but I'm fine with either).

I have reviewed all of the re-release documents (Shenoute, AP, Besa) and made sure the version is 3.0.0 + dated if they have been edited since last release, so those should be all good. The new Budge materials are either recent additions to the treebank (onno1, cyrus1, ephraim, respose), or they have been checked by either Lance or me, so I think they are OK as 'checked' and only need metadata review, no linguistic review necessary at this point.

Now assigned for (metadata) review to @ctschroeder :

The following are assigned to others for sentences/translation, but also need metadata review:

Thanks!

ctschroeder commented 4 years ago

thx so much @amir-zeldes! I will deal with Proclus and Victor when others are done with them. Whoever's working on them can assign them to me when done.

ctschroeder commented 4 years ago

Greetings @amir-zeldes @lancealanmartin @eplatte @bkrawiec @cluckmarq. I'm working on URNs for the Marcion material. Part of the CTS URN is the "text group" and part is the "edition." A few questions have arisen. Apologies for the long post! Replies requested by the end of the week if humanly possible. This comment contains a fair bit of info in regular text with precise questions in bold.

This is PART 1. There may be a PART 2 as I work through the other texts.

For texts with known, identified authors, the "group" is the author. So urn:cts:copticLit:besa.aphthonia.monbba refers to the text "Letter to Aphthonia" in the text group "Writings of Besa" in the edition manuscript MONB.BA. For the material edited by Budge in Marcion we have two questions: What "text group" to designate and what "edition"?
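For readers less familiar with the URN layout described above, here is a minimal sketch that splits a CTS URN into namespace, text group, work, and edition. The `CtsUrn` type and `parse_cts_urn` function are hypothetical names made up for illustration and are not part of any Coptic SCRIPTORIUM tooling; any passage component after a further colon is simply ignored.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Illustrative container for the parts of a CTS URN (hypothetical type)."""
    namespace: str
    text_group: str
    work: Optional[str]
    edition: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split e.g. urn:cts:copticLit:besa.aphthonia.monbba into its parts."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    # The fourth colon-separated field holds textgroup.work.edition
    fields = parts[3].split(".")
    if not fields[0]:
        raise ValueError(f"missing text group in: {urn!r}")
    work = fields[1] if len(fields) > 1 else None
    edition = fields[2] if len(fields) > 2 else None
    return CtsUrn(namespace, fields[0], work, edition)


besa = parse_cts_urn("urn:cts:copticLit:besa.aphthonia.monbba")
print(besa.text_group, besa.work, besa.edition)
```

With the Besa example, the text group is `besa`, the work is `aphthonia`, and the edition is `monbba`. A URN with no edition field comes back with `edition` set to `None`.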

1) Edition (the simplest question): for the Martyrdom of Victor we used Budge as the edition. I suggest we do the same with the rest of the Marcion materials from Budge. Please let me know if you have objections to "Budge" as edition in the URN. No need to reply if this is fine.

2) For the "group" for each text/work we have a few options:

Thank you!! Possibly more tomorrow on other Marcion texts/works.

eplatte commented 4 years ago

I like lives for the text group for Onophrius and Cyril, and Ephrem for the spelling and psephrem for the text group for the epistle. Budge also makes sense for the edition.

amir-zeldes commented 4 years ago

Agreed on lives, budge and adding a pseudo prefix. For the spelling of Ephraim, I feel like we've been using mostly Latin spellings for some reason (onnophrius with 'u', cyrus with 'cy' and 'u'), so something like ephraem or ephrem seems more consistent than 'ai'. Whatever makes more sense as the 'Latin' form I would say.

amir-zeldes commented 4 years ago

OK, auto sentence spans are now added to paths. Some things to note:

  1. Quality depends on three things:
    • How well the NLP did/how badly segmented the original was (good: e.g. Aphou, less good, e.g. Longinus)
    • Whether or not there's punctuation (Phib is the best: good NLP, punctuation; Aphou is not as good - no punctuation)
    • Luck (really, coincidental similarity to the limited training data)
  2. When quality is bad, and especially if there's no punctuation at all, there are sometimes super-long sentences. I manually broke up 3-4 instances where 'sentence' length was >400 words. This was mainly a problem in Paul of Tamma (no punctuation, and for some reason the sentences went very long stretches without breaking)
  3. I should note the sentencer is biased towards caution: it prefers to abstain when things look murky, and the upshot is it makes fewer truly crazy splits.
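The manual check for runaway sentences mentioned in point 2 could be approximated along these lines. This is only a sketch under stated assumptions: sentences are represented as plain lists of word strings, the function name `find_overlong_sentences` is hypothetical, and the 400-word threshold is taken from the comment above rather than from any actual pipeline setting.

```python
from typing import List, Sequence

# Threshold taken from the comment above; not a setting of the real sentencer.
MAX_WORDS = 400


def find_overlong_sentences(
    sentences: Sequence[Sequence[str]], max_words: int = MAX_WORDS
) -> List[int]:
    """Return the indices of sentence spans longer than max_words words."""
    return [i for i, words in enumerate(sentences) if len(words) > max_words]


# Toy data: the second 'sentence' is 401 words long and gets flagged.
doc = [["a"] * 10, ["b"] * 401, ["c"] * 400]
print(find_overlong_sentences(doc))
```

The flagged indices would then be candidates for manual splitting, as was done for Paul of Tamma.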

This all means we can now have the analytic vis for Paths. Note that because we do not have chapters, and the p tags (which seem fairly random) do not coincide with auto-sentences, we do not have a verses view for this data at the moment.

ctschroeder commented 4 years ago

@amir-zeldes I'll take a look this week about the chapters in PATHS. It sounds like other than that and the metadata, they are done? We are talking about Paul of Tamma, Phib, Aphou, Longinus and Luke (or no Longinus and Luke -- https://github.com/paths-erc/coptic-texts/blob/master/cc0418.xml). Thanks.

amir-zeldes commented 4 years ago

Yes, since we're releasing this as auto NLP they are basically done. If you want to do chapters let me know, but time is getting short - if so, they should properly nest 'translation' so we can do the blockified (non-numbered) verses view. Thanks!

And I think it is Longinus and Luke, the TEI header there is incorrectly copy-pasted from another file, right?

ctschroeder commented 4 years ago

Yes re Longinus and Luke.

Re chapters: part of the issue is the document URN usually includes the chapters, but we can skip that and just use the edition namespace as the end. Am wondering if the edition should be "CMCL" since it's taken from Tito Orlandi's editions (see for example this referenced in the paths header for Paul of Tamma) http://www.cmcl.it/~cmcl/paolotamma1.PDF

ctschroeder commented 4 years ago

or should the edition be "paths"? I think this is the best strategy, actually. Something like urn:cts:copticLit:lives.pauloftamma.cmcl or urn:cts:copticLit:lives.pauloftamma.paths

amir-zeldes commented 4 years ago

I also think it should be paths, since it includes paths annotations (e.g. their entity schema) and we don't actually know what processing steps happened between CMCL and their version. Saying it's paths is the simplest statement, and Paths's provenance from CMCL is something that should be described by Paths IMO

ctschroeder commented 4 years ago

Bingo

lancealanmartin commented 4 years ago

I can add PATHS as the edition. What should the collection be?

ctschroeder commented 4 years ago

hello, @amir-zeldes. Johannes.canons is ready for viz check (any documents with to_publish or review status). Beth is reviewing the doc needing review. Thanks so much!

ctschroeder commented 4 years ago

ok thanks all! And thanks for your patience. I've been doing more research. So here are some URNs for what we have discussed from Marcion:

Please let me know if you have preferences for the title of this letter of ps-ephrem and for the "work" namespace for the URN for psephrem.

For the Dormition of John: what should we use as the text group? I am resistant to "apocrypha" because that is applicable but fuzzy. Any suggestions? (This is the problem with the CTS URN system -- it's a "canonical text services" urn model.) One option is "unknown"; one option is something more like Perseus (which has multiple anonymous/unknown author groups depending on source of material). They typically use non-semantic fields in their namespaces (so, Life of Cornelius by Tacitus is urn:cts:latinLit:phi1351.phi001, with phi1351 being the text group, i.e. Tacitus). They have several anonymous/unknown classifiers for text groups -- see https://catalog.perseus.org/catalog/urn:cts:latinLit:phi0990. We could do something similar, but that would make things messier: some URNs in that namespace would have clear semantic meanings and some would not.

I think I would prefer a textgroup that does not correspond to genre/etc. than try to come up with a good semantically understandable genre group name. So something like urn:cts:copticLit:marcunknown.dormjohn.budge, urn:cts:copticLit:unknownm.dormjohn.budge, or urn:cts:copticLit:marc001.dormjohn.budge are all preferable to urn:cts:copticLit:apocrypha.dormjohn.budge . Then everything we get from Marcion that is an unnamed author and not lives or martyrdoms can go there. Thoughts?

amir-zeldes commented 4 years ago

I'm OK with psephrem, and 'letters' (similar to Besa) or 'letter' if it has to be (I think there might be more than one somewhere, since it says 'another epistle...'(?))

For dormition, I'm not sure using 'marc' to denote Marcion is the best here, since it seems a bit coincidental that this thing is in Marcion and something else isn't. The reason it's in Marcion is that it was in the same Budge edition that was available to the project, so 'budge' (though normally an 'edition' part) almost makes more sense to me as a text group than 'marc'.

Thinking about it that way, isn't apocrypha maybe better than 'marc'? We could also do 'misc' or 'other' or anything else, or decide that dormitions are a genre ('dorm')?

ctschroeder commented 4 years ago

To @amir-zeldes. Regarding pseudo-ephrem: We don't use "letters" for that "work" namespace for Besa. We use the topic of the letter: food, aphthonia, etc. (There's only one letter attested in Coptic. There are other pseudo-ephrem documents, though, and Budge says "another" because this text follows another pseudo-ephrem work--the ascetica--in his edition. But everything in parentheses here is beside the point, just an fyi.) So the choice for that namespace is "letter" or "to_disciple"/something indicating the topic as in the Besa corpus.

Regarding Dormition of John: I do not want to use apocrypha or dormitions. I don't want to get in the habit of creating text groups that may or may not have existed in this way in antiquity. I'm ok with lives, letters, and martyrdoms, because while there is some fluidity there is also some sense they are genres in antiquity. Apocrypha is not a genre. Dormition, eh.... I mentioned including Marcion in some way because if you follow the links in Perseus you'll see that that's what they do -- they have a bunch of "misc" categories/numbers, which seem to correspond to the different sources for their "misc" texts. If apocrypha and dormition are out, what do you think? misc?

amir-zeldes commented 4 years ago

Between misc and marc, I prefer misc. That says something semantically (e.g. "this is not one of the other established text groups"), whereas marc is a statement about a historical coincidence of what happened to be digitized in Marcion, all while other things that were digitized in Marcion do not get the same group. So for me 'misc' is definitely the better option.

ctschroeder commented 4 years ago

Greetings! I'm currently updating corpus metadata. It looks like someone else is also adding corpus metadata, which is wonderfully helpful. A couple of tips: the dates for version_date always have hyphens (30-9-2019), we use full names for annotators (Elizabeth Davidson vs Liz), and we include all annotators in the corpus (so we need to check each document to be sure the doc annotators are all included). Thanks so much! If you are working on corpus metadata, please mark the document for "review" and assign it to me to check when you are done. Thanks again for the help!

ctschroeder commented 4 years ago

ugh -- I meant 2019-09-30 in that previous comment. My apologies!

Also @amir-zeldes have the people who treebanked the documents listed at the beginning of this thread been added to the annotation metadata for documents and corpora? I looked at a couple AP and the Abraham files, and I don't see new names in the metadata. If it's someone already listed, that's great. If not, the names need to be added to corpus and document metadata before those corpora can be released. I'll add a note above as well. Thanks so much. (I myself don't know who's done what.)

amir-zeldes commented 4 years ago

version_date (and _n) has a validation, so that should get automatically flagged if someone used the wrong format.
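As an illustration of the kind of check involved, here is a minimal sketch that accepts only dates in the 2019-09-30 form. It is an assumption about what such a rule might look like, not the actual GitDox validation; `is_valid_version_date` is a hypothetical name.

```python
import re
from datetime import datetime

# Four-digit year, two-digit month, two-digit day, separated by hyphens.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_version_date(value: str) -> bool:
    """True only for real calendar dates written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        # strptime rejects impossible dates such as 2019-02-30
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


for candidate in ["2019-09-30", "30-9-2019", "2019-9-30", "2019-02-30"]:
    print(candidate, is_valid_version_date(candidate))
```

The regular expression enforces the zero-padded layout, and the `strptime` call catches strings that have the right shape but are not real dates.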

According to a SQL query on the database, there are now no longer any documents with 'Liz', so that should be fine, but yes let's remember to always do full names!

Treebanking info:

Of these, everything was already in corpus metadata except 1Cor, which had no corpus metadata at all. I copied it over from Mark and added all of the treebankers + Carrie, but I'm not sure who else has added 1Cor without treebanking (that's just who I'm seeing in the documents). Feel free to add if you know someone else!

ctschroeder commented 4 years ago

Thank you Amir! (I don't believe corpus metadata errors crop up in validation.) I will check 1 Cor annotators.

lancealanmartin commented 4 years ago

I did entity annotation for the first three chapters of both 1 Cor and Mark as well as shenoute.fox. Should I add my name to these docs?

ctschroeder commented 4 years ago

Yes @lancealanmartin please add your name to any document you edited, and then also add it to the corpus metadatum for annotation. Giving full credit to everyone is a major principle of ours!! Most documents have the primary annotator first, subsequent annotators in the middle, and the senior editor(s) who reviewed the document (usually Amir or me, sometimes Beth) as the last name.

amir-zeldes commented 4 years ago

I have no issues with adding Lance to those documents, as entity annotations will one day be released, but just to clarify, those entity annotations are not currently available in the online corpora.

As for annotator order: I'm embarrassed to say I seem to have had this wrong. I think anything where I added the names I did alphabetically by last name... Since Carrie and I are alphabetically relatively high, this may often match the pattern Carrie is mentioning, but anything I added annotation/translation to is probably just alphabetic. Also, in the repo interface, these things get split up and are findable separately no matter the order they are listed in inside the field.

ctschroeder commented 4 years ago

No worries. I think order is primarily a big deal for manually edited documents rather than the automated ones and especially by junior folks; I try to keep an eye out for this during publication.

ctschroeder commented 4 years ago

@amir-zeldes the Marcion corpora are ready and should be frozen. Marcion corpora that are also in the gold treebank corpora will need metadata updated for the treebank files. TY!

ctschroeder commented 4 years ago

Hi @amir-zeldes I'm almost done with the johannes corpus -- checking visualizations, and I noticed that the new document is not in ANNIS. I see that there are 8 docs in the private instance and in the public one. I checked and FA215-224 is missing from the private instance. Thank you!

amir-zeldes commented 4 years ago

Got it. Try again now

ctschroeder commented 4 years ago

Oh goodness that was a doozy. I think due to the page layer being labeled pb_n instead of pb_xml_id. I hope that fixed it.

Also I am really sick (v sore throat) and so while Johannes is done the rest will have to wait for tomorrow.

amir-zeldes commented 4 years ago

Oh no, it's been going around here too. Feel better!

New version with the fix is already online.

ctschroeder commented 4 years ago

Johannes is good to go!

amir-zeldes commented 4 years ago

Thanks - right now TEI is not validating due to having chapter_n but not verse_n. We could revert it to 'p' mode, without chapters, but is there a reason the verses are 'ignore:'ed?

ctschroeder commented 4 years ago

Hi. Are we talking about Johannes or everything? For Johannes they’re ignored because I started and didn’t finish once we decided we didn’t need verse numbers for this release.

Re TEI this must be common for all the documents that don’t have verses? This is odd because I don’t remember this as a problem in the past. I’m also really too sick to brainstorm at the moment. Do what you think is best.

amir-zeldes commented 4 years ago

The decision is per corpus, so we can either switch off verse numbers in 'verses' for all documents, or I'm happy to add consecutive numbers to verses in each chapter myself if that would solve it. Also, if only one document doesn't have verses, its TEI would have to look different from other documents in the corpus. Just give me your OK and I will add verse nums (they're mostly already there, I can easily finish)
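Adding consecutive verse numbers within each chapter could be sketched roughly as follows. This assumes each verse span is represented only by the chapter label it falls in, listed in document order; the function name `number_verses` and that data layout are hypothetical and do not reflect how GitDox stores spans.

```python
from typing import Iterable, List, Optional, Tuple


def number_verses(chapters: Iterable[str]) -> List[Tuple[str, int]]:
    """Pair each verse span with a verse number that restarts at 1 per chapter.

    `chapters` gives the chapter label of every verse span in document order.
    """
    numbered: List[Tuple[str, int]] = []
    previous: Optional[str] = None
    counter = 0
    for chapter in chapters:
        if chapter != previous:
            # A new chapter starts, so verse numbering begins again.
            counter = 0
            previous = chapter
        counter += 1
        numbered.append((chapter, counter))
    return numbered


print(number_verses(["1", "1", "2", "2", "2"]))
```

Each resulting pair would correspond to a chapter_n and verse_n value, so that every span with a chapter number also carries a verse number.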

Feel better!

ctschroeder commented 4 years ago

I’m not confident the numbers I have already are good sentences. If you have time to check please be my guest!

ctschroeder commented 4 years ago

@amir-zeldes it looks like we messed up the language/languages consistency in corpus metadata again. Is there an easy fix, or should I go back through all of them and check manually?

ctschroeder commented 4 years ago

@amir-zeldes sorry to bother you again but it appears the treebank annotators have not been added to document metadata in all the items. I'm noticing this in Mark. You've listed treebankers by corpus above, but I don't know which docs belong to whom. Can you please check the document level metadata to be sure the treebankers have been added? https://github.com/CopticScriptorium/corpora/issues/27#issuecomment-535609391 Thank you!

ctschroeder commented 4 years ago

(This may mean the corpora we thought should be frozen need to be fixed. I assumed the treebank folks had been added to doc level annotation.)

amir-zeldes commented 4 years ago

OK, I will look into these tomorrow

ctschroeder commented 4 years ago

A few final things for this evening:

  1. I'm noticing some corpora that are not ready are on public ANNIS. I'm guessing they are supposed to be behind the password and there was some glitch? At any rate, can they be removed right away? They are: AP, life of L&L, life of Phib, Mark (see below), 1 Cor?

  2. red alert! unfreezing:

    • there was a problem with Cyrus (now fixed and ready to be reprocessed for publication)
  3. I checked the other Treebanked docs to see if the treebankers were in the doc level metadata; for 1 Cor and AP I couldn't tell bc there were many docs edited and multiple treebankers. Again see top post (https://github.com/CopticScriptorium/corpora/issues/27#issue-454804372)

  4. I could not commit part 1 of Longinus & Lucius. No clue why not. It gave me a GitHub error. Can you please commit part 1? Then it will be ready to publish.

amir-zeldes commented 4 years ago

  1. ANNIS is a glitch from concurrent ANNIS4 security manager (you are actually seeing what's in ANNIS4 right now, including some non-ready tests). Now reverted, sorry about that.
  2. OK, Cyrus is reconverted and Mark + 1Cor document annotators are checked for the treebanked parts (Chap. 1-6 in both)
  3. 1 Cor and AP are good to go (were already correct)
  4. It seems that with the added annotations, it is now too big to commit via the API... I've committed it manually for now, but I'm opening an issue here https://github.com/gucorpling/gitdox/issues/155

PS - oh, weird, now that I've manually committed, I can actually commit small changes to Longinus, presumably because the diff is small(?)..

amir-zeldes commented 4 years ago

RE language/languages:

Was it intentional for corpora to have 'languages' to differentiate from the document level metadatum? In ANNIS, metadata queries just 'apply', so if the two fields conflict and are called the same, it's possible exact meta-based searches will actually yield zero results for these.

Let me know your thoughts about what to do and I can try to apply it.

ctschroeder commented 4 years ago

Beth did some digging into this a year ago. I don’t remember the logic, but we went with language for doc/languages for corpus. It gets mangled in corpus metadata bc that can’t be validated.
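Since corpus metadata can't be validated automatically, a small offline check of the naming convention above might look like this. It is a sketch only: the dictionary layout (corpus name mapped to its metadata fields) and the function name `misnamed_language_keys` are assumptions for illustration, not an export format of GitDox or ANNIS.

```python
from typing import Dict, List


def misnamed_language_keys(corpus_meta: Dict[str, Dict[str, str]]) -> List[str]:
    """List corpora whose metadata breaks the 'languages for corpus' convention.

    A corpus is flagged if it carries the document-level key 'language'
    or lacks the corpus-level key 'languages'.
    """
    return sorted(
        name
        for name, meta in corpus_meta.items()
        if "language" in meta or "languages" not in meta
    )


# Toy metadata with made-up values; only the key names matter here.
example = {
    "corpus_a": {"languages": "Coptic"},
    "corpus_b": {"language": "Coptic"},
}
print(misnamed_language_keys(example))
```

Running something like this before a release would show which corpora need the key renamed by hand.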

ctschroeder commented 4 years ago

Also thanks so much for all of this! I will be offline almost all day. I think I’ve done everything I can (except for those additional 3 paths texts). Please ping me if you need anything and I’ll check in tonight. Take care.

amir-zeldes commented 4 years ago

Sounds good! Which 3 texts though? I think there's Longinus and Phib, which have chapter numbers from PATHS (p_n), and Aphou and Paul, which have unnumbered paragraphs that we made (just p)

ctschroeder commented 4 years ago

Greetings from the airport. see the corpora checked/not checked at the top of the thread. L&L should be the only one checked/ready. That checklist at the top should be the final list. I’ve done everything I can on all the docs except the 3 unchecked PATHS texts. Leave those three alone. Everything else is either ready or has items to check off that only you can do. Good luck.