Closed ctschroeder closed 3 weeks ago
@LCBM0828 as discussed, the items on the top are top priority. If there is an OCR'd document with full-auto NLP and good metadata that would be easy to figure out versification for, you can get that ready. Please add it to the list, make a release thread in the relevant repo, and link the release thread to this list. Thanks! (for full auto: editorial review is basically metadata, validation, and versification issues.)
@amir-zeldes
The other Marcion material may need to wait because the metadata requires some research, but I'll try. I put them lower on the list because they require more research than the others.
Done - it's Mark and 1 Cor (all 16 chapters of both are completely reviewed for segmentation and partly treebanked/tagging checked).
@amir-zeldes we need to decide on a version number and date. It is going to take me 2-3 weeks from now I think to actually finish reviewing everything and add all the metadata. What do you think about the date? And should the version number be 4.6 or 5.0 (since we will be releasing Bohairic?)
@amir-zeldes great thanks. What do you think about the URNs for Bohairic? :) will the edition namespace be sufficient?
urn:cts:copticLit:ot.ruth.coptot:1 is the urn for the Goettingen Coptic OT project's edition, which is a mishmash. "coptot" is in the edition namespace to indicate that. Marcion's NT is from Horner so we would have Horner_ed in the edition namespace. Does that sound ok?
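For reference, here is a minimal sketch of how the URN above splits into its CTS parts, which shows where an edition label such as horner_ed would sit. The `CtsUrn` tuple and `parse_cts_urn` helper are illustrative names only, not existing project tooling, and the parser assumes the simple `urn:cts:namespace:textgroup.work.version:passage` shape.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Parts of a CTS URN (hypothetical helper type for illustration)."""
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]   # the edition/translation slot, e.g. "coptot"
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy, and passage."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    fields = parts[3].split(".")
    textgroup = fields[0]
    work = fields[1] if len(fields) > 1 else None
    version = fields[2] if len(fields) > 2 else None
    # The passage component is optional.
    passage = parts[4] if len(parts) > 4 and parts[4] else None
    return CtsUrn(namespace, textgroup, work, version, passage)


ruth = parse_cts_urn("urn:cts:copticLit:ot.ruth.coptot:1")
print(ruth)
```

In this layout the edition label is the only place below the namespace where something like "coptot" or "horner_ed" can go.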
I think 5.0 is appropriate. We can have a placeholder for the date - as long as it's a unique string I can batch replace it in the database for all docs, though we would need to recommit everything once it's finalized. Alternatively we can aim for April 30 or something?
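A rough sketch of the placeholder idea, assuming each document's metadata is available as a plain key/value dict. The placeholder string `VERSION_DATE_TBD` and the helper `finalize_version_date` are made up for illustration, and the actual database replacement may work quite differently.

```python
from datetime import date

# Hypothetical unique placeholder; any string that cannot occur elsewhere would do.
PLACEHOLDER = "VERSION_DATE_TBD"


def finalize_version_date(docs, final_date, placeholder=PLACEHOLDER):
    """Return copies of the metadata dicts with the placeholder replaced."""
    date.fromisoformat(final_date)  # raises ValueError if not a valid ISO date
    return [
        {key: (final_date if value == placeholder else value)
         for key, value in meta.items()}
        for meta in docs
    ]


docs = [
    {"version_n": "5.0.0", "version_date": PLACEHOLDER},
    {"version_n": "5.0.0", "version_date": "2024-04-01"},
]
updated = finalize_version_date(docs, "2024-04-30")
print(updated)
```

Only values that exactly equal the placeholder are touched, so dates that were already filled in stay as they are.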
For the URNs it's not so easy... I could see an argument for a top level distinction, but then we should have done copticSahidicLit and copticBohairicLit, and it's a bit late for that. As it stands, there will be no good CTS way to capture all of the Bohairic data in the future. Using horner_ed for Horner sounds fine to me, I'm just a bit unhappy there's no indication of the dialect in the CTS hierarchy...
Yes I understand. I think we went with semantic URNs since the field is such a mess compared to Greek. Then there was also the discussion launched by Frank Kammerzell back at Digital Coptic 1 in Berlin about the wisdom of categorizing by dialect to begin with. So I understand why we did what we did in retrospect. It is hard to anticipate all issues.
There are other URN systems we can apply in addition to CTS, like CITE2, though their main website seems to be down. I suppose we can email them -- we know them.
We will at least have "bohairic" in other metadata.
On that note, we do not have a "dialect" (or "assigned dialect") field in the document metadata -- should we be adding that to all our documents?
OK, sounds fine, I ultimately use the metadata for a lot of filtering like that, for example to train tools only on gold or also checked data, mix data for augmentation, etc.
We don't have dialect, but language has a value like "Sahidic Coptic", so I assumed we would be filling "Bohairic Coptic" there instead.
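To illustrate the kind of filtering meant here, a small sketch that selects documents by the `language` value and by review status. The `status` key and its values are placeholders for whatever fields actually record gold or checked data, and `select_docs` is a hypothetical helper.

```python
def select_docs(docs, language, statuses):
    """Keep documents whose language and review status both match."""
    return [
        doc for doc in docs
        if doc.get("language") == language and doc.get("status") in statuses
    ]


# Toy metadata records; the document names are invented for the example.
docs = [
    {"name": "doc_a", "language": "Sahidic Coptic", "status": "gold"},
    {"name": "doc_b", "language": "Sahidic Coptic", "status": "checked"},
    {"name": "doc_c", "language": "Bohairic Coptic", "status": "gold"},
]

sahidic_gold = select_docs(docs, "Sahidic Coptic", {"gold"})
bohairic_any = select_docs(docs, "Bohairic Coptic", {"gold", "checked"})
print([d["name"] for d in sahidic_gold], [d["name"] for d in bohairic_any])
```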
I won't have any AP in this release. Sorry all!
@EarlyCodices & @amir-zeldes Please see https://github.com/CopticScriptorium/bible-dev/issues/74 re 1Cor
@amir-zeldes & @EarlyCodices Please see https://github.com/CopticScriptorium/bible-dev/issues/75 re Mark
@amir-zeldes This Great House is ready. Both docs validate but we need a new release date. Please pick one for the version_date in doc and corpus metadata and lmk
@amir-zeldes Witness ready
@amir-zeldes shenoute.crushed ready
@amir-zeldes a22 ready except for identities in two docs see here
I had a look, all done - see my comments in the other thread
@ctschroeder do you know which new AP are slated for release this time? A lot of them have nothing checked in the AP issues so I'm not sure which if any we're releasing?
@amir-zeldes working on it :)
OK, I won't touch those yet then, just setting up my staging area for the release...
@amir-zeldes AP corpus is ready
@amir-zeldes So Listen is ready
What about abraham? There have been some edits and I usually release all corrections to existing corpora on each release, but there are different document statuses and assignments there. Can I just take a snapshot of what's there now and call it 5.0.0? It's fine if there are still future corrections pending, as long as what's there is valid.
@ctschroeder Another question: page numbers in errs have square brackets like XG[336], which doesn't validate; can we take out the brackets?
Also, house is missing URN, witness and Trismegistos (is it none?)
@amir-zeldes I see these questions and will get to them tonight or tomorrow. I have one more administrative deadline today
> What about abraham? There have been some edits and I usually release all corrections to existing corpora on each release, but there are different document statuses and assignments there. Can I just take a snapshot of what's there now and call it 5.0.0? It's fine if there are still future corrections pending, as long as what's there is valid.
@bkrawiec you were working on YA 525-30 in Naples. Is there anything else you need to do?
@amir-zeldes @ctschroeder I don't think anything I was doing in Naples was meant for this release, since I thought the deadline for this release was April 1. I would have to go back and recall everything--I don't want to hold things up.
@bkrawiec no worries. Whatever you did is rolling forward into this release anyway bc we have corrections to that corpus. If you have more edits based on your work in Naples those can go in a later release.
@amir-zeldes abraham is ready. I don't know what was going on with ZH that the version n & date were missing but it's fixed
> @ctschroeder Another question: page numbers in errs have square brackets like XG[336], which doesn't validate; can we take out the brackets?
The brackets are there because the page numbers are not visible but the pages' locations have been reconstructed based on paleography. I'm almost positive we have done this with some of the johannes corpus, as well.
Is it that it doesn't validate according to our rules or it really cannot validate in TEI, PAULA, etc, XML?
> Also, house is missing URN, witness and Trismegistos (is it none?)
@amir-zeldes I am not understanding the question. I've checked the documents in gitdox and the document history in GitHub and both shenoute.house docs have had urns, witness, and TM fields filled since I last committed them. There is an older doc in GitHub because one was misnamed, but I thought you processed the release from GitDox not GitHub
> Is it that it doesn't validate according to our rules or it really cannot validate in TEI, PAULA, etc, XML?
If it's an xml:id then I believe it cannot validate in TEI, and that's how we've had our data model so far. This does not affect PAULA or SGML, since key names are actually XML values in PAULA, and not for SGML since it has no value pattern restrictions except for mandatory escaping. So basically: either we replace the brackets with something else (underscores or something?) or we change the data model and rename the page number attribute to something other than xml:id.
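A quick sketch of why the brackets fail and what the underscore option would look like. An xml:id value has to be an NCName, which cannot contain square brackets; the pattern below is an ASCII-only simplification of that rule (real NCNames also allow many non-ASCII letters), and both helper names are made up for the example.

```python
import re

# ASCII-only approximation of the XML NCName production:
# starts with a letter or underscore, then letters, digits, '.', '-' or '_'.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_xml_id(value: str) -> bool:
    """Check a candidate xml:id against the simplified NCName pattern."""
    return NCNAME_ASCII.fullmatch(value) is not None


def replace_brackets(page_id: str) -> str:
    """Swap square brackets for underscores (one possible convention)."""
    return page_id.replace("[", "_").replace("]", "_")


original = "XG[336]"
candidate = replace_brackets(original)
print(original, is_valid_xml_id(original))
print(candidate, is_valid_xml_id(candidate))
```

Underscores are only one candidate replacement. The reconstructed status of the page number would then have to be recorded somewhere else, or the attribute renamed as suggested above.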
> both shenoute.house docs have had urns, witness, and TM fields
Oh whoops, I meant shenoute.errs.XG336-343, got the corpora mixed up, sorry!
That corpus is not checked off and the doc is not listed as to_publish so no, it’s not ready
OK, but it's intended to be included in this batch still right?
Yeah there was just a lot to review. I’m halfway done — hope to be done tonight.
Great, no rush!
@amir-zeldes I am still finishing Errs, need to check the metadata for PS as per our email.
I am also waiting for you to let me know that Bohairic 1 Cor and Mark are ready for me to check the metadata one last time.
Plus other issues under Fixes & edited corpora.
@amir-zeldes errs is done. Gold Ruth is done (but see my note in #103). I have checked corpus metadata for all docs listed to_publish in Gitdox; all are good except:
oops have to add I did not realize you were releasing the Treebank corpus @amir-zeldes . Did this issue for the treebank get addressed before release?
Hi! Quick answers:
@amir-zeldes 1 Cor is ready
@amir-zeldes Mark ready. All docs of both 1Cor and Mark need reimporting
@amir-zeldes for Pistis Sophia: the first three documents are good to go. You can use the metadata in part 4 of Book 1 in all the rest moving forward (if all the rest have the Mead translation from gnosis.org -- if not, lmk). Please do not use the metadata in Part 4 for Parts 1-3 of Book 1. Thanks!
OK, we should be all set! Please take a look and if everything looks OK we should be able to announce the release now!
Sorry this got buried in a wave of email. Everything I checked looks good.
Great - shall we announce the release? I'm at a conference this week, so I'm happy to let you, Lydia, or Nick do it!
Timeline
Sentence splitting script completed by early March
Initial corpora deadline March 31??
Data freeze April 15 ish??
Version information
~Currently labeling files as v 4.6.0 but we can shift to 5.0 if we release Bohairic and deem it a major revision~
version is 5.0.0
Currently using 2024-04-01 as version date but ~can~ should be changed
Corpora
Possibly:
edited corpora (not new)
Potential division of releases spring/fall: Spring release material hanging over from last year; fall newer material
Fixes
Postponed for next time