CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

early 2024 release #99

Closed ctschroeder closed 3 weeks ago

ctschroeder commented 5 months ago

Timeline

- Sentence splitting script completed by early March
- Initial corpora deadline March 31??
- Data freeze April 15 ish??

Version information

~Currently labeling files as v 4.6.0 but we can shift to 5.0 if we release Bohairic and deem it a major revision~

version is 5.0.0

Currently using 2024-04-01 as version date but ~can~ should be changed

Corpora

Possibly:

edited corpora (not new)

Potential division of releases spring/fall: Spring release material hanging over from last year; fall newer material

Fixes

Postponed for next time

ctschroeder commented 3 months ago

@LCBM0828 as discussed, the items on the top are top priority. If there is an OCR'd document with full-auto NLP and good metadata that would be easy to figure out versification for, you can get that ready. Please add it to the list, make a release thread in the relevant repo, and link the release thread to this list. Thanks! (for full auto: editorial review is basically metadata, validation, and versification issues.)

ctschroeder commented 3 months ago

@amir-zeldes

The other Marcion material may need to wait because the metadata requires some research, but I'll try. I put them lower on the list because they require more research than the others.

amir-zeldes commented 3 months ago

Done - it's Mark and 1 Cor (all 16 chapters of both are completely reviewed for segmentation and partly treebanked/tagging checked).

ctschroeder commented 3 months ago

@amir-zeldes we need to decide on a version number and date. It is going to take me 2-3 weeks from now I think to actually finish reviewing everything and add all the metadata. What do you think about the date? And should the version number be 4.6 or 5.0 (since we will be releasing Bohairic?)

ctschroeder commented 3 months ago

> Done - it's Mark and 1 Cor (all 16 chapters of both are completely reviewed for segmentation and partly treebanked/tagging checked).

@amir-zeldes great thanks. What do you think about the URNs for Bohairic? :) will the edition namespace be sufficient?

urn:cts:copticLit:ot.ruth.coptot:1 is the urn for the Goettingen Coptic OT project's edition, which is a mishmash. "coptot" is in the edition namespace to indicate that. Marcion's NT is from Horner so we would have Horner_ed in the edition namespace. Does that sound ok?
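To make the namespaces under discussion concrete, here is a minimal sketch of how the URN above breaks down into its CTS components. The class and function names (`CtsUrn`, `parse_cts_urn`) are hypothetical and not part of any project tooling; the parser assumes the simple `urn:cts:namespace:textgroup.work.version:passage` shape and ignores subreferences and exemplars.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Components of a simple CTS URN (hypothetical helper type)."""
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]   # the edition/translation slot, e.g. "coptot"
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy, and passage."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    work_parts = parts[3].split(".")
    return CtsUrn(
        namespace=parts[2],
        textgroup=work_parts[0],
        work=work_parts[1] if len(work_parts) > 1 else None,
        version=work_parts[2] if len(work_parts) > 2 else None,
        passage=parts[4] if len(parts) > 4 and parts[4] else None,
    )


ruth = parse_cts_urn("urn:cts:copticLit:ot.ruth.coptot:1")
print(ruth.namespace, ruth.textgroup, ruth.work, ruth.version, ruth.passage)
```

Under this reading, the edition name sits in the third dot-separated slot of the work component, which is where a value like Horner_ed would go.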

amir-zeldes commented 3 months ago

I think 5.0 is appropriate. We can have a place holder for the date - as long as it's a unique string I can batch replace it in the database for all docs, though we would need to recommit everything once it's finalized. Alternatively we can aim for April 30 or something?
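The placeholder idea can be sketched roughly as follows. This is only an illustration, not the actual database procedure: the placeholder string, the `version_date` key, and the helper name are assumptions, and documents are modeled as plain dictionaries.

```python
# Hypothetical placeholder; any string that cannot occur elsewhere would do.
PLACEHOLDER = "VERSION_DATE_PLACEHOLDER"


def fill_version_date(docs, final_date, placeholder=PLACEHOLDER):
    """Return copies of the metadata dicts with the placeholder replaced.

    Only string values are touched; the input dicts are left unmodified.
    """
    updated = []
    for doc in docs:
        updated.append({
            key: value.replace(placeholder, final_date)
            if isinstance(value, str) else value
            for key, value in doc.items()
        })
    return updated


docs = [
    {"title": "doc A", "version_n": "5.0.0", "version_date": PLACEHOLDER},
    {"title": "doc B", "version_n": "5.0.0", "version_date": PLACEHOLDER},
]
# "2024-04-30" is used here only as an example date.
released = fill_version_date(docs, "2024-04-30")
print([d["version_date"] for d in released])
```

Because the placeholder is unique, a single pass like this is safe to run over every document, but the results would still need to be recommitted afterwards.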

For the URNs it's not so easy... I could see an argument for a top level distinction, but then we should have done copticSahidicLit and copticBohairicLit, and it's a bit late for that. As it stands, there will be no good CTS way to capture all of the Bohairic data in the future. Using horner_ed for Horner sounds fine to me, I'm just a bit unhappy there's no indication of the dialect in the CTS hierarchy...

ctschroeder commented 3 months ago

> For the URNs it's not so easy... I could see an argument for a top level distinction, but then we should have done copticSahidicLit and copticBohairicLit, and it's a bit late for that. As it stands, there will be no good CTS way to capture all of the Bohairic data in the future. Using horner_ed for Horner sounds fine to me, I'm just a bit unhappy there's no indication of the dialect in the CTS hierarchy...

Yes, I understand. I think we went with semantic URNs since the field is such a mess compared to Greek. Then there was also the discussion launched by Frank Kammerzell back at Digital Coptic 1 in Berlin about the wisdom of categorizing by dialect to begin with. So in retrospect I understand why we did what we did. It is hard to anticipate all issues.

There are other URN systems we can apply in addition to CTS, like CITE2, though their main website seems to be down. I suppose we can email them -- we know them.

We will at least have "bohairic" in other metadata.

On that note, we do not have a "dialect" (or "assigned dialect") field in the document metadata -- should we be adding that to all our documents?

amir-zeldes commented 2 months ago

OK, sounds fine, I ultimately use the metadata for a lot of filtering like that, for example to train tools only on gold or also checked data, mix data for augmentation, etc.

We don't have dialect, but language has a value like "Sahidic Coptic", so I assumed we would be filling "Bohairic Coptic" there instead.
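As a rough illustration of this kind of metadata filtering, the sketch below selects documents by the `language` value and by review status. The `status` key and its values ("gold", "checked", "auto") are assumptions made for the example, as are the document names; only the `language` field and its "Sahidic Coptic" / "Bohairic Coptic" values come from the discussion above.

```python
def select_docs(docs, language, statuses):
    """Keep documents whose language matches and whose status is allowed."""
    return [
        doc for doc in docs
        if doc.get("language") == language and doc.get("status") in statuses
    ]


# Toy metadata records; names and statuses are made up for illustration.
docs = [
    {"name": "doc1", "language": "Sahidic Coptic", "status": "gold"},
    {"name": "doc2", "language": "Bohairic Coptic", "status": "checked"},
    {"name": "doc3", "language": "Bohairic Coptic", "status": "auto"},
]

# Train only on reviewed Bohairic data, i.e. gold or checked.
bohairic_reviewed = select_docs(docs, "Bohairic Coptic", {"gold", "checked"})
print([d["name"] for d in bohairic_reviewed])
```

With the dialect recorded in `language`, no separate dialect field is needed for this kind of selection.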

cluckmarq commented 2 months ago

I won't have any AP in this release. Sorry all!

ctschroeder commented 2 months ago

@EarlyCodices & @amir-zeldes Please see https://github.com/CopticScriptorium/bible-dev/issues/74 re 1Cor

ctschroeder commented 2 months ago

@amir-zeldes & @EarlyCodices Please see https://github.com/CopticScriptorium/bible-dev/issues/75 re Mark

ctschroeder commented 2 months ago

@amir-zeldes This Great House is ready. Both docs validate but we need a new release date. Please pick one for the version_date in doc and corpus metadata and lmk

ctschroeder commented 2 months ago

@amir-zeldes Witness ready

ctschroeder commented 2 months ago

@amir-zeldes shenoute.crushed ready

ctschroeder commented 2 months ago

@amir-zeldes a22 ready except for identities in two docs see here

amir-zeldes commented 2 months ago

I had a look, all done - see my comments in the other thread

amir-zeldes commented 2 months ago

@ctschroeder do you know which new AP are slated for release this time? A lot of them have nothing checked in the AP issues so I'm not sure which if any we're releasing?

ctschroeder commented 2 months ago

@amir-zeldes working on it :)

amir-zeldes commented 2 months ago

OK, I won't touch those yet then, just setting up my staging area for the release...

ctschroeder commented 2 months ago

@amir-zeldes AP corpus is ready

ctschroeder commented 2 months ago

@amir-zeldes So Listen is ready

amir-zeldes commented 2 months ago

What about abraham? There have been some edits and I usually release all corrections to existing corpora on each release, but there are different document statuses and assignments there. Can I just take a snapshot of what's there now and call it 5.0.0? It's fine if there are still future corrections pending, as long as what's there is valid.

amir-zeldes commented 2 months ago

@ctschroeder Another question: page numbers in errs have square brackets like XG[336] which doesn't validate; can we take out the brackets?

amir-zeldes commented 2 months ago

Also, house is missing URN, witness and Trismegistos (is it none?)

ctschroeder commented 2 months ago

@amir-zeldes I see these questions and will get to them tonight or tomorrow. I have one more administrative deadline today

ctschroeder commented 2 months ago

> What about abraham? There have been some edits and I usually release all corrections to existing corpora on each release, but there are different document statuses and assignments there. Can I just take a snapshot of what's there now and call it 5.0.0? It's fine if there are still future corrections pending, as long as what's there is valid.

@bkrawiec you were working on YA 525-30 in Naples. Is there anything else you need to do?

bkrawiec commented 2 months ago

@amir-zeldes @ctschroeder I don't think anything I was doing in Naples was meant for this release, since I thought the deadline for this release was April 1. I would have to go back and recall everything--I don't want to hold things up.

ctschroeder commented 2 months ago

@bkrawiec no worries. Whatever you did is rolling forward into this release anyway bc we have corrections to that corpus. If you have more edits based on your work in Naples, those can go in a later release.

ctschroeder commented 2 months ago

@amir-zeldes abraham is ready. I don't know what was going on with ZH that the version n & date were missing but it's fixed

ctschroeder commented 2 months ago

> @ctschroeder Another question: page numbers in errs have square brackets like XG[336] which doesn't validate; can we take out the brackets?

The brackets are there because the page numbers are not visible but the pages' locations have been reconstructed based on paleography. I'm almost positive we have done this with some of the johannes corpus, as well.

Is it that it doesn't validate according to our rules or it really cannot validate in TEI, PAULA, etc, XML?

ctschroeder commented 2 months ago

> Also, house is missing URN, witness and Trismegistos (is it none?)

@amir-zeldes I am not understanding the question. I've checked the documents in gitdox and the document history in GitHub and both shenoute.house docs have had urns, witness, and TM fields filled since I last committed them. There is an older doc in GitHub because one was misnamed, but I thought you processed the release from GitDox not GitHub

amir-zeldes commented 2 months ago

> Is it that it doesn't validate according to our rules or it really cannot validate in TEI, PAULA, etc, XML?

If it's an xml:id then I believe it cannot validate in TEI, and that's how our data model has worked so far. This does not affect PAULA, since key names there are actually XML values, or SGML, which has no value pattern restrictions except for mandatory escaping. So basically: either we replace the brackets with something else (underscores or something?) or we change the data model and rename the page number attribute to something other than xml:id.
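A minimal sketch of the first option, assuming the page number is stored as an xml:id: xml:id values must be XML NCNames, which cannot contain square brackets. The check below is an ASCII-only approximation of the NCName rule (the real rule also admits many non-ASCII characters), and both function names are hypothetical.

```python
import re

# ASCII-only approximation of an XML NCName: a letter or underscore,
# followed by letters, digits, underscores, hyphens, or periods.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_valid_xml_id(value: str) -> bool:
    """Return True if value matches the approximate NCName pattern."""
    return NCNAME_ASCII.fullmatch(value) is not None


def replace_brackets(value: str, filler: str = "_") -> str:
    """Replace square brackets with a filler character (underscore here)."""
    return re.sub(r"[\[\]]", filler, value)


page = "XG[336]"
print(is_valid_xml_id(page))                    # brackets are not allowed
print(replace_brackets(page))                   # "XG_336_"
print(is_valid_xml_id(replace_brackets(page)))  # valid after replacement
```

Keeping a filler character where the brackets were, rather than deleting them, preserves a visible marker that the page number is reconstructed.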

amir-zeldes commented 2 months ago

> both shenoute.house docs have had urns, witness, and TM fields

Oh whoops, I meant shenoute.errs.XG336-343, got the corpora mixed up, sorry!

ctschroeder commented 2 months ago

That corpus is not checked off and the doc is not listed as to_publish, so no, it’s not ready.

amir-zeldes commented 2 months ago

OK, but it's intended to be included in this batch still right?

ctschroeder commented 2 months ago

Yeah there was just a lot to review. I’m halfway done — hope to be done tonight.

amir-zeldes commented 2 months ago

Great, no rush!

ctschroeder commented 2 months ago

@amir-zeldes I am still finishing Errs, need to check the metadata for PS as per our email.

I am also waiting for you to let me know that Bohairic 1 Cor and Mark are ready for me to check the metadata one last time.

Plus other issues under Fixes & edited corpora.

ctschroeder commented 2 months ago

@amir-zeldes

- errs is done
- gold Ruth is done (but see my note in #103)
- have checked corpus metadata for all docs listed to_publish in Gitdox; all are good except:

ctschroeder commented 2 months ago

Oops, I have to add: I did not realize you were releasing the Treebank corpus, @amir-zeldes. Did this issue for the treebank get addressed before release?

amir-zeldes commented 2 months ago

Hi! Quick answers:

ctschroeder commented 2 months ago

@amir-zeldes 1 Cor is ready

ctschroeder commented 2 months ago

@amir-zeldes Mark ready. All docs of both 1Cor and Mark need reimporting

ctschroeder commented 1 month ago

@amir-zeldes for Pistis Sophia: the first three documents are good to go. You can use the metadata in part 4 of Book 1 in all the rest moving forward (if all the rest have the Mead translation from gnosis.org -- if not, lmk). Please do not use the metadata in Part 4 for Parts 1-3 of Book 1. Thanks!

amir-zeldes commented 1 month ago

OK, we should be all set! Please take a look and if everything looks OK we should be able to announce the release now!

ctschroeder commented 1 month ago

Sorry this got buried in a wave of email. Everything I checked looks good.

amir-zeldes commented 1 month ago

Great - shall we announce the release? I'm at a conference this week, so I'm happy to let you, Lydia, or Nick do it!