CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

Winter 2019-20 publication/release thread #40

Closed ctschroeder closed 4 years ago

ctschroeder commented 4 years ago

Last updated 1/1/20



Tentative Final corpora and docs:


amir-zeldes commented 4 years ago

There are a couple of other candidates to add, and maybe more will be added as we progress through the fall:

More depending on how far we get with Budge etc.

lancealanmartin commented 4 years ago

Victor docs now have valid chapter, verse, and pb_xml_id numbers. I think this is it on the annotation side unless we want to go over the pos tags more carefully.

Its metadata is complete except for version_n and version_date. I modeled the 'document_cts_urn' after victor, parts 1 and 2. @ctschroeder let me know if these look correct (no rush).

amir-zeldes commented 4 years ago

I think I'm OK with not going over POS tags again - once segmentation has been checked, tagging is typically quite high accuracy, and it seems like a better use of our time to release more texts than to worry about the last 2% of tagging errors.

The third and final document of athanasius.discourses is now up in XML mode, with automatic piping and the title soul_body.

ctschroeder commented 4 years ago

Thank you @amir-zeldes & @lancealanmartin . Please make an issue in the victor repo for this publication cycle and link to it above in the main comment on this issue thread. Otherwise the info will get lost since there are so many docs in this release. In general, everyone please post corpus specific annotation information in the corpus-specific repos and link issues. Likewise @amir-zeldes pls make a thread for Athanasius over in its corpus. You can tag us there when you're ready for more eyes for whatever.
Let's try to save this thread for things tied specifically to the release as opposed to annotation issues for particular corpora. Otherwise this thread will become very difficult to track. So to recap, my preference is that if there are new texts, previous texts with new annotations, or new docs being released, please:

  1. make an issue for the release cycle in the relevant corpus's GitHub repo
  2. add it to the tentative list at the top of this thread and link to the issue
  3. ping people in the relevant repo thread as needed with status updates/questions about your corpus
  4. Multiple linked issues are better than one big multi-tentacled thread, so with specific annotation questions or problems, feel free to make a new thread specifically for that
  5. Tag any issue with a Winter-2020-release label if you think it's relevant to this release. I will go over anything tagged with this label prior to release.

ctschroeder commented 4 years ago

(Quickly to add -- not complaining or concerned about anything posted here thus far. The work is amazing. I want to be sure none of it gets lost in a long thread. Thanks!)

amir-zeldes commented 4 years ago

Sounds fine - @lancealanmartin can you make an issue in Budge dev listing the Athanasius documents and link it in this thread?

bkrawiec commented 4 years ago

I am concerned that finishing Those (10 documents, 35 MSS pages) might be more than I can complete (classes begin Jan 13). I don't have a good sense of how long it will take to check the spreadsheets. I don't want to commit and not get it done, so I am undercommitting to GF 253-70 (4 documents), with the hope that I would also do GF 301-350 (4 documents). It would be helpful for someone else to do the two GL documents. I will not get to the known document. Carrie, if you want to start with the unknown, I can keep you updated as I progress so we can share the task of finishing this one.

ctschroeder commented 4 years ago

Thanks all. I have updated the threads in each corpus with checklists and assignments. Re timeline: @cluckmarq, @eplatte, and @bkrawiec all are on board with a January 31 deadline for annotations. So @amir-zeldes & @lancealanmartin I propose the following timeline:

I'm allowing a lot of time in February because there will be a lot of docs (hopefully) to review, and that will take some time. If this works, you can simply "thumbs up" the comment. If it doesn't work for you, please chime in and we can modify. Thank you!

And for Marcion docs, if you need me to review metadata, chapter numbers, etc. please assign the docs to me in GitHub as you go and ping me in the threads by my github handle. Thank you!!!

ctschroeder commented 4 years ago

Hi. Following up on @amir-zeldes' comment above about checking/not checking POS/lemma/lang tags: I have noticed some significant errors in POS tagging after segmentation is checked, especially around CPRET, NEG, CFOC, ACONJ. There are also lemmatization issues, though fewer.

Whatever one decides to do, just please be sure each document's metadata indicates the level of checking: automatic, checked, or gold. So if segmentation has been checked/modified but pos/lemma/lang have not been modified, then please set segmentation=checked, tagging=automatic.

Thanks so much!
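The checking levels described above can be illustrated with a small check. This is only a sketch: it assumes each document's metadata is available as a plain Python dict, which is not how GitDox necessarily exposes it, and the function name `check_levels` is hypothetical. The field names `segmentation` and `tagging` and the three levels come from the comment above.

```python
# Sketch only: metadata-as-dict is an assumption, not the project's actual format.
ALLOWED_LEVELS = {"automatic", "checked", "gold"}
LEVEL_FIELDS = ("segmentation", "tagging")


def check_levels(metadata):
    """Return a list of problems found in one document's metadata dict."""
    problems = []
    for field in LEVEL_FIELDS:
        value = metadata.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif value not in ALLOWED_LEVELS:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems


# A document whose segmentation was checked but whose tags were left as-is.
print(check_levels({"segmentation": "checked", "tagging": "automatic"}))
```

An empty list means both fields carry one of the three expected levels.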

amir-zeldes commented 4 years ago

Yes, exactly - tagging should be 'automatic' in these cases. In terms of error likelihood, CFOC is one of the hardest to predict, and CPRET is comparatively easy. NEG and ACONJ are in the middle, I would say. If you anticipate a lot of CFOC in a text, it's worth searching through the CCIRC tags to verify (confusion with the past tense CREL is more rare).
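A rough sketch of the kind of search suggested above, pulling out every token tagged CCIRC so it can be looked at by hand. It assumes tokens are available as (form, POS tag) pairs, which is an illustrative representation rather than the project's actual data format; the helper name `tokens_to_review` and the placeholder forms are hypothetical.

```python
def tokens_to_review(tokens, tag="CCIRC"):
    """Return (index, form) for every token carrying the given POS tag."""
    return [(i, form) for i, (form, pos) in enumerate(tokens) if pos == tag]


# Placeholder forms, not real corpus data.
sample = [("w1", "CCIRC"), ("w2", "N"), ("w3", "CCIRC")]
for index, form in tokens_to_review(sample):
    print(index, form)
```

The function only lists candidates; deciding whether each one should really be CFOC is still a manual step.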

ctschroeder commented 4 years ago

Hello! I think I've got the rundown:

@amir-zeldes the Johannes corpus
@lancealanmartin shenoute.those (it's 6 docs -- let me know if that's going to be too much)
@cluckmarq shenoute.unknown5_1 (2 docs)
@bkrawiec shenoute.seeks (1 longer doc)
@ctschroeder athanasius + pachomius corpora (metadata only) + victor

When you do the review, can I ask you to:

version_n is 3.1.0

Please let me know if you have any concerns about these assignments. I have not gone through all the Shenoute material to check chapters and cts urns. I'll do that tomorrow and/or Fri.

@amir-zeldes I have not assigned anything a version_date yet. Do you have thoughts?

Last: @amir-zeldes I see cyrus.01, mark_01, a22.YA421-428 are all marked "review". Do they need review of annotations or just metadata? I believe they are all previously published.

amir-zeldes commented 4 years ago

I can auto assign the date once we know it to all documents with version_n 3.1.0 - it's important to update the version_n field though, and just leave some value in there (-- works, or if it's an older value that's fine too, as long as it's something)
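As a rough illustration of the auto-assignment idea, here is a minimal sketch. It assumes document metadata is held as a list of dicts, which is not necessarily how GitDox stores it; the function name `assign_version_date` and the date used in the example are hypothetical placeholders.

```python
def assign_version_date(documents, version_n, version_date):
    """Set version_date on every document whose version_n matches.

    Returns the number of documents updated.
    """
    updated = 0
    for doc in documents:
        if doc.get("version_n") == version_n:
            doc["version_date"] = version_date
            updated += 1
    return updated


docs = [
    {"name": "doc_a", "version_n": "3.1.0", "version_date": "--"},
    {"name": "doc_b", "version_n": "3.0.0", "version_date": "--"},
]
# "2020-01-01" is a placeholder, not the actual release date.
print(assign_version_date(docs, "3.1.0", "2020-01-01"))
```

This is why updating version_n matters: only documents already marked 3.1.0 are picked up, and anything left at an older version number is skipped.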

As for Cyrus/Mark/A22, this is a result of our policy to set status to 'review' for anything that has been edited since the last release (e.g. for sporadic errors found). It just means their version number should be bumped (maybe we want to think of a special status for this, like 'modified' or something)

ctschroeder commented 4 years ago

Ok thanks for this clarification. I’ll assign them to myself to look at the metadata. I would rather not add another status, actually, without reviewing all our current status levels, since they seem to have proliferated a bit and there are some overlaps.

That’s great news about auto populating the version date! Yay!

ctschroeder commented 4 years ago

Hello editors @amir-zeldes @lancealanmartin @cluckmarq @bkrawiec Just a quick reminder that our goal is to finish the editorial review this week (or hopefully by Monday). If you need more time, please LMK. Amir can probably get started on some corpora while you're finishing. Thanks for all your work! You can reply here or send me an email.

bkrawiec commented 4 years ago

I am aiming for Monday.


On Feb 27, 2020, at 3:41 PM, Caroline T. Schroeder wrote:

Hello editors @amir-zeldes @lancealanmartin @cluckmarq @bkrawiec Just a quick reminder that our goal is to finish the editorial review this week (or hopefully by Monday). If you need more time, please LMK. Amir can probably get started on some corpora while you're finishing. Thanks for all your work! You can reply here or send me an email.

amir-zeldes commented 4 years ago

I'm still working on that last Johannes (others are done)

lancealanmartin commented 4 years ago

I am finished with shenoute.those.

cluckmarq commented 4 years ago

sorry to be slow, i'm aiming for monday as well.

cluckmarq commented 4 years ago

fyi, should be done with my texts around noonish est

ctschroeder commented 4 years ago

great. thanks Christie!

ctschroeder commented 4 years ago

@amir-zeldes Victor and Aphou are ready. Several more corpora should be ready tonight.

amir-zeldes commented 4 years ago

Fantastic, I'm done with review of Johannes, so I will hopefully find time to go over the overall GitDox validation situation soon.

ctschroeder commented 4 years ago

ok checking them off as I go through. Stuck a bit on some of the Marcion material but will go check again tonight/tomorrow to see where we are. Plenty for you to work on in the meantime tho :)

ctschroeder commented 4 years ago

Great job everyone!!!