Publication thread summer/fall 2024 OCR documents

ctschroeder commented 2 months ago

Do not close this issue until all checkboxes below are complete or have been rescheduled:

List of corpora:

In Processed OCR folder (needs sentence splitting + full automatic NLP processing like the Bible corpora)

[x] Budge material (3 docs)
- [x] chapter divisions added/checked
- [x] metadata updated
[ ] Giron Legendes (11 docs)
- [ ] chapter divisions added/checked
- [ ] metadata updated
[ ] Lacau Apocrypha (2 docs)
- [ ] chapter divisions added/checked
- [ ] metadata updated
[x] Sobhy Helias (4 docs)
- [x] chapter divisions added/checked
- [x] metadata updated

In GitDox

[ ] mercurius (formerly acacius.caeasaria)
- [ ] needs entities & identities
- [ ] needs translation span
- @amir-zeldes NLP tool didn't NLP ⲫⲁⲅⲓⲟⲥ and ⲛⲫⲁⲅⲓⲟⲥ et al. correctly even though they were tokenized correctly
[ ] apocalypse.paul (2)
- [ ] corpus name needed
- [ ] other metadata updated
- possibly error in data -- translation on p. 1043 begins with folio 24a but OCR coptic begins in the middle of folio 6a p. 533; perhaps move to later
[ ] mercurius (2)
[ ] pscyril.alexandria
- [ ] On Mary still in XML mode (auto tagging?)
[ ] pscyril.jerusalem
- [ ] on the cross
  - [ ] needs corpus name
  - [ ] metadata updated
  - [ ] chapter & verse need to be updated in spreadsheet based on open tags in XML
- [ ] on Mary
  - [ ] needs corpus name
  - [ ] metadata updated
  - [ ] chapter & verse need to be updated in spreadsheet based on open tags in XML
[ ] psepiphanius
[ ] pschrysostom
- still in XML mode (auto tagging?)
[ ] pscelestinus
[ ] pstimothy.alex
[ ] psote.psoi
[ ] timothy.discourse

amir-zeldes commented 2 months ago

Do we have chapter splits for the OCR data somewhere? We can do versification using the automatic sentencer for now, but we don't really have a tool for predicting chapters.

ctschroeder commented 2 weeks ago

@amir-zeldes I am manually adding chapter divisions where there are none currently

In Processed OCR folder, everything in Budge directory should be ready.

ctschroeder commented 2 weeks ago

@amir-zeldes as you can see we have a lot of docs. I may not be able to get them all ready for October.

amir-zeldes commented 1 week ago

> I may not be able to get them all ready for October

No rush at all, it looks like we have plenty! Just one question though - I thought the ones in GD were the priority rather than the ones in the repo. Should we de-prioritize some of the GD ones or are you still planning to release all/some of those?

ctschroeder commented 1 week ago

I will get to the GitDox ones as soon as I am done with the Helias collection. I thought these would be easier (no idea why) and would give you something to test for the automatic process

ctschroeder commented 1 week ago

OK, Helias is ready @amir-zeldes. It took longer than expected for various reasons. One thing: there is a .DS_Store file in there that needs to be deleted. Moving to GitDox files next (probably tomorrow).
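For reference, stray macOS `.DS_Store` files can be located and removed with a short script. This is only a minimal sketch: the function name `remove_ds_store` and its `dry_run` flag are hypothetical and not part of any Coptic Scriptorium tooling, and the folder path is whatever directory you point it at.

```python
from pathlib import Path


def remove_ds_store(root, dry_run=True):
    """Find .DS_Store files under root; delete them unless dry_run is True.

    Hypothetical helper for illustration only. Returns the list of paths found.
    """
    found = sorted(p for p in Path(root).rglob(".DS_Store") if p.is_file())
    for path in found:
        if not dry_run:
            path.unlink()  # actually delete the file
    return found
```

Calling it with the default `dry_run=True` only lists what would be deleted, which makes it safe to check the result before removing anything.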

ctschroeder commented 5 days ago

@amir-zeldes Do all of these need translation spans? (the ones in GitDox? the ones in GitHub?)

@amir-zeldes as noted above, the NLP tool didn't process ⲫⲁⲅⲓⲟⲥ, ⲛⲫⲁⲅⲓⲟⲥ, et al. correctly in at least one Mercurius doc, even though they were tokenized correctly. The spreadsheet has "'_warn:emptynorm" in a bunch of cells where that word is.

Actually, now that I look, that warning also appears elsewhere in encomium.mercurius, in places where I really don't understand why it's there.

Do I need to manually fix all those?

ctschroeder commented 5 days ago

I think the warnings that are not about phagios come from lines that begin with a pipe where the previous line ends with an underscore. I tried to find those (sometimes they prevented NLP altogether), but I guess I missed a bunch.
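A quick way to hunt for the remaining cases might be a small line scan like the one below. It is a rough sketch, not an existing tool: `find_split_groups` is a hypothetical name, and it assumes the text is available as a list of plain-text lines. The sample lines use ASCII placeholders rather than real Coptic data.

```python
def find_split_groups(lines):
    """Return 1-based numbers of lines that start with a pipe
    immediately after a line that ends with an underscore."""
    hits = []
    for i in range(1, len(lines)):
        prev = lines[i - 1].rstrip()   # ignore trailing whitespace/newline
        curr = lines[i].lstrip()       # ignore leading whitespace
        if prev.endswith("_") and curr.startswith("|"):
            hits.append(i + 1)
    return hits


# Placeholder sample: line 2 begins with a pipe after a line ending in "_".
sample = ["aa|bb_", "|cc_dd", "ee_"]
print(find_split_groups(sample))  # [2]
```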

amir-zeldes commented 4 days ago

Hi Carrie - I've got a bunch of e-mails coming on some of these topics but quick answers:

- .DS_Store: no worries. I think we need another clean repo anyway (reasons in an upcoming e-mail), so you can ignore this for now.
- If things are getting published from the OCR repo, they don't need verses/translation, just chapters; scripts will add the rest. The ones in GitDox are a legacy pipeline. I guess we can auto-add translations as a one-off, but there's no trivial automatic way to do it if they are in spreadsheet mode. (If they're in XML, it can be done by adding at least one chapter tag; the NLP button then auto-splits and numbers within each chapter.)
- I'm trying to make the warn:empty_norm issue impossible to trip, but it's still happening. I can take a look; can you tell me which docs?

More to come!

amir-zeldes commented 2 days ago

OK, RE: mercurius, I've cleaned up the XML in doc3 and put it into a spreadsheet, and also added translation spans in all 3 docs. The warn issue was coming from groups with a leading/trailing pipe, e.g. _|ⲁ|ϥ|ⲥⲱⲧⲙ, _ⲉ|_. If metadata looks OK to you I can publish from this state in GD, entities/identities will be added automatically (though this should be reflected in the manually edited metadata in GD). I reassigned these to you in status metadata.
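For anyone checking other docs for the same pattern, a check along these lines may help. It is a minimal sketch under the assumption that bound groups are delimited by underscores (or whitespace) and that pipes separate the units inside a group; `groups_with_edge_pipes` is a hypothetical helper, not part of the NLP tools or GitDox.

```python
import re


def groups_with_edge_pipes(text):
    """Return the bound groups in text that start or end with a pipe.

    Assumes groups are separated by underscores or whitespace.
    """
    groups = [g for g in re.split(r"[_\s]+", text) if g]
    return [g for g in groups if g.startswith("|") or g.endswith("|")]


# The two patterns mentioned above are both flagged:
print(groups_with_edge_pipes("_|ⲁ|ϥ|ⲥⲱⲧⲙ_ⲉ|_"))
```

A group such as `ⲁ|ϥ|ⲥⲱⲧⲙ`, with pipes only on the inside, is not reported.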

CopticScriptorium / OCR

Publication thread summer/fall 2024 OCR documents #1