CopticScriptorium / OCR

repository for development files and documentation for Coptic OCR in development by Coptic Scriptorium
GNU General Public License v3.0

Publication thread summer/fall 2024 OCR documents #1

Open ctschroeder opened 2 months ago

ctschroeder commented 2 months ago

Do not close this issue until all checkboxes below are complete or have been rescheduled:

List of corpora:

In Processed OCR folder (needs sentence splitting+full automatic NLP processing like the bible corpora)

In GitDox

amir-zeldes commented 2 months ago

Do we have chapter splits for the OCR data somewhere? We can do versification using the automatic sentencer for now, but we don't really have a tool for predicting chapters.

ctschroeder commented 2 weeks ago

@amir-zeldes I am manually adding chapter divisions where there are none currently

In Processed OCR folder, everything in Budge directory should be ready.

ctschroeder commented 2 weeks ago

@amir-zeldes as you can see we have a lot of docs. I may not be able to get them all ready for October.

amir-zeldes commented 1 week ago

> I may not be able to get them all ready for October

No rush at all, it looks like we have plenty! Just one question though - I thought the ones in GD were the priority rather than the ones in the repo. Should we de-prioritize some of the GD ones or are you still planning to release all/some of those?

ctschroeder commented 1 week ago

I will get to the GitDox ones as soon as I am done with the Helias collection. I thought these would be easier (no idea why) and would give you something to test for the automatic process

ctschroeder commented 1 week ago

ok Helias is ready @amir-zeldes. It took longer than expected for various reasons. One thing -- there is a .DS_Store file in there that needs to be deleted. Moving to GitDox files next (prob tomorrow)
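In case it helps with the cleanup, here is a minimal sketch for finding and removing stray `.DS_Store` files under a folder. The function names and the idea of passing the collection folder as `root` are assumptions for illustration, not part of any existing Coptic Scriptorium tooling.

```python
from pathlib import Path


def find_ds_store(root):
    """Return every .DS_Store file under root, sorted by path."""
    return sorted(p for p in Path(root).rglob(".DS_Store") if p.is_file())


def delete_ds_store(root):
    """Delete every .DS_Store file under root and return how many were removed."""
    found = find_ds_store(root)
    for path in found:
        path.unlink()
    return len(found)
```

Calling `find_ds_store` first gives a chance to review the list before anything is deleted.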

ctschroeder commented 5 days ago

@amir-zeldes Do all of these need translation spans? (the ones in GitDox? the ones in GitHub?)

@amir-zeldes note: the NLP tool didn't process ⲫⲁⲅⲓⲟⲥ and ⲛⲫⲁⲅⲓⲟⲥ et al. correctly even though they were tokenized correctly in at least one Mercurius doc. The spreadsheet has "'_warn:emptynorm" in a bunch of cells where that word is.

Actually, now that I look, that warning appears elsewhere in encomium.mercurius, in places where I really don't understand why it's there.

Do I need to manually fix all those?

ctschroeder commented 5 days ago

I think the warnings that are not about phagios come from lines that begin with a pipe where the previous line ends with an underscore. I tried to find those (sometimes they prevented NLP altogether), but I guess I missed a bunch
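A rough sketch of how those spots could be located automatically rather than by eye. It assumes the tokenized text is available as a list of lines; `find_pipe_after_underscore` is a hypothetical helper name, not an existing tool.

```python
def find_pipe_after_underscore(lines):
    """Return 1-based numbers of lines that begin with a pipe
    when the previous line ends with an underscore."""
    hits = []
    for i in range(1, len(lines)):
        prev = lines[i - 1].rstrip()
        cur = lines[i].lstrip()
        if prev.endswith("_") and cur.startswith("|"):
            hits.append(i + 1)  # report the line that starts with the pipe
    return hits
```

For a file on disk, something like `find_pipe_after_underscore(open(path, encoding="utf-8").read().splitlines())` would list the line numbers to check by hand.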

amir-zeldes commented 4 days ago

Hi Carrie - I've got a bunch of e-mails coming on some of these topics but quick answers:

More to come!

amir-zeldes commented 2 days ago

OK, RE: mercurius, I've cleaned up the XML in doc3 and put it into a spreadsheet, and also added translation spans in all 3 docs. The warn issue was coming from groups with a leading/trailing pipe, e.g. _|ⲁ|ϥ|ⲥⲱⲧⲙ, _ⲉ|_. If metadata looks OK to you I can publish from this state in GD, entities/identities will be added automatically (though this should be reflected in the manually edited metadata in GD). I reassigned these to you in status metadata.
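For future docs, a check along these lines might catch such groups before they reach the spreadsheet. This is only a sketch under one assumption: that underscores (or whitespace) separate groups and pipes separate the units inside a group, as in the examples above. `find_bad_groups` is a hypothetical name, not part of the existing NLP pipeline.

```python
import re


def find_bad_groups(text):
    """Return groups that start or end with a pipe.

    Assumes groups are delimited by underscores or whitespace."""
    groups = [g for g in re.split(r"[_\s]+", text) if g]
    return [g for g in groups if g.startswith("|") or g.endswith("|")]
```

Running it over each document's tokenized text and reviewing any non-empty result would flag the same pattern that triggered the warning here.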