Open ctschroeder opened 2 months ago
Do we have chapter splits for the OCR data somewhere? We can do versification using the automatic sentencer for now, but we don't really have a tool for predicting chapters.
@amir-zeldes I am manually adding chapter divisions where there are none currently
In Processed OCR folder, everything in Budge directory should be ready.
@amir-zeldes as you can see we have a lot of docs. I may not be able to get them all ready for October.
> I may not be able to get them all ready for October
No rush at all, it looks like we have plenty! Just one question though - I thought the ones in GD were the priority rather than the ones in the repo. Should we de-prioritize some of the GD ones or are you still planning to release all/some of those?
I will get to the GitDox ones as soon as I am done with the Helias collection. I thought these would be easier (no idea why) and would give you something to test for the automatic process
OK, Helias is ready @amir-zeldes. It took longer than expected for various reasons. One thing: there is a .DS_Store file in there that needs to be deleted. Moving to GitDox files next (prob tomorrow)
@amir-zeldes Do all of these need translation spans? (The ones in GitDox? The ones in GitHub?)
@amir-zeldes note re: the above: the NLP tool didn't process ⲫⲁⲅⲓⲟⲥ and ⲛⲫⲁⲅⲓⲟⲥ et al. correctly, even though they were tokenized correctly in at least one Mercurius doc. The spreadsheet has "'_warn:emptynorm" in a bunch of cells where that word is.
Actually, now that I look, that warning also appears elsewhere in encomium.mercurius, in places where I really don't understand why it's there.
Do I need to manually fix all those?
I think the warnings that are not about phagios come from lines that begin with a pipe where the previous line ends with an underscore. I tried to find those (sometimes they prevented NLP altogether), but I guess I missed a bunch
Hi Carrie - I've got a bunch of e-mails coming on some of these topics but quick answers:
More to come!
OK, RE: mercurius, I've cleaned up the XML in doc3 and put it into a spreadsheet, and also added translation spans in all 3 docs. The warn issue was coming from groups with a leading/trailing pipe, e.g. `_|ⲁ|ϥ|ⲥⲱⲧⲙ`, `_ⲉ|_`. If metadata looks OK to you I can publish from this state in GD, entities/identities will be added automatically (though this should be reflected in the manually edited metadata in GD). I reassigned these to you in status metadata.
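For finding these before running NLP, here is a minimal sketch. It assumes underscores and whitespace separate bound groups and pipes separate the morphemes inside a group. `find_bad_groups` is a hypothetical helper for illustration, not part of the existing tools, and the input is assumed to be the plain segmented text rather than the spreadsheet.

```python
import re


def find_bad_groups(text):
    """Return (line_number, group) pairs for groups with a leading or trailing pipe."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Assumption: underscores and whitespace delimit bound groups.
        for group in re.split(r"[_\s]+", line):
            # A pipe at either edge means an empty morpheme slot in the group.
            if group.startswith("|") or group.endswith("|"):
                bad.append((lineno, group))
    return bad


if __name__ == "__main__":
    sample = "ⲁ|ϥ|ⲥⲱⲧⲙ_ⲉ|ⲡ|ⲛⲟⲩⲧⲉ\n_|ⲁ|ϥ|ⲥⲱⲧⲙ_ⲉ|_"
    for lineno, group in find_bad_groups(sample):
        print(f"line {lineno}: {group}")
```

Because each line is checked on its own, this should also catch the case where a line begins with a pipe and the previous line ends with an underscore.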
Do not close this issue until all checkboxes below are complete or have been rescheduled:
List of corpora:
In Processed OCR folder (needs sentence splitting+full automatic NLP processing like the bible corpora)
In GitDox