PerseusDL / canonical-latinLit

XML Canonical resources for Latin Literature
https://scaife.perseus.org
Creative Commons Attribution Share Alike 4.0 International

(global) Set of OCR errors — invalid Latin words #335

Closed PonteIneptique closed 3 years ago

PonteIneptique commented 5 years ago

Hi there! On Twitter, I saw Logeion sharing this feedback from a reader: https://twitter.com/LogeionGkLat/status/1187039402142371840

There are a few turn->tum errors and at least one earn->eam error. The tweet seems to refer to more kinds of errors, though.

PonteIneptique commented 5 years ago

The same error can be found in the following files (excluding notes):

https://github.com/PerseusDL/canonical-latinLit/blob/c1c36f48ed4a6bd0bc87618103e18391e0d1d512/data/phi1017/phi011/phi1017.phi011.perseus-lat2.xml#L230

https://github.com/PerseusDL/canonical-latinLit/blob/c1c36f48ed4a6bd0bc87618103e18391e0d1d512/data/phi0474/phi046/phi0474.phi046.perseus-lat1.xml#L87

I am probably gonna propose a PR, using search and replace with a filter...
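A minimal sketch of what such a filtered search and replace could look like. The names `REPLACEMENTS`, `NOTE_RE` and `fix_ocr` are hypothetical, and the filter is assumed to mean "leave `<note>` elements alone"; this is not the script actually used for the fix.

```python
import re

# Hypothetical mapping of OCR misreadings to the intended forms (from this thread).
REPLACEMENTS = {"turn": "tum", "earn": "eam"}

# Whole words only, so "turn" inside a longer token such as "turnus" is untouched.
WORD_RE = re.compile(r"\b(" + "|".join(REPLACEMENTS) + r")\b")

# Assumed filter: anything inside <note>...</note> is skipped.
# Nested or self-closing <note> elements are not handled by this simple pattern.
NOTE_RE = re.compile(r"(<note\b.*?</note>)", re.DOTALL)


def fix_ocr(xml_text):
    """Replace the listed OCR errors outside <note> elements.

    Returns the corrected text and the number of replacements made.
    """
    count = 0

    def repl(match):
        nonlocal count
        count += 1
        return REPLACEMENTS[match.group(1)]

    # split() with a capturing group alternates: outside note, note, outside note, ...
    parts = NOTE_RE.split(xml_text)
    for i in range(0, len(parts), 2):  # even indices are outside notes
        parts[i] = WORD_RE.sub(repl, parts[i])
    return "".join(parts), count
```

Only lowercase forms are matched here, and English text elsewhere in a file (for example in the header) may legitimately contain "turn" or "earn", so every hit would still need a manual check.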

helmadik commented 5 years ago

I tested for the string rn in unparsed words. The issues originally reported included dignitatern and turn, and also, unfortunately, priris for primis. Since the cluster rn doesn't occur at the beginning or end of a word, that also gives you a way to find a lot of them.
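As a rough illustration of the word-edge test, here is a small sketch that collects tokens starting or ending with rn. `EDGE_RN` and `suspicious_words` are made-up names, and the input is assumed to be plain text with the markup already stripped.

```python
import re

# Tokens that start or end with "rn". Per the observation above, the cluster
# is not expected at either edge of a Latin word, so hits are OCR suspects.
EDGE_RN = re.compile(r"\b(?:rn\w+|\w+rn)\b")


def suspicious_words(text):
    """Return the sorted set of lowercased tokens that start or end with 'rn'."""
    return sorted({m.group(0).lower() for m in EDGE_RN.finditer(text)})
```

For example, `suspicious_words("dignitatern turn rnagna cernit eam")` returns `['dignitatern', 'rnagna', 'turn']`; a word-internal rn as in cernit is not flagged.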

lcerrato commented 5 years ago

@helmadik @PonteIneptique Yes, there are extensive OCR issues in the Latin and Greek texts. There are lots of issues with that early Latin OCR batch (as well as the Greek). It was an in-house experiment with commercial OCR. There was supposed to be a run through Morpheus for invalid forms when these texts were entered, but I see plenty of evidence that this didn't always happen.

When we complete CTS conversion (I am not doing so in Latin presently) there is more of a read-through wherein we (I) spot these types of things, and then I usually search on them globally.

In the meantime, I am working on issues on the Greek side, and there have been lots of corrections to texts in the Scaife Viewer, for which I have regular penpals.

Adding to this issue with a list of things to look for would be appreciated. A pull request is not ideal in the present workflow.

lcerrato commented 5 years ago

I should also add that I keep typo issues open in both repos in order to track this sort of thing. And thanks for the heads up.

PonteIneptique commented 5 years ago

In the context of my PhD, it would make sense for me to fix these, and I would be happy to. I would also be happy to separate the concerns (turn in one PR, earn in another) and check them manually. Would you disagree with that? It's just that a clean corpus is definitely important, and since I use GitHub releases for citability, this matters to me. I'd like to avoid forking and citing my own version if I can. Anyway, thanks to you both :)

helmadik commented 5 years ago

-rnus for 1pl -mus also showed up a number of times. I attach a screenshot of some of my edit actions in case they are helpful. [image: screenshot of edit actions]

helmadik commented 5 years ago

@PonteIneptique another place to start could be the handful of files where a correction was made and the original OCR error was preserved in the XML. Searching for <corr I get the following (including one English translation where several 'arguments' have been replaced with 'argumentis' :-)) [image: search results for <corr]
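To show how those preserved corrections could be pulled out programmatically, here is a sketch using Python's standard `xml.etree.ElementTree`. It assumes the original reading sits in a `sic` attribute on `<corr>`; that encoding is an assumption, since files may instead use separate `<sic>` and `<corr>` elements, in which case the first item of each pair is simply `None`. `list_corrections` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def list_corrections(xml_string):
    """Return (sic, corrected) pairs for every <corr> element in the document."""
    root = ET.fromstring(xml_string)
    pairs = []
    for el in root.iter():
        # Ignore a namespace prefix such as {http://www.tei-c.org/ns/1.0}.
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "corr":
            corrected = "".join(el.itertext()).strip()
            # Assumed encoding: original reading in a "sic" attribute (may be absent).
            pairs.append((el.get("sic"), corrected))
    return pairs
```

The resulting pairs could seed a list of error patterns to search for in the files that were never corrected. Files that rely on a DTD for entity definitions may not parse with this approach.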

PonteIneptique commented 5 years ago

I think I'll limit myself to turn, earn and \Wrn, as they are unambiguous. As for the others, we'll see another time, but I am noting them :)

lcerrato commented 5 years ago

Hi @PonteIneptique

I understand your concern. But longstanding (10- or 20-year-old) typographical errors are less of a priority than the serious conversion errors, irregularities, and omissions that I am sorting through on a daily basis.

I would really prefer not to take any large PRs right now. Single-text PRs are preferable given what I am trying to accomplish with the regularization of the collection.

I am happy to make global corrections to any patterns you have noticed as above (turn and earn) and feel free to add others here.

Once I am caught up with the backlog of stuff I will take a look at this and update here.

I should add that, in the long term, the plan has always been to add more editions and use the new data to cross-check the existing data, but conversion needs to be accomplished first in order to get more of this data out there for reuse. The priority is more editions and more (sometimes raw) data.

PonteIneptique commented 5 years ago

Thanks a lot @lcerrato :) By the way, in some of the diffs, I noticed some more issues with hyphenation ( https://github.com/PerseusDL/canonical-latinLit/blob/6b71b93870f9a9d05f1ce9a50a5f90a15d55e737/data/phi0474/phi042/phi0474.phi042.perseus-lat1.xml#L141 ), which seem easy to fix as the line breaks were kept in there :)
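A possible sketch of that fix, assuming the broken words appear as a fragment, a hyphen at the end of a line, and the continuation at the start of the next line. The exact markup at the linked line is not reproduced here, and `HYPHEN_BREAK` and `dehyphenate` are hypothetical names.

```python
import re

# A word fragment, a hyphen at end of line, then the continuation on the next line.
HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def dehyphenate(text):
    """Join words split by an end-of-line hyphen, dropping that line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)
```

Hyphens within a line are left alone, but an end-of-line hyphen that is genuinely part of the text would be joined too, so the output would still need a read-through.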

lcerrato commented 3 years ago

fixed via #339