brobertson / Lace2

In-browser OCR editing program that transforms OCR results into structured, citable TEI. No XML experience required!
http://trylace.org
GNU General Public License v3.0
27 stars, 2 forks

some exports dropping capital letters #141

Closed lcerrato closed 3 years ago

lcerrato commented 3 years ago

Just a heads up: I've spotted a few recent texts where the initial capital letter is missing from a word. This may be specific to the texts I'm working on, but I've seen it in two different volumes this week.

brobertson commented 3 years ago

Hi, Lisa. Thanks! If you could attach some XML from the affected texts here, that would be great. Is it possible that these have been run through a script that removes Latin-script letters? In that case, an errant Latin capital E standing in for a Greek capital epsilon would get stripped.
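To illustrate the scenario above, here is a minimal sketch of how one might flag words that mix Greek and Latin letters before any stripping script runs. It is not part of Lace; the function names `script_of` and `mixed_script_words` are hypothetical, and classifying characters by their Unicode name prefix is an assumption made for brevity.

```python
import unicodedata


def script_of(ch):
    """Return a coarse script label based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "greek"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def mixed_script_words(text):
    """Return the whitespace-separated words that contain both Greek and Latin letters."""
    hits = []
    for word in text.split():
        # Only letters are classified; digits, punctuation, and combining marks are ignored.
        scripts = {script_of(ch) for ch in word if ch.isalpha()}
        if {"greek", "latin"} <= scripts:
            hits.append(word)
    return hits
```

Any word this reports would lose a letter if Latin-script characters were removed, so it is worth reviewing by hand first.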

lcerrato commented 3 years ago

I forget which one it was yesterday because I wasn't tracking it, but I saw it in urn:cts:greekLit:tlg4026.tlg003.1st1K-grc1 today.

(It's possible tidy cleaned it up, but these were capital pi and delta, so not the type of letters that would be confused with Latin ones.)

I wasn't tracking it carefully (I will from now on), but here is one passage [a 16] εῖ δὲ παρὰ ταῦτα μηδένα ἄλλον τρόπον ἐρωτήσεων τῶν https://archive.org/details/commentariaina21pt12akaduoft/page/331/mode/1up?view=theater

lcerrato commented 3 years ago

OK, this one was definitely tidy, as the original file had

<tei:div type="textpart" subtype="1" n="urn:cts:greekLit:tlg0557.tlg004.1st1K-grc1:2"><tei:p>2. Παρὰ θεῶν μὴ συνεχῆ

versus

<div type="textpart" subtype="section" xml:base="urn:cts:greekLit:tlg0557.tlg004.1st1K-grc1" n="2">
        <p rend="indent">2.  αρὰ θεῶν μὴ συνεχῆ 

very interesting.
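A comparison like the one above can be automated. The following is a rough sketch, assuming the text content of the original and post-processed files has already been extracted into two strings; `dropped_characters` is a hypothetical helper, not something Lace or tidy provides. Whitespace is removed before comparing so that re-indentation alone is not reported.

```python
import difflib
import re


def dropped_characters(before, after):
    """List (offset, text) runs present in `before` but deleted from `after`.

    Offsets refer to `before` with all whitespace removed. Only pure
    deletions are reported; substitutions are ignored.
    """
    a = re.sub(r"\s+", "", before)
    b = re.sub(r"\s+", "", after)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            dropped.append((i1, a[i1:i2]))
    return dropped
```

Running this over each pair of pre- and post-tidy files would show whether the missing capitals are introduced by post-processing or were already absent.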

brobertson commented 3 years ago

Phew. So it is unlikely to be a Lace problem, but if you can attach the original file here, I can check whether my macOS tidy makes this error. We should file a bug against tidy for sure. However, in the long run, we can use XSLT to do all your postprocessing, including indentation, which should be more reliable.
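In the meantime, indentation can be done with a check that no characters go missing. The sketch below is not the XSLT approach mentioned above; it swaps in Python's standard library (`xml.etree.ElementTree.indent`, available from Python 3.9) purely as an illustration. The function name `indent_preserving_text` is hypothetical, and registering the `tei` prefix for the TEI namespace is an assumption about the input files.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def indent_preserving_text(source):
    """Indent an XML string and verify that no non-whitespace text was lost."""
    # Keep the familiar "tei:" prefix on output instead of a generated "ns0:".
    ET.register_namespace("tei", TEI_NS)
    root = ET.fromstring(source)
    # Compare text with all whitespace removed, since indentation adds whitespace.
    before = "".join("".join(root.itertext()).split())
    ET.indent(root, space="  ")
    after = "".join("".join(root.itertext()).split())
    if before != after:
        raise ValueError("text content changed during indentation")
    return ET.tostring(root, encoding="unicode")
```

The guard is the point of the example: whatever tool does the indenting, comparing the text before and after catches a dropped letter immediately.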

brobertson commented 3 years ago

I'll leave this open until we're certain it's not a Lace issue.

brobertson commented 3 years ago

OK, I freshly generated urn:cts:greekLit:tlg0557.tlg004.1st1K-grc1 and the uppercase pi is still there. I processed the file on Linux with parallel tidy -xml -m -i {} ::: *xml, and the result also has the uppercase pi.

lcerrato commented 3 years ago

This was from a recent batch, ldpd_10922736_000.zip.

brobertson commented 3 years ago

Hi, Lisa. ldpd_10922736_000.zip is the set of files on which I based my comments above. When I generated them at heml.mta.ca/lace, I did not see these problems in the Lace output, but only after running tidy (on macOS).

brobertson commented 3 years ago

As for the initial example, I find that the Δ is already missing in the edited text, as shown in this image: Screenshot from 2021-04-15 13-59-38

brobertson commented 3 years ago

I've asked Charlotte to do a last scan of commentariaina21pt12akaduoft as it currently stands at heml.mta.ca/lace.

lcerrato commented 3 years ago

@brobertson Yes, I agree on this. I just could not be sure in the middle of the workflow (I would have had to go back, redownload, compare, etc.). Some of these were near those Aristotle brackets, so at first I thought they could have been creating some noise.

brobertson commented 3 years ago

Issue resolved: it was caused not by Lace but by either erroneous editing or post-processing with 'tidy' on macOS.