brobertson / Lace2

In-browser OCR editing program that transforms OCR results into structured, citable TEI. No XML experience required!
http://trylace.org
GNU General Public License v3.0

hyphenation resolution leaves space #142

Closed by lcerrato 3 years ago

lcerrato commented 3 years ago

I'm seeing this in the raw file dropped into GitHub (so pre-transformations): urn:cts:greekLit:tlg0557.tlg005.1st1K-grc1

The raw files can be found here: https://github.com/OpenGreekAndLatin/First1KGreek/issues/2320

https://babel.hathitrust.org/cgi/pt?id=mdp.39015065316021&view=2up&seq=602&size=175

<tei:div type="textpart" subtype="1" n="urn:cts:greekLit:tlg0557.tlg005.1st1K-grc1:17"><tei:p>17. (28). Μέτρον ἔστω σοι παντὸς σίτου καὶ ποτοῦ ἡ πρώτη τῆς ὀρέξεως ἔμπλησις, ὄψον δὲ καὶ ἡδονὴ αὐτὴ ἡ ὄρεξις· καὶ οὔτε πλείονα τῶν δεόντων προσοίσῃ οὔτε ὀψοποιῶν δεηθήσῃ πο τῷ τε τῷ παραπεσόντι ἀρκεσθήσῃ.

where πο τῷ should be ποτῷ

It's happened on every hyphenated word in this text so far.
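For reference, here is a minimal sketch of the behaviour I would expect from hyphenation resolution. It is not Lace's actual code: the function name `join_hyphenated` and the set of hyphen characters are assumptions made for illustration only.

```python
def join_hyphenated(lines, hyphens=("-", "\u00ad", "\u2010")):
    """Join OCR lines into one string, merging words split across a line end.

    Hypothetical helper: `hyphens` lists the characters treated as
    end-of-line hyphens (hyphen-minus, soft hyphen, Unicode hyphen).
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if out.endswith(hyphens):
            # Previous line ended mid-word: drop the hyphen and add NO space.
            out = out[:-1] + line
        elif out:
            # Ordinary line break: separate the lines with a single space.
            out = out + " " + line
        else:
            out = line
    return out


head = "δεηθήσῃ πο"
tail = "τῷ τε τῷ παραπεσόντι"
print(join_hyphenated([head + "-", tail]))
```

With the two lines from section 17, the fragments `πο-` and `τῷ` are joined with nothing between them, which is the result the raw file is missing. A line that ends in a genuine dash would also be joined, so this sketch is only suitable for checking expectations, not for production use.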

lcerrato commented 3 years ago

What is weird is that ἀπολύθητι in 38. (44). breaks as follows

<tei:div type="textpart" subtype="1" n="urn:cts:greekLit:tlg0557.tlg005.1st1K-grc1:38"><tei:p>38. (44). ᵃΕἰ βούλει δούλων ἐκτὸς ὑπάρχειν, αὐτὸς ἀπολύθη τι δουλείας·

but you would expect ἀπολύ θητι based on https://babel.hathitrust.org/cgi/pt?id=mdp.39015065316021&view=2up&seq=606&size=175

brobertson commented 3 years ago

Thanks for this. I can see where the problem lies. Happily, it is not in the current Lace code, as far as I can see, but rather in a slightly corrupted editing xar file. I suspect this happened because the editing xar file and its original OCR are very old, so the good news is that recent files (by far the majority we're working with now) shouldn't have this problem, at least not caused by this error mode.

brobertson commented 3 years ago

Let's move this to the corresponding Trello card, because it is a publication issue with this editing xar file (and perhaps other early ones). I'll figure out a way to cure it and test others for the same problem before they are rendered as TEI XML.
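One possible way to test other files for the same problem, sketched under assumptions: given a list of words known to be hyphenated across a line end in the source (how that list is obtained is left open here), look for any place in the rendered text where two adjacent tokens concatenate to one of those words. The function name `find_split_words` is hypothetical and not part of Lace.

```python
def find_split_words(words, text):
    """Return (word, broken_form) pairs where `word` occurs in `text`
    with one stray space inside it.

    Hypothetical check: `words` are the expected unbroken forms,
    `text` is the rendered output to inspect.
    """
    tokens = text.split()
    hits = []
    for word in words:
        # Compare every pair of neighbouring tokens against the word.
        for left, right in zip(tokens, tokens[1:]):
            if left + right == word:
                hits.append((word, left + " " + right))
    return hits


rendered = "αὐτὸς ἀπολύθη τι δουλείας·"
expected = "ἀπολύθη" + "τι"
print(find_split_words([expected], rendered))
```

Because the check compares whole adjacent tokens rather than the original hyphenation point, it would also flag the ἀπολύθη τι case above, where the space is not at the position of the printed line break. It can report false positives when two legitimate neighbouring words happen to concatenate to a listed word.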