Closed mikegerber closed 3 years ago
This looks like a bug in https://github.com/PRImA-Research-Lab/prima-page-converter. Guessing here, but perhaps the glyphs in question are NFD-normalized, i.e. combine two codepoints, which is wrongly interpreted to be two characters.
But I'll try to reproduce and make sure it's not an issue with UB-Mannheim/ocr-fileformat.
'uͤ' is always two codepoints, even in NFC, because there is no single codepoint to represent it:
In [7]: len(unicodedata.normalize('NFC', 'uͤ'))
Out[7]: 2
In [8]: len(unicodedata.normalize('NFD', 'uͤ'))
Out[8]: 2
(Only in MUFI, but that's PUA)
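For comparison, a minimal sketch of why NFC does not help here. It contrasts u + U+0308 (combining diaeresis), which Unicode can compose into the single codepoint ü, with u + U+0364 (combining Latin small letter e), which has no precomposed form. The variable names are only illustrative.

```python
import unicodedata

# u + COMBINING DIAERESIS has a precomposed form (U+00FC), so NFC shortens it.
u_diaeresis = "u\u0308"
# u + COMBINING LATIN SMALL LETTER E has no precomposed form in Unicode.
u_small_e = "u\u0364"

for text in (u_diaeresis, u_small_e):
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    # Print the lengths in both forms plus the codepoint names after NFC.
    print(len(nfc), len(nfd), [unicodedata.name(c) for c in nfc])
```

So any consumer that equates "one glyph" with "one codepoint" will reject such sequences regardless of the normalization form.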
The first problem seems to be that ocr-transform does not return a proper error code:
% ocr-transform page alto OCR-D-OCR-TESS/OCR-D-OCR-TESS_00000024.xml TMP.25711/TMP.25711_00000024.xml; echo $?
Error writing target ALTO XML file
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'oͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
[ ... etc ... ]
0
That should not be 0!
We just pass on the return code of calling the JPageConverter.jar which seems to be a false success here.
Judging by https://github.com/PRImA-Research-Lab/prima-page-converter/blob/master/src/org/primaresearch/dla/page/converter/PageConverter.java, it looks like JPageConverter.jar never exits with an explicit exit code.
Ah, dang, I thought that was fixed as part of https://github.com/PRImA-Research-Lab/prima-page-converter/issues/16 - can you open an issue for this upstream pls?
I will probably just fix it, i.e. submit a minimal PR, as it looks easy enough
even better, obviously 👍
I think the fix is easy (sprinkle a bunch of System.exit(1) calls), but building this Java stuff is hard (at least for me) without any build mechanism... Also, the libraries they depend on require manual building, because the latest core library release is out of date, I think.
The ALTO conversion is rather important, so I think we should somehow improve this situation?
As a workaround until https://github.com/PRImA-Research-Lab/prima-core-libs/issues/10 is resolved and we can contribute upstream, we could catch the string "Error writing target ALTO XML file" and exit with a non-zero return value. Would that help?
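A minimal sketch of what such a workaround could look like, assuming the converter is called through a small Python wrapper (ocr-transform itself may do this differently). The function name `run_and_check` is hypothetical. The error string is the one from the log above.

```python
import subprocess
import sys

# The message the converter prints although it exits with 0 (from the log above).
ERROR_MARKER = "Error writing target ALTO XML file"


def run_and_check(cmd):
    """Run cmd; return non-zero if it failed or printed the error marker."""
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    # Pass the captured output through so the user still sees it.
    sys.stdout.write(proc.stdout)
    if proc.returncode != 0:
        return proc.returncode
    if ERROR_MARKER in proc.stdout:
        return 1
    return 0


# Hypothetical usage with the command from the report:
# sys.exit(run_and_check(["ocr-transform", "page", "alto", "in.xml", "out.xml"]))
```

Matching on a log string is brittle, since any change to the upstream wording silently disables the check, so this would only be a stopgap.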
Because we would still have the issue of the failing conversion, I don't think it's worth the effort of just working around the error handling, IMHO
JPageConverter 1.5.05 (just released) seems to fix the conversion, so I submitted a PR to ocr-fileformat (https://github.com/UB-Mannheim/ocr-fileformat/pull/131).
JPageConverter 1.5.05 does not fix conversion for my example file OCR-D-OCR-TESS_00000024.zip, reproduce with:
prima-page-converter -source-xml OCR-D-OCR-TESS_00000024.xml -target-xml alto.xml -convert-to ALTO
@kba There is one clean thing I see that ocrd-fileformat-transform could do to mitigate these problems: Check if the target file was created and handle a missing target file as an error.
Sure, can do.
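A rough sketch of that check, not the actual ocrd-fileformat-transform implementation. The function name `transform_checked` is hypothetical. Deleting a stale target before the run and treating an empty file as a failure are both assumptions of this sketch.

```python
import os
import subprocess
import sys


def transform_checked(cmd, target_path):
    """Run cmd and treat a missing or empty target file as a failure."""
    # Assumption: remove a stale target so an old file cannot mask a failed run.
    if os.path.exists(target_path):
        os.remove(target_path)
    returncode = subprocess.run(cmd).returncode
    if returncode != 0:
        return returncode
    # The converter may report success without writing anything.
    if not os.path.isfile(target_path) or os.path.getsize(target_path) == 0:
        print(f"Target file {target_path} was not created", file=sys.stderr)
        return 1
    return 0
```

Unlike matching on log output, this does not depend on the wording of upstream error messages.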
The underlying issue is that the file represents aͤ as one Glyph:
<pc:Glyph id="l9_word0012_glyph0000">
  <pc:Coords points="1635,2853 1655,2853 1655,2894 1635,2894"/>
  <pc:TextEquiv index="0" conf="0.981777420043945">
    <pc:Unicode>aͤ</pc:Unicode>
  </pc:TextEquiv>
</pc:Glyph>
While ocrd_calamari represents it as two Glyphs:
<pc:Glyph id="l9_word0012_glyph0000">
  <pc:Coords points="2511,3682 2517,3682 2517,3738 2511,3738"/>
  <pc:TextEquiv index="1" conf="0.996030032634735">
    <pc:Unicode>a</pc:Unicode>
  </pc:TextEquiv>
</pc:Glyph>
<pc:Glyph id="l9_word0012_glyph0001">
  <pc:Coords points="2517,3682 2529,3682 2529,3738 2517,3738"/>
  <pc:TextEquiv index="1" conf="0.976686894893646">
    <pc:Unicode>ͤ</pc:Unicode>
  </pc:TextEquiv>
</pc:Glyph>
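To find out which glyphs in a PAGE file would trip the ALTO length-1 constraint, a small sketch like the following could list every Glyph whose Unicode text has more than one codepoint. It matches elements by local name, so it does not depend on a particular PAGE schema version. The namespace URI in the sample and the function names are placeholders, not taken from the real files.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def multi_codepoint_glyphs(page_xml):
    """Return (glyph id, text) for every Glyph whose text is longer than 1 codepoint."""
    root = ET.fromstring(page_xml)
    hits = []
    for elem in root.iter():
        if local(elem.tag) != "Glyph":
            continue
        for uni in elem.iter():
            if local(uni.tag) == "Unicode" and uni.text and len(uni.text) > 1:
                hits.append((elem.get("id"), uni.text))
    return hits


# Placeholder sample; the namespace URI is made up for illustration.
SAMPLE = (
    '<pc:Word xmlns:pc="http://example.org/page">'
    '<pc:Glyph id="g0"><pc:TextEquiv><pc:Unicode>a\u0364</pc:Unicode></pc:TextEquiv></pc:Glyph>'
    '<pc:Glyph id="g1"><pc:TextEquiv><pc:Unicode>a</pc:Unicode></pc:TextEquiv></pc:Glyph>'
    "</pc:Word>"
)
print(multi_codepoint_glyphs(SAMPLE))
```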
Because I had a related problem yesterday: A simple pattern to iterate over graphemes rather than characters is:
from regex import findall
for grapheme in findall(r'\X', 'aͤa'):
    print(grapheme, len(grapheme))
>>> aͤ 2
>>> a 1
EDIT: Sorry, I hadn't read the ocrd_calamari recognize implementation; the glyphs come like that from calamari itself, not from iterating over bytes.
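If the third-party regex module is not available, a rough standard-library approximation is to attach each combining mark to the preceding character. This is only a sketch and not full grapheme cluster segmentation (it ignores Hangul jamo, ZWJ sequences and so on), but it is enough for cases like aͤ. The function name `combining_clusters` is hypothetical.

```python
import unicodedata


def combining_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for char in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


for cluster in combining_clusters("a\u0364a"):
    print(cluster, len(cluster))
```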
Yeah, it's not yet clear to me which one is correct, because of subtle differences between grapheme( cluster)s, glyphs etc.
But that is a side quest, for now I think this issue can be closed, because
I opened an issue here for the two-vs-one-glyph problem:
Using the workspace actevedef_718448162.first-page.ocrd_fileformat_fail.zip I get the following error:
The file TMP.25711/TMP.25711_00000024.xml does not exist, so that "Successfully executed" is misleading ;-)
OCR-D-OCR-TESS was created using ocrd_tesserocr, so maybe there is a problem there too.