Bad error handling when converting from PAGE to ALTO (was: Error writing target ALTO XML file)

mikegerber commented 3 years ago

Using the workspace actevedef_718448162.first-page.ocrd_fileformat_fail.zip I get the following error:

% ocrd-fileformat-transform -I OCR-D-OCR-TESS -O TMP.$RANDOM
17:05:35.001 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-OCR-TESS_00000024 (PHYS_0024)
Error writing target ALTO XML file
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'oͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.

[ ... more messages like the above ...]

cvc-attribute.3: The value 'aͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
17:05:38.950 INFO ocrd-fileformat-transform - Successfully executed: ocr-transform page alto OCR-D-OCR-TESS/OCR-D-OCR-TESS_00000024.xml TMP.25711/TMP.25711_00000024.xml -- 
17:05:39.621 INFO ocrd.workspace.save_mets - Saving mets '/home/mike/devel/ocrd-galley/actevedef_718448162.first-page/mets.xml'

The file TMP.25711/TMP.25711_00000024.xml does not exist, so the "Successfully executed" message is misleading ;-)

OCR-D-OCR-TESS was created using ocrd_tesserocr, so maybe there is a problem there too.

kba commented 3 years ago

This looks like a bug in https://github.com/PRImA-Research-Lab/prima-page-converter. Guessing here, but perhaps the glyphs in question are NFD-normalized, i.e. each consists of two codepoints, which are wrongly interpreted as two characters.

But I'll try to reproduce and make sure it's not an issue with UB-Mannheim/ocr-fileformat.

mikegerber commented 3 years ago

'uͤ' is always two codepoints, even in NFC, because there is no single codepoint to represent it

In [7]: len(unicodedata.normalize('NFC', 'uͤ'))                 
Out[7]: 2

In [8]: len(unicodedata.normalize('NFD', 'uͤ')) 
Out[8]: 2

(A single codepoint for it exists only in MUFI, but that's in the Private Use Area)
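To make the point reproducible outside an IPython session, here is a small standard-library sketch. It assumes the combining character in these examples is U+0364 (COMBINING LATIN SMALL LETTER E); the helper name `nfc_len` is made up for illustration. NFC can only compose a sequence when Unicode defines a precomposed character for it, which is the case for a + U+0308 but not for a + U+0364.

```python
import unicodedata

def nfc_len(text):
    """Return the number of codepoints in `text` after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

# U+0308 COMBINING DIAERESIS: a precomposed U+00E4 exists, so NFC composes the pair.
print(nfc_len("a\u0308"))  # 1
# U+0364 (assumed to be the mark used here): no precomposed form, so two codepoints remain.
print(nfc_len("a\u0364"))  # 2
print(unicodedata.name("\u0364"))  # COMBINING LATIN SMALL LETTER E
```

So a length-1 restriction on the glyph content cannot be satisfied by normalizing the text first.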

mikegerber commented 3 years ago

The first problem seems to be that ocr-transform does not return a proper error code:

% ocr-transform page alto OCR-D-OCR-TESS/OCR-D-OCR-TESS_00000024.xml TMP.25711/TMP.25711_00000024.xml; echo $?
Error writing target ALTO XML file
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'oͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
[ ... etc ... ]
0

That should not be 0!

kba commented 3 years ago

> The first problem seems to be that ocr-transform does not return a proper error code:

We just pass on the return code of the JPageConverter.jar call, which seems to be a false success here.

mikegerber commented 3 years ago

https://github.com/PRImA-Research-Lab/prima-page-converter/blob/master/src/org/primaresearch/dla/page/converter/PageConverter.java looks like JPageConverter.jar never exits with an exit code.

kba commented 3 years ago

> https://github.com/PRImA-Research-Lab/prima-page-converter/blob/master/src/org/primaresearch/dla/page/converter/PageConverter.java looks like JPageConverter.jar never exits with an exit code.

Ah, dang, I thought that was fixed as part of https://github.com/PRImA-Research-Lab/prima-page-converter/issues/16 - can you open an issue for this upstream pls?

mikegerber commented 3 years ago

I will probably just fix it, i.e. submit a minimal PR, as it looks easy enough

kba commented 3 years ago

even better, obviously 👍

mikegerber commented 3 years ago

I think the fix is easy (sprinkle a bunch of System.exit(1)s), but building this Java stuff is hard (at least for me) without any build mechanism... Also, the libraries it depends on require manual building, because the latest core library release is out of date, I think.

The ALTO conversion is rather important, so I think we should somehow improve this situation?

kba commented 3 years ago

> The ALTO conversion is rather important, so I think we should somehow improve this situation?

As workaround until https://github.com/PRImA-Research-Lab/prima-core-libs/issues/10 is resolved and we can contribute to upstream, we could catch the string "Error writing target ALTO XML file" and exit with non-zero return value. Would that help?
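A rough sketch of what such a string-matching workaround could look like, written in Python for illustration only (this is not the actual ocr-fileformat code, and `run_checked` and `ERROR_MARKER` are hypothetical names). Since it is not established here whether the converter prints the message to stdout or stderr, the sketch checks both.

```python
import subprocess
import sys

# The message printed by the converter, as quoted in this thread.
ERROR_MARKER = "Error writing target ALTO XML file"

def run_checked(cmd):
    """Run `cmd`; return non-zero if it fails or prints the known error marker."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # Pass the tool's output through unchanged so nothing is hidden from the user.
    sys.stdout.write(proc.stdout)
    sys.stderr.write(proc.stderr)
    if proc.returncode != 0:
        return proc.returncode
    # Exit code 0 is not trusted: look for the error message in both streams.
    if ERROR_MARKER in proc.stdout or ERROR_MARKER in proc.stderr:
        return 1
    return 0
```

A caller would use it along the lines of `sys.exit(run_checked(["ocr-transform", "page", "alto", "in.xml", "out.xml"]))`. The obvious weakness is that it breaks silently if upstream ever changes the wording of the message.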

mikegerber commented 3 years ago

Because we would still have the issue of the failing conversion, I don't think it's worth the effort of just working around the error handling, IMHO

mikegerber commented 3 years ago

~~JPageConverter 1.5.05 (just released) seems to fix the conversion, so I submitted a PR to ocr-fileformat (https://github.com/UB-Mannheim/ocr-fileformat/pull/131).~~

JPageConverter 1.5.05 does not fix conversion for my example file OCR-D-OCR-TESS_00000024.zip, reproduce with:

prima-page-converter -source-xml OCR-D-OCR-TESS_00000024.xml -target-xml alto.xml -convert-to ALTO

mikegerber commented 3 years ago

@kba There is one clean thing I see that ocrd-fileformat-transform could do to mitigate these problems: Check if the target file was created and handle a missing target file as an error.
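To sketch the idea (in Python for illustration; this is an assumption about how such a check could look, not the real ocrd-fileformat-transform code, and `convert_and_verify` is a hypothetical name): run the converter, then treat a missing or empty target file as a failure regardless of the exit code.

```python
import os
import subprocess

def convert_and_verify(cmd, target):
    """Run the converter command; report success only if `target` was really written."""
    proc = subprocess.run(cmd)
    if proc.returncode != 0:
        return False
    # The converter may exit with 0 even though it wrote nothing,
    # so a non-empty target file is the actual success signal.
    return os.path.isfile(target) and os.path.getsize(target) > 0
```

One caveat: if a target file is left over from an earlier run, this check passes even when the current conversion failed, so the caller should make sure the target does not exist beforehand.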

kba commented 3 years ago

> Check if the target file was created and handle a missing target file as an error.

Sure, can do.

mikegerber commented 3 years ago

The underlying issue is:

ocrd_tesserocr represents aͤ as one Glyph:

                    <pc:Glyph id="l9_word0012_glyph0000">
                        <pc:Coords points="1635,2853 1655,2853 1655,2894 1635,2894"/>
                        <pc:TextEquiv index="0" conf="0.981777420043945">
                            <pc:Unicode>aͤ</pc:Unicode>
                        </pc:TextEquiv>
                    </pc:Glyph>

While ocrd_calamari represents it as two Glyphs:

                    <pc:Glyph id="l9_word0012_glyph0000">
                        <pc:Coords points="2511,3682 2517,3682 2517,3738 2511,3738"/>
                        <pc:TextEquiv index="1" conf="0.996030032634735">
                            <pc:Unicode>a</pc:Unicode>
                        </pc:TextEquiv>
                    </pc:Glyph>
                    <pc:Glyph id="l9_word0012_glyph0001">
                        <pc:Coords points="2517,3682 2529,3682 2529,3738 2517,3738"/>
                        <pc:TextEquiv index="1" conf="0.976686894893646">
                            <pc:Unicode>ͤ</pc:Unicode>
                        </pc:TextEquiv>
                    </pc:Glyph>
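To find out up front which glyphs in a PAGE file would trip the length-1 restriction reported in the ALTO validation errors above, something like the following could be used. It is a minimal sketch using only the standard library; the function name `multi_codepoint_glyphs` is made up, and the `{*}` namespace wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

def multi_codepoint_glyphs(root):
    """Return (glyph id, text) pairs for glyphs whose text is not exactly one codepoint."""
    offenders = []
    # "{*}" matches any namespace, so the PAGE schema version does not matter.
    for glyph in root.findall(".//{*}Glyph"):
        unicode_el = glyph.find("{*}TextEquiv/{*}Unicode")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        # Empty text is reported too, since it is not exactly one codepoint either.
        if len(text) != 1:
            offenders.append((glyph.get("id"), text))
    return offenders
```

For a file on disk this would be called as `multi_codepoint_glyphs(ET.parse("OCR-D-OCR-TESS_00000024.xml").getroot())`. With the ocrd_tesserocr output shown above it should report the glyph, while the ocrd_calamari variant should come back empty.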

kba commented 3 years ago

> While ocrd_calamari represents it as two Glyphs:

Because I had a related problem yesterday: A simple pattern to iterate over graphemes rather than characters is:

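from regex import findall  # third-party "regex" package; the stdlib "re" module has no \X
for grapheme in findall(r'\X', 'aͤa'):  # raw string, so \X is not treated as a string escape
    print(grapheme, len(grapheme))

# prints:
# aͤ 2
# a 1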

EDIT: Sorry, I didn't read the ocrd_calamari recognize implementation, the glyphs come like that from calamari itself, not from iterating over bytes.
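For cases where the third-party regex package is not available, a rough standard-library approximation is to attach every combining mark to the preceding base character. This is only a sketch (the name `split_clusters` is made up) and it is not a full implementation of Unicode grapheme cluster segmentation, but it handles base-plus-combining-mark sequences like the ones discussed here.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

for cluster in split_clusters("a\u0364a"):
    print(cluster, len(cluster))
```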

mikegerber commented 3 years ago

Yeah, it's not yet clear to me which one is correct, because of subtle differences between grapheme( cluster)s, glyphs etc.

But that is a side quest; for now I think this issue can be closed, because:

- error handling in ocrd_fileformat seems to have been correct from the start and the culprit is prima-page-converter
- mitigation of prima-page-converter not exiting with an error code is now implemented by checking if the output file exists.

mikegerber commented 3 years ago

I opened an issue here for the two-vs-one-glyph problem:

https://github.com/OCR-D/ocrd_tesserocr/issues/171

OCR-D / ocrd_fileformat, issue #25