ocrd-tesserocr-recognize produces glyph segmentation that ocrd-fileformat-transform can't convert to ALTO

mikegerber commented 3 years ago

Using this workspace: actevedef_718448162-bug-ocrd_tesserocr-vs-page-converter.zip

And these commands:

ocrd-tesserocr-recognize --overwrite -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS  -P model "GT4HistOCR_2000000" -P textequiv_level glyph
ocrd-fileformat-transform --overwrite -I OCR-D-OCR-TESS -O OCR-D-OCR-TESS-ALTO

I get these error messages:

15:34:39.308 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-OCR-TESS_00000024 (PHYS_0024)
Error writing target ALTO XML file
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'oͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, '#AnonType_CONTENTGlyphType'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, '#AnonType_CONTENTGlyphType'.
[... similar messages omitted ...]
15:34:44.545 ERROR ocrd-fileformat-transform - Transformation exited with return value 0 but no file was written.

This only affects glyphs, not words. Or in other words: -P textequiv_level word works fine, so I don't have a pressing problem with it. :)

(ocrd_calamari uses two glyphs for grapheme clusters like "oͤ". I am not sure if this is correct, but at least ocrd-fileformat-transform/page-converter is happy with it.)

mikegerber commented 3 years ago

The conversion fails completely, so it's not only error messages: "Transformation exited with return value 0 but no file was written."

bertsky commented 3 years ago

> cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.

That's because the ALTO XSD restricts the CONTENT attribute of Glyph to a length of exactly 1, i.e. a single code point:

https://github.com/altoxml/schema/blob/682bed5085b1c369debdaa8f1530ab3ffeae8540/v4/alto-4-2.xsd#L1039

I think this is a bug in the ALTO schema. It should at least allow for an arbitrary number of combining codepoints.
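
To make the length problem concrete, here is a minimal Python sketch (standard library only, not part of any of the tools discussed) that inspects the value from the error message. It assumes the glyph text is "o" followed by U+0364; the helper name `describe` is made up for illustration. It shows that the value has two code points and that NFC normalization cannot reduce it to one, whereas "o" plus a combining diaeresis does compose.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Assumption: this is the value reported in the error message above.
glyph = "o\u0364"

# Two code points: the base letter and a combining mark.
print(len(glyph), describe(glyph))

# NFC composes a sequence only if a precomposed character exists.
nfc = unicodedata.normalize("NFC", glyph)
print(len(nfc))

# For comparison: 'o' + COMBINING DIAERESIS does have a precomposed form.
print(len(unicodedata.normalize("NFC", "o\u0308")))
```

So normalizing the text before conversion would not work around the length restriction for glyphs like this one.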

Anyway, this is not an ocrd_tesserocr issue. The representation as 1 Glyph in PAGE is correct.

Probably not even an ocrd_fileformat or ocr-fileformat issue, but a prima-page-converter issue.

But I'd say discuss with the ALTO people first.

mikegerber commented 3 years ago

> But I'd say discuss with the ALTO people first.

@cneud

mikegerber commented 3 years ago

@cneud I agree with @bertsky: ALTO should allow a length > 1 to allow for glyphs with combining characters; there are some, like oͤ, that have no single-codepoint representation.

However, someone with more experience with XML Schema and this particular Unicode problem should probably look at this; maybe there is a way to express "one grapheme cluster".
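
As a rough illustration of what such a rule could look like, the Python sketch below accepts exactly one non-mark code point followed by any number of combining marks. This is only an approximation of a grapheme cluster (full segmentation is defined in Unicode UAX #29 and covers more cases), and the function name `is_single_glyph` is hypothetical. In XML Schema, a similar rule could presumably be written as a pattern facet along the lines of `\P{M}\p{M}*`, but that is untested here.

```python
import unicodedata


def is_single_glyph(text):
    """True if text is one non-mark code point followed only by combining marks.

    Approximation only: this is not full UAX #29 grapheme segmentation.
    """
    if not text:
        return False
    base, rest = text[0], text[1:]
    # The first code point must not itself be a mark (categories Mn, Mc, Me).
    if unicodedata.category(base).startswith("M"):
        return False
    # Everything after the base must be a mark.
    return all(unicodedata.category(ch).startswith("M") for ch in rest)


print(is_single_glyph("o\u0364"))  # base letter plus one combining mark
print(is_single_glyph("ou"))       # two base letters
```

Under this rule, the values from the error messages would be accepted as a single Glyph, while two base letters would still be rejected.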

bertsky commented 3 years ago

cf. https://github.com/altoxml/schema/issues/44

OCR-D / ocrd_tesserocr

ocrd-tesserocr-recognize produces glyph segmentation that ocrd-fileformat-transform can't convert to ALTO #171