mikegerber closed this issue 3 years ago
The conversion fails completely, so it's not only error messages: "Transformation exited with return value 0 but no file was written."
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
That's because ALTO does not allow more than 1 code point for CONTENT, by XSD restriction (the length facet of '1' on the Glyph CONTENT type named in the error above).
I think this is a bug in the ALTO schema. It should at least allow for an arbitrary number of combining codepoints.
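To make the length complaint concrete, here is a minimal Python sketch (not from the original report; the variable names are purely illustrative). It assumes the XSD length facet counts Unicode code points, which matches the "length = '2'" in the validator message.

```python
import unicodedata

# 'o' followed by U+0364 COMBINING LATIN SMALL LETTER E
glyph = "o\u0364"

# Python's len() counts code points, like the length reported by the validator
code_points = len(glyph)

# NFC only composes a sequence if a precomposed character exists;
# there is none for this pair, so the length stays the same
nfc_length = len(unicodedata.normalize("NFC", glyph))

# For comparison: 'o' + U+0308 COMBINING DIAERESIS does compose, to U+00F6
umlaut_nfc_length = len(unicodedata.normalize("NFC", "o\u0308"))

print(code_points, nfc_length, umlaut_nfc_length)  # prints: 2 2 1
```

So normalizing before conversion would not work around the restriction for this glyph.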
Anyway, this is not an ocrd_tesserocr issue. The representation as 1 Glyph in PAGE is correct.
Probably not even an ocrd_fileformat or ocr-fileformat issue, but a prima-page-converter issue.
But I'd say discuss with the ALTO people first.
@cneud
@cneud I agree with @bertsky, ALTO should allow a length > 1 to allow for glyphs with combining characters; there are some, like 'oͤ', that have no single-codepoint representation.
But: Someone with more experience with XML Schema and this particular Unicode problem should probably look at this, maybe there is a way to say "one grapheme cluster".
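As a rough illustration of what "one grapheme cluster" could mean here, the following Python sketch accepts exactly one non-mark code point followed by any number of combining marks. This is only an approximation and an assumption on my part: the function name `is_single_glyph` is hypothetical, and real grapheme cluster segmentation (Unicode UAX #29) has more rules, e.g. for Hangul and ZWJ sequences.

```python
import unicodedata


def is_single_glyph(text: str) -> bool:
    """Approximate check: one base code point plus zero or more marks.

    Hypothetical helper, not part of ALTO, PAGE or any OCR-D tool.
    """
    if not text:
        return False
    base, marks = text[0], text[1:]
    # General categories Mn, Mc and Me all start with "M" (marks)
    if unicodedata.category(base).startswith("M"):
        return False
    return all(unicodedata.category(ch).startswith("M") for ch in marks)


print(is_single_glyph("o\u0364"))  # True: base letter plus combining mark
print(is_single_glyph("ab"))       # False: two base letters
```

A similar rule might be expressible in the schema itself as a pattern facet built on Unicode category escapes (something like a non-mark followed by marks), but whether that fits ALTO would need checking with someone who knows XML Schema well.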
Using this workspace: actevedef_718448162-bug-ocrd_tesserocr-vs-page-converter.zip
And these commands:
I get these error messages:
This only affects glyphs, not words. Or in other words:
-P textequiv_level word
works fine, so I don't have a pressing problem with it. :)
(ocrd_calamari uses two glyphs for grapheme clusters like 'oͤ'. I am not sure if this is correct, but at least ocrd-fileformat-transform/page-converter is happy with it.)