Open MihoMahi opened 2 years ago
Thanks @MihoMahi for the report!
Does this only happen in the default textequiv_level=word, or also with textequiv_level=glyph?
@bertsky thank you for the hint. In fact I had tested textequiv_level=glyph before, but saw a lot of glyphs which I couldn't assign. Now I have examined the generated XML again and found that the word itself is represented correctly. I realize now that the "too many preceding letters" are simply the many glyph-level recognition alternatives, listed together with their confidence scores.
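To illustrate what those glyph-level alternatives look like, here is a minimal sketch (with a hypothetical PAGE-XML fragment, not output from this issue) that picks the top-confidence TextEquiv per Glyph and assembles the word from them:

```python
# Sketch: read glyph-level TextEquiv alternatives from a (hypothetical)
# PAGE-XML fragment and keep only the best hypothesis per glyph.
import xml.etree.ElementTree as ET

PAGE = """<Word xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Glyph>
    <TextEquiv conf="0.95"><Unicode>a</Unicode></TextEquiv>
    <TextEquiv conf="0.40"><Unicode>o</Unicode></TextEquiv>
  </Glyph>
  <Glyph>
    <TextEquiv conf="0.90"><Unicode>b</Unicode></TextEquiv>
    <TextEquiv conf="0.30"><Unicode>h</Unicode></TextEquiv>
  </Glyph>
</Word>"""

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

def top_text(word_xml):
    word = ET.fromstring(word_xml)
    chars = []
    for glyph in word.findall("pc:Glyph", NS):
        # Each Glyph may carry several alternative OCR hypotheses;
        # take the one with the highest @conf.
        best = max(glyph.findall("pc:TextEquiv", NS),
                   key=lambda te: float(te.get("conf", "0")))
        chars.append(best.findtext("pc:Unicode", "", NS))
    return "".join(chars)

print(top_text(PAGE))  # -> ab
```

This also shows why the glyph level can safely be ignored when only the final word text matters: the extra TextEquiv entries are alternatives, not additional letters.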
Yes, you might want to ignore the glyph level, as it contains alternative OCR hypotheses.
But the difference in the word level tells us that the blame is actually on Tesseract: it yields the wrong order when querying the result iterator on word level (and – I presume – on line and region level) for RTL script.
(The reason that the standalone CLI with ALTO renderer gets it right is merely because that only uses the glyph/symbol level iterator.)
@stweil I have not seen any examples for using the iterators on RTL data – is this a bug in Tesseract, or can we do something about it here (perhaps using ParagraphIsLtr)?
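One conceivable post-processing workaround (a hedged sketch only, not code from this project or from Tesseract itself): detect whether a word's text is strongly right-to-left using Unicode bidirectional classes, and if so, reverse the character sequence back into logical order before writing it to the TextEquiv:

```python
# Sketch of a possible workaround: if the word text came back in visual
# (reversed) order for RTL script, restore logical order by reversing it.
import unicodedata

def is_rtl(text):
    # A word counts as RTL if it contains any strong R or AL character
    # (Hebrew, Arabic, etc.) per the Unicode bidirectional classes.
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)

def fix_visual_order(text):
    # Reverse only words detected as RTL; leave LTR words untouched.
    return text[::-1] if is_rtl(text) else text

visual = "باتك"  # the letters of an Arabic word in reversed (visual) order
print(fix_visual_order(visual))  # restores logical order
print(fix_visual_order("abc"))   # LTR text passes through unchanged
```

Whether this is safe in general (e.g. for mixed-direction words or digits inside RTL runs) would need checking; fixing the iteration order in Tesseract itself would clearly be the cleaner solution.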
When using, for example, an Arabic model, recognition works fine, but the words inside the generated PAGE-XML contain reversed letters. The sequence of the words themselves is correct, though. Here is an example of a generated word with the wrong letter sequence:
but the line containing the recognized word should look like this:
(I know it is not easy to see that the letters are reversed, because Arabic letters change appearance depending on their position inside the word, but that is handled by the font.)
Here is the equivalent portion of the image:
REMARK: when using Tesseract standalone and generating ALTO, the sequence is correct!