OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

reverse order of glyphs inside words in PAGE-File for RTL languages #185

Open MihoMahi opened 2 years ago

MihoMahi commented 2 years ago

When using, for example, an Arabic model, recognition works fine, but the words inside the generated PAGE-XML contain reversed letters. The sequence of the words themselves is correct. Here is an example of a generated word with the wrong sequence of letters:

                <pc:Word id="region0001_line0001_word0000">
                    <pc:Coords points="1620,372 1620,402 1703,402 1703,375 1647,376"/>
                    <pc:TextEquiv conf="0.877831573486328">
                        <pc:Unicode>رصم</pc:Unicode>
                    </pc:TextEquiv>
                </pc:Word>

But the line containing the recognized word should look like this:

                        <pc:Unicode>مصر</pc:Unicode>

(I know it is not easy to see that it is reversed, because letters in Arabic change appearance depending on their position inside the word, but this is handled by the font.)

Here is the corresponding portion of the image, showing the word Msr.

REMARK: when using Tesseract standalone and generating ALTO, the sequence is correct!

bertsky commented 2 years ago

Thanks @MihoMahi for the report!

Does this only happen in the default textequiv_level=word, or also with textequiv_level=glyph?

MihoMahi commented 2 years ago

@bertsky thank you for the hint. In fact I had tested textequiv_level=glyph before, but saw a lot of glyphs which I couldn't assign. Now I have examined the generated XML again and found that the word itself is represented correctly there. I now realize that the many preceding letters are simply the alternative recognition results on glyph level, listed with their confidence scores.

bertsky commented 2 years ago

Yes, you might want to ignore the glyph level, as it contains alternative OCR hypotheses.

But the difference in the word level tells us that the blame is actually on Tesseract: it yields the wrong order when querying the result iterator on word level (and – I presume – on line and region level) for RTL script.

(The reason that the standalone CLI with ALTO renderer gets it right is merely because that only uses the glyph/symbol level iterator.)

@stweil I have not seen any examples for using the iterators on RTL data – is this a bug in Tesseract, or can we do something about it here (perhaps using ParagraphIsLtr)?
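
To make the "can we do something about it here" option concrete, here is a rough sketch of a post-processing workaround. It is not what ocrd_tesserocr or Tesseract actually do; it only assumes that the word-level text arrives in reversed (visual) order for RTL words, as in the example above. The function names `is_rtl_word`, `reverse_clusters` and `fix_word_order` are made up for illustration. Words mixing RTL and LTR characters (e.g. embedded digits) would need proper bidi handling and are not covered.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl_word(text):
    """Return True if the first strongly directional character is RTL."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return True
        if bidi == "L":
            return False
    return False  # no strong character found: leave the text alone


def reverse_clusters(text):
    """Reverse a string, keeping combining marks attached to their base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # diacritic stays with the preceding base letter
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def fix_word_order(text):
    """Hypothetical helper: undo the reversal for RTL words only."""
    return reverse_clusters(text) if is_rtl_word(text) else text


if __name__ == "__main__":
    wrong = "\u0631\u0635\u0645"  # the reversed word from the PAGE-XML above
    print(fix_word_order(wrong))
```

Such a helper would be applied to the word-level (and possibly line-level) text before writing it into `pc:Unicode`. Whether that is preferable to fixing the iterator order upstream is an open question.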