dlareklami / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Fix UTF8 reading errors in text2image #1133

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The DropUncoveredChars function was trying to be too clever for its own good. 
To save memory it was modifying a UTF8 string while reading from it. This 
mostly worked fine, but with long files occasionally didn't correctly advance 
to the next character. This caused UTF8 errors as utf8_step() was called in the 
middle of a UTF8 character.

The fix just simplifies the code to write into a different string than is being 
read. This uses a little more memory, but the string processing will not be a 
bottleneck at this point anyway.
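
The "write into a different string" approach described above can be sketched roughly as follows. This is not the attached patch: `Utf8SeqLen`, `DropUncovered` and the `is_covered` callback are hypothetical names used only for illustration, and the length helper is a simplified stand-in for Tesseract's own utf8_step() that does not validate continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: byte length of the UTF-8 sequence started by lead
// byte `c`, or 0 if `c` cannot start a sequence (continuation/invalid byte).
inline std::size_t Utf8SeqLen(unsigned char c) {
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Copies every character accepted by `is_covered` from `in` into `*out`
// and returns the number of characters dropped. The input is never
// modified, so the read position always advances over untouched data.
std::size_t DropUncovered(
    const std::string& in,
    const std::function<bool(const std::string&)>& is_covered,
    std::string* out) {
  assert(out != nullptr && out != &in);  // output must be a separate string
  out->clear();
  std::size_t dropped = 0;
  std::size_t pos = 0;
  while (pos < in.size()) {
    const std::size_t len = Utf8SeqLen(static_cast<unsigned char>(in[pos]));
    if (len == 0 || pos + len > in.size()) {
      // Malformed or truncated sequence: skip one byte and count it.
      ++dropped;
      ++pos;
      continue;
    }
    const std::string ch = in.substr(pos, len);
    if (is_covered(ch)) {
      out->append(ch);
    } else {
      ++dropped;
    }
    pos += len;  // always step by the full character length
  }
  return dropped;
}
```

Because the source string is only read, a dropped character cannot shift the bytes that the next step is about to inspect, which is the failure mode described above.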

Attached is a patch, as well as an example training_text.txt which triggers the 
bug.

Original issue reported on code.google.com by nick.wh...@durham.ac.uk on 13 Mar 2014 at 8:01


GoogleCodeExporter commented 9 years ago
This problem appears to be a bit of a phantom.
Not reproduced for me with the given file. (probably cleaned of illegal utf8 
somewhere in the upload/download process.)
Looks like too many utf8 error messages were being issued for a single bad utf8 
encoding. This is now fixed, but in a different way. Will still produce one 
error message for each bad byte in the utf8 sequence.

Original comment by theraysm...@gmail.com on 24 Apr 2014 at 9:17

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r1080.

Original comment by theraysm...@gmail.com on 24 Apr 2014 at 9:18

GoogleCodeExporter commented 9 years ago
After more testing I found that the UTF-8 reading error still exists, but only 
appears with some fonts.

I suspect it's some weird memory corruption thing therefore, but haven't looked 
into it more yet (I plan to soon).

In the meantime I'm attaching a (very minimal) example training_text.txt file, 
that fails with the attached 'GFS Didot' font, but succeeds with the 'Linux 
Libertine O' font.

Running this command: text2image --text training_text.txt --outputbase test 
--font 'GFS Didot' --fonts_dir .

I get this output:
Initializing fontconfig
WARNING: Illegal UTF8 encountered
ERROR: Illegal UTF8 encountered.
Index 0 char = 0xffffff90
Index 1 char = 0xffffffbc
Index 2 char = 0xffffff90
Index 3 char = 0xa
ERROR: Illegal UTF8 encountered.
Index 0 char = 0xffffffbc
Index 1 char = 0xffffff90
Index 2 char = 0xa
ERROR: Illegal UTF8 encountered.
Index 0 char = 0xffffff90
Index 1 char = 0xa
WARNING: Dropped 1 uncovered characters
(process:25315): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()
WARNING: Illegal UTF8 encountered
WARNING: Illegal UTF8 encountered
ERROR: Illegal UTF8 encountered.
Index 0 char = 0xffffffff
Index 1 char = 0xa
WARNING: Illegal UTF8 encountered
WARNING: Illegal UTF8 encountered
ERROR: Illegal UTF8 encountered.
Index 0 char = 0xffffffff
Error in boxaGetExtent: boxa not defined
Error in boxaAddBox: box not defined
Rendered page 0 to file test.tif
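
A note on reading the "Index N char = 0xffffff90" lines above: values of that shape are what one would expect if each byte is held in a signed char and printed with a hex conversion, so that bytes of 0x80 and above are sign-extended. That is an assumption about how the message is produced, not something confirmed here. The helper names below are made up for the illustration, which also assumes a 32-bit int.

```cpp
#include <cstdio>
#include <string>

// Hex form of a byte stored in a signed char and promoted to int first.
// Bytes >= 0x80 become negative and are sign-extended by the promotion.
std::string HexOfSignedChar(signed char c) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "0x%x",
                static_cast<unsigned int>(static_cast<int>(c)));
  return buf;
}

// Hex form of the same byte treated as unsigned: no sign extension.
std::string HexOfByte(unsigned char c) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "0x%x", static_cast<unsigned int>(c));
  return buf;
}
```

Under that reading, 0xffffff90 and 0xffffffbc would correspond to the raw bytes 0x90 and 0xbc, both of which are UTF-8 continuation bytes, and 0xa is a plain newline.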

Whereas this command: text2image --text training_text.txt --outputbase test 
--font 'Linux Libertine O' --fonts_dir 
/usr/share/fonts/opentype/linux-libertine/

Gives this output:
Initializing fontconfig
Rendered page 0 to file test.tif

I plan to investigate this more soon.

Original comment by nick.wh...@durham.ac.uk on 16 Jun 2014 at 9:07


GoogleCodeExporter commented 9 years ago
OK, I found a bug which was causing some bad errors. If HAVE_GETLINE is not 
defined, and you have lines longer than BUFSIZ, they are cut short, potentially 
in the middle of UTF-8 characters.

Attached is a patch that fixes that by replacing ReadLine() with a much simpler 
Read() routine.
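
To illustrate the idea (this is a sketch, not the attached patch, and `ReadAll` / `SplitLines` are hypothetical names): reading the whole stream into one string avoids any fixed-size line buffer, and splitting on '\n' afterwards is safe because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence.

```cpp
#include <cassert>   // used by the usage example
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>   // used by the usage example
#include <string>
#include <vector>

// Reads everything remaining in `in` into a single string. There is no
// BUFSIZ-style limit, so a long line cannot be cut mid-character.
std::string ReadAll(std::istream& in) {
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

// Splits `text` on '\n'. The newline itself is not kept, and a trailing
// newline does not produce an extra empty line.
std::vector<std::string> SplitLines(const std::string& text) {
  std::vector<std::string> lines;
  std::size_t start = 0;
  while (start < text.size()) {
    std::size_t end = text.find('\n', start);
    if (end == std::string::npos) end = text.size();
    lines.push_back(text.substr(start, end - start));
    start = end + 1;
  }
  return lines;
}
```

For example, feeding a `std::istringstream` holding one 20000-byte line that ends in a two-byte character, followed by a second short line, should give back exactly two lines with the long one intact.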

However, I still have failures unless my first patch in the initial comment is 
applied. The string is read correctly, and then is corrupted by the 
DropUncoveredChars() routine. Please do apply it; it isn't a phantom. Can you 
still not reproduce it, with the first attached training_text.txt? If not, I 
suspect it might be something that's compiler optimisation dependent, as it 
involves rewriting character strings on the fly in a way that maybe GCC can 
guess incorrectly about.

Original comment by nick.wh...@durham.ac.uk on 18 Jun 2014 at 9:24
