AmitGorvadiya / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Boxes too low, missing the characters #223

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. tesseract pde.img0015.tif pde.img0015 -l deu batch.nochop makebox
2. mv -f pde.img0015.txt pde.img0015.box
3.

What is the expected output? What do you see instead?
I expect the "box" rectangles to have characters inside them; instead both
the top and bottom of each box are 12 pixels too low.

What version of the product are you using? On what operating system?
tesseract 2.03 on Linux

Please provide any additional information below.
tesseractTrainer.py can be used to see the incorrect rectangles; mind you 
I didn't believe the flaw was in tesseract's makebox until after using 
pnmcut and the Gimp to double-check the coordinates.
I'll attempt to attach the tiff image.

Original issue reported on code.google.com by erei...@shaw.ca on 23 Jul 2009 at 7:08
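
A minimal sketch of the kind of double-check described above, assuming the box file uses the usual Tesseract layout of `<char> <left> <bottom> <right> <top>` with the origin at the bottom-left corner of the image. The helper names and the image height of 600 in the usage note are hypothetical, not taken from the report. The sketch converts one box line into a top-left-origin rectangle, which is the form an image editor's crop tool typically expects.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line; any trailing fields (e.g. a page number) are ignored."""
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return Box(parts[0], left, bottom, right, top)


def to_top_left_rect(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) measured from the image's top-left corner."""
    return (
        box.left,
        image_height - box.top,   # flip the y axis: box files count rows from the bottom
        box.right - box.left,
        box.top - box.bottom,
    )
```

For example, with a hypothetical 600-pixel-tall image, `to_top_left_rect(parse_box_line("D 101 504 131 535"), 600)` gives `(101, 65, 30, 31)`; cropping that rectangle should show the glyph if the box is correct.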

GoogleCodeExporter commented 9 years ago
With version 2.04 the result is the same.

Original comment by erei...@shaw.ca on 23 Jul 2009 at 8:09

GoogleCodeExporter commented 9 years ago
Tested with version 2.04; the output is fine (attached herewith).
The TIFF was uncompressed and saved at 300 dpi with IrfanView.
The OS is Windows XP. I generated the box file without -l deu, since I do
not have it installed.
The box file and log file are also attached for information.

Original comment by withbles...@gmail.com on 23 Jul 2009 at 10:38

GoogleCodeExporter commented 9 years ago
Your output is slightly different from mine: the bounding-box rectangles have
the same x-coordinates, but in yours the y-coordinates are consistently one
more than in mine. So in yours the boxes are too low by 11 pixels (rather
than 12), which is still not what I would call fine.

Original comment by erei...@shaw.ca on 23 Jul 2009 at 4:42
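
A comparison like the one above can be sketched in a few lines. This assumes both box files list the same characters in the same order, so that lines can be paired by position; the function name and the sample coordinates in the usage note are hypothetical. It tallies how often each (left, bottom, right, top) offset occurs between the two files.

```python
from collections import Counter
from typing import Iterable


def box_deltas(lines_a: Iterable[str], lines_b: Iterable[str]) -> Counter:
    """Count (dleft, dbottom, dright, dtop) offsets between paired box lines."""
    deltas = Counter()
    for a, b in zip(lines_a, lines_b):
        pa, pb = a.split(), b.split()
        if not pa or not pb or pa[0] != pb[0]:
            continue  # skip blank lines and pairs whose characters differ
        delta = tuple(int(y) - int(x) for x, y in zip(pa[1:5], pb[1:5]))
        deltas[delta] += 1
    return deltas
```

If every pair lands in a single bucket such as `(0, 1, 0, 1)`, the two runs agree horizontally and differ by a constant one-pixel vertical shift, which matches the pattern described in this comment.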

GoogleCodeExporter commented 9 years ago
The likeliest place to look for this bug would seem to be the boxfile-writing
routines. For one thing, the boxes all have precisely the correct height,
suggesting that the same misadjustment has been applied to both the top and
bottom. Furthermore, if the fundamental blob-recognizing part of tesseract
were as bad as these bounding boxes suggest, tesseract couldn't perform as
well as it does.

I've learned that the relevant routines in ccmain/baseapi.cpp are
TesseractRectBoxes, which calls TesseractToBoxText, which calls
ConvertWordToBoxText, and that they do arithmetic of the sort that could lead
to the observed symptoms. But that's as far as I've gotten.

Original comment by erei...@shaw.ca on 24 Jul 2009 at 3:06
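
To illustrate why a uniform shift with the correct height points at coordinate arithmetic, here is a purely hypothetical sketch. It is not the code in baseapi.cpp and not a confirmed cause of this bug; the function and the numbers are assumptions chosen only to show the effect. Flipping the y axis with a reference height that is off by some amount moves both edges of every box by that same amount and leaves the box height untouched.

```python
from typing import Tuple


def flip_y(y_top: int, y_bottom: int, height: int) -> Tuple[int, int]:
    """Convert top-left-origin rows to bottom-left-origin (bottom, top)."""
    return height - y_bottom, height - y_top


true_height = 600    # hypothetical image height
wrong_height = 588   # hypothetical reference height, 12 pixels short

correct = flip_y(65, 96, true_height)    # (504, 535)
shifted = flip_y(65, 96, wrong_height)   # (492, 523)

# Both edges move by the same amount, so the box height is preserved.
offset = (correct[0] - shifted[0], correct[1] - shifted[1])   # (12, 12)
same_height = (correct[1] - correct[0]) == (shifted[1] - shifted[0])
```

Any constant error added to or subtracted from both y values would produce the same signature, so the pattern narrows the search to the conversion step without identifying the exact mistake.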

GoogleCodeExporter commented 9 years ago
Confirmed against 3.00 SVN code as of today as well. Horizontally the
coordinates have been correct in all tests I've tried. Vertically though it's
consistently either too high or too low.

http://groups.google.com/group/tesseract-ocr/browse_thread/thread/4cc74b8dd795e6ce?pli=1

Original comment by wdin...@gmail.com on 29 Nov 2009 at 3:39

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r344.

Original comment by theraysm...@gmail.com on 20 May 2010 at 2:09

GoogleCodeExporter commented 9 years ago
Fixed at last!

Original comment by theraysm...@gmail.com on 20 May 2010 at 2:10

GoogleCodeExporter commented 9 years ago
Issue 290 has been merged into this issue.

Original comment by theraysm...@gmail.com on 20 May 2010 at 2:11