HarshUpadhyay / TesseractTrainer

A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Incorrect box results #2

Closed goncalopp closed 12 years ago

goncalopp commented 12 years ago

Using the latest version (v0.0.4 tag) with any text in any font produces box files with strange vertical coordinates, which then causes blob detection to fail.

Tested with:
font: http://openfontlibrary.org/assets/downloads/didact-gothic/7fe50f6001d2b721023972398c04ddf3/didact-gothic.zip
text: "test"
size: 20

result:
t 20 580 26 552 0
e 26 580 37 552 0
s 37 580 46 552 0
t 46 580 52 552 0

goncalopp commented 12 years ago

My bad, it seems the coordinates are OK; the GUI editor I was using was messing them up.

brouberol commented 12 years ago

Haha, I've spent the last hour reinstalling tesseract & dependencies and testing my coordinates system, and I just came to the same conclusion using a GUI editor: they are ok. Yay for me!

Now, about your "FAILURE! Couldn't find a matching blob" problem. If you play with the font size, you'll see that the number of occurrences of this error will vary. I think this has to do with the fact that the inter-character spacing is quite low with this font.

See this message (in https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images)

It is ABSOLUTELY VITAL to space out the text a bit when printing, so up the inter-character and inter-line spacing in your word processor. Not spacing text out sufficiently will cause "FAILURE! box overlaps no blobs or blobs in multiple rows" errors during tr file generation

One could try to manually insert a certain amount of space after each character, taking it into account when calculating the character coordinates (in both the PIL and Tesseract coordinate systems).
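As a rough illustration of that first idea, here is a minimal sketch in plain Python (no PIL needed to run it). It assumes the glyph sizes are already known, and the function name, parameters and default values are all hypothetical; this is not TesseractTrainer's actual code. It places glyphs left to right with some extra spacing, then flips the vertical axis, since PIL puts the origin at the top-left while Tesseract box files use a bottom-left origin.

```python
def layout_with_spacing(glyph_sizes, image_height, x0=20, y0=20, extra_spacing=10):
    """Place glyphs on one line with extra_spacing pixels between them.

    glyph_sizes: list of (char, width, height) in pixels. This is a
    hypothetical input; with PIL the sizes would come from measuring
    each glyph in the chosen font.
    Returns a list of (char, left, bottom, right, top) tuples expressed
    with a bottom-left origin.
    """
    boxes = []
    x = x0
    for char, width, height in glyph_sizes:
        # PIL-style rectangle: origin at the top-left, y grows downward.
        pil_left, pil_top = x, y0
        pil_right, pil_bottom = x + width, y0 + height
        # Flip the vertical axis to get a bottom-left origin.
        bottom = image_height - pil_bottom
        top = image_height - pil_top
        boxes.append((char, pil_left, bottom, pil_right, top))
        # Advance past the glyph plus the extra gap (assumed value).
        x = pil_right + extra_spacing
    return boxes


if __name__ == "__main__":
    for box in layout_with_spacing([("t", 6, 28), ("e", 11, 28)], image_height=600):
        print(*box)
```

The image would of course have to be drawn with the same offsets, so that the boxes and the rendered glyphs stay in sync.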

Another solution, which has been strongly suggested to me, is to re-write the whole tif generation process using ImageMagick instead of PIL, and to manually insert each character in a pre-defined box of 50*50px (for example). This means that we then would simply define the character spacing and set the font size once and for all. One important advantage of this solution is that TesseractTrainer would then be usable by python3 users, as PIL has not (yet?) been ported to py3k.
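The "pre-defined box" idea can be sketched independently of the drawing library. The snippet below is only an assumption-laden illustration (the function name, the 10-column layout and the choice of using the whole cell as the box are mine, not something tested against Tesseract): each character gets a fixed square cell, and the cell is converted to bottom-left-origin coordinates.

```python
def grid_boxes(text, image_height, cell=50, columns=10):
    """Assign each character of text to a fixed cell x cell square.

    Characters fill the grid row by row, `columns` cells per row.
    Returns (char, left, bottom, right, top) tuples with a bottom-left
    origin. The box here is the full cell, not the tight glyph outline.
    """
    boxes = []
    for index, char in enumerate(text):
        row, col = divmod(index, columns)
        # Cell rectangle with a top-left origin (as an image library sees it).
        left = col * cell
        img_top = row * cell
        right = left + cell
        img_bottom = img_top + cell
        # Flip the vertical axis for the bottom-left origin.
        boxes.append((char, left, image_height - img_bottom, right, image_height - img_top))
    return boxes


if __name__ == "__main__":
    for box in grid_boxes("test", image_height=100, cell=50, columns=2):
        print(*box)
```

With a fixed cell size the spacing and font size only need to be chosen once, although the boxes may need tightening around the glyphs if Tesseract rejects cell-sized ones.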

Unfortunately, I lack the time to maintain this project. Do not hesitate to fork it if these improvements seem sensible to you :)

goncalopp commented 12 years ago

Sorry for the trouble!

Yes, I was aware of the character separation issue. In my environment, even putting a space between each character doesn't seem to work for the simple test.

In fact, I found your script after implementing something simple with PIL that does the "square grid" behaviour you describe (and getting the same training failures) - there's no need for ImageMagick (for that much). I'd make a patch, but it seems tesseract wants nothing to do with me, so I wouldn't be able to test it :(

hairui commented 11 years ago

Why was the issue closed? I am still running into problems like this.

brouberol commented 11 years ago

The issue was closed because the box results were indeed correct, and it seemed to be the GUI editor which was at fault.

Feel free to open a new issue if you encounter a different problem.