AmitGorvadiya / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tesseract OCR doesn't recognize digits? #164

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi all,

I tried Tesseract OCR and everything worked fine until I tried to
recognize an image containing some digits.
Is there a special DB for digits?
Here are some more questions:
Where can I find more eng files?
Is it possible to recognize char by char rather than a whole word?

thanks.

Original issue reported on code.google.com by oha...@gmail.com on 4 Nov 2008 at 2:32

GoogleCodeExporter commented 9 years ago
For digits only, see the FAQ.
Recognition by char and by word will be available in 3.00.
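
As a rough illustration of a digits-only run, the sketch below builds a tesseract command line that loads the stock `digits` config file, which to my knowledge restricts recognition to digit characters. It assumes a Tesseract 4 or later command line (older releases spell the flag `-psm`, and 2.x needed `nobatch` before the config name). The helper name `digits_command` and the file names are hypothetical, and the FAQ of the time may have described something different.

```python
def digits_command(image_path, output_base, single_char=False):
    """Build a tesseract command line restricted to digits.

    Assumption: the stock "digits" config file is installed under
    tessdata/configs, as it is in a default Tesseract install.
    """
    cmd = ["tesseract", image_path, output_base]
    if single_char:
        # Page segmentation mode 10 treats the whole image as one character.
        cmd += ["--psm", "10"]
    # Config file names go last, after any options.
    cmd.append("digits")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so this works without tesseract.
    print(" ".join(digits_command("numbers.tif", "numbers")))
```

Passing `single_char=True` adds the single-character segmentation mode, which is the closest match to the "char by char" question once a 3.x or later build is available.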

Original comment by theraysm...@gmail.com on 14 Nov 2008 at 3:41

GoogleCodeExporter commented 9 years ago
To recognize long digit sequences, the best approach I have found is the following:
- Know where on the document the digits are (which is most likely the case if you need to do this at all). For this purpose, you can develop a small layout analysis tool that either extracts pieces of the image or, if you're using the C++ API, runs tesseract on those pieces.
- Create a custom language preset, trained on every font you think may appear.
- Alongside this custom language preset, create an effectively empty dictionary for disambiguation, containing only "abcdefghijklmnopqrstuvwxyz". This keeps tesseract from making "matches" based on language-based word checks.
- Last of all, don't forget the disambiguation file: 0 looks like an 8, and 3 looks like an 8. Only list resemblances between digits, because this language pack will only be used for digits.
- Then either launch tesseract on the chunks of data you've extracted or call it through the C++ API, in both cases with your custom language file (mine is a set of files named dgt.*).
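
The last step can be sketched roughly as follows, assuming the digit regions have already been cropped into separate image files and that a custom language named `dgt` is installed in tessdata. The function names, the `run` parameter, and the file layout are hypothetical, chosen only for illustration. The only Tesseract behaviour relied on is the standard `tesseract image outputbase -l lang` invocation, which writes `outputbase.txt`.

```python
import subprocess
from pathlib import Path


def chunk_command(chunk_path, lang="dgt"):
    """Build the tesseract command for one pre-cropped chunk image."""
    chunk = Path(chunk_path)
    # tesseract appends ".txt" to the output base on its own.
    output_base = chunk.with_suffix("")
    return ["tesseract", str(chunk), str(output_base), "-l", lang]


def recognize_chunks(chunk_paths, lang="dgt", run=subprocess.run):
    """Run tesseract on each chunk and collect the recognized text.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without tesseract installed.
    """
    results = {}
    for chunk_path in chunk_paths:
        cmd = chunk_command(chunk_path, lang)
        run(cmd, check=True)
        text_file = Path(cmd[2] + ".txt")
        results[str(chunk_path)] = text_file.read_text().strip()
    return results
```

Calling `recognize_chunks(["chunks/total.tif", "chunks/invoice_no.tif"])` would then return a dict mapping each chunk path to its recognized digits, provided the `dgt` files are where tesseract expects them.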

Hope that helps,
Pierre.

Original comment by hicksc...@gmail.com on 4 Apr 2010 at 11:37

GoogleCodeExporter commented 9 years ago

Original comment by theraysm...@gmail.com on 20 May 2010 at 2:15