Closed GoogleCodeExporter closed 9 years ago
For digits only see the FAQ.
Recognition by char and by word will be available in 3.00.
Original comment by theraysm...@gmail.com
on 14 Nov 2008 at 3:41
To recognize long digit sequences, the best approach I have found is the
following:
- Know where on the document the digits are (which is most likely the case
when you need to do this). For this purpose, you can develop a small layout
analysis tool that either extracts pieces of the image, or runs tesseract on
those pieces if you're using the C++ API.
- Create a custom language preset containing all the training data you can
gather for every font you expect to encounter.
- Along with this custom language preset, create an empty dictionary for
disambiguation, containing only "abcdefghijklmnopqrstuvwxyz". This keeps
tesseract from making "matches" based on language-based word checks.
- Last of all, don't forget the disambiguation file. 0 looks like an 8. 3 looks
like an 8. But only list resemblances between digits, because this language
pack will only be used for digits.
- Then either launch tesseract on the chunks of data you've extracted, or
start it with the C++ API, in both cases with your custom language file (mine
is a set of files named dgt.*).
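For the last step, here is a minimal sketch of running the tesseract
command line on chunks that have already been cropped out. It is an
illustration only: the helper names (build_command, recognize_chunks) and
the output directory are made up for this example, and it assumes the dgt.*
files are already installed in your tessdata directory.

```python
import shutil
import subprocess
from pathlib import Path

# Name of the custom language; assumes dgt.* files exist in tessdata.
LANG = "dgt"


def build_command(image_path, output_base, lang=LANG):
    """Return the argv list for one tesseract run on a pre-cropped image."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def recognize_chunks(chunk_paths, out_dir, lang=LANG):
    """Run tesseract on each chunk and return {chunk file name: text}."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for chunk in map(Path, chunk_paths):
        base = out_dir / chunk.stem
        subprocess.run(build_command(chunk, base, lang), check=True)
        # tesseract appends ".txt" to the output base name it is given.
        results[chunk.name] = Path(f"{base}.txt").read_text().strip()
    return results
```

Calling recognize_chunks(["chunk1.tif", "chunk2.tif"], "ocr_out") would then
give one text result per extracted chunk, each recognized with the
digits-only language.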
Hope that helps,
Pierre.
Original comment by hicksc...@gmail.com
on 4 Apr 2010 at 11:37
Original comment by theraysm...@gmail.com
on 20 May 2010 at 2:15
Original issue reported on code.google.com by
oha...@gmail.com
on 4 Nov 2008 at 2:32