Shreeshrii / tessdata_shreetest

finetuned traineddata files for tesseract 4.0.0 for testing

How to use that data from C? #2

Closed: alejandro-colomar closed this issue 4 years ago

alejandro-colomar commented 5 years ago

Could you please explain briefly how I should use this data from the Tesseract C API?

Until now I was doing this for the default English data (eng.traineddata):

TessBaseAPIInit2(handle_ocr, NULL, "eng", OEM_LSTM_ONLY);

Should I do the following for digits.traineddata?

TessBaseAPIInit2(handle_ocr, NULL, "digits", OEM_LSTM_ONLY);

Or should I do something else?

Shreeshrii commented 5 years ago

Yes, that should work. Please put digits.traineddata in the same folder as eng.traineddata.

alejandro-colomar commented 5 years ago

Thank you very much!! It worked!!

It has some trouble with the punctuation, but that's not very important for me. I'm reading prices, and I can assume that there will always be two decimal places.

However, I can give you the images I'm reading if they would help make your data more accurate in the future :)
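The two-decimal assumption can be applied as a small post-processing step on the OCR text. Below is a minimal sketch, not code from this thread: the helper name `price_from_ocr` is hypothetical, and it assumes the recognized digits themselves are correct and only the dot may be missing or misplaced. It drops every non-digit character and re-inserts the dot before the last two digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: rebuild a price such as "3.76" from raw OCR text,
 * assuming every price has exactly two decimal digits.
 * All non-digit characters (dots, spaces, newlines) are ignored, and the
 * dot is re-inserted before the last two digits.
 * Returns 0 on success, -1 if fewer than three digits were found, the
 * number is implausibly long, or the output buffer is too small.
 */
int price_from_ocr(const char *ocr, char *out, size_t out_size)
{
    char digits[32];
    size_t n = 0;

    assert(ocr != NULL && out != NULL);

    /* Collect only the digits from the OCR output. */
    for (const char *p = ocr; *p != '\0'; p++) {
        if (isdigit((unsigned char)*p)) {
            if (n == sizeof(digits))
                return -1;
            digits[n++] = *p;
        }
    }

    /* Need at least one integer digit plus two decimals,
     * and room for the digits, the dot and the terminator. */
    if (n < 3 || out_size < n + 2)
        return -1;

    memcpy(out, digits, n - 2);              /* integer part   */
    out[n - 2] = '.';                        /* restored dot   */
    memcpy(out + n - 1, digits + n - 2, 2);  /* two decimals   */
    out[n + 1] = '\0';
    return 0;
}
```

With this sketch, both "376" and "3.76" come out as "3.76". It cannot repair a misread digit, so a result like "3.16" for a 3.76 price would pass through unchanged.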

I don't know the font of my data.

Shreeshrii commented 5 years ago

I do not use this traineddata myself. It is only a sample to try out.

You can provide a couple of images. I can use them for testing, or try to find a font similar to that to include in a future run.

alejandro-colomar commented 5 years ago

In these images, a lot of dots were missed (roughly half of them). As expected, the currency symbol was not found in any of them. Two numbers were also wrong: in file 2.26e.png it read "2.16", and in file 3.76e.png it read "3.16".

[Attached images: 2.26e, 2.62e, 2.66e, 2.78e, 3.03e, 3.03e_2, 3.06e, 3.07e, 3.12e, 3.12e_2, 3.27e, 3.27e_2, 3.33e, 3.33e_2, 3.43e, 3.45e, 3.46e, 3.47e, 3.51e, 3.63e, 3.76e, 3.99e, 4.62e, 5.45e, 8.83e, 8.92e]

I have the original color images in .BMP format if you prefer them.

alejandro-colomar commented 5 years ago

Just removing the currency symbol from the images, and then dilating and eroding by 1 pixel, gives 100% accuracy, including the dots :)
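The thread does not say which library was used for the dilate-then-erode step (a morphological closing). As an illustration only, here is a plain C sketch with a 3x3 square neighbourhood, which is one way to read "1 pixel". The names `morph3x3` and `close3x3` are hypothetical, and the code assumes an 8-bit single-channel image stored row by row in which the pixels to grow are the brighter ones; if the polarity of your image is the opposite, invert it first or swap the two passes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * One 3x3 pass over the image.
 * dilate != 0: each output pixel is the maximum of its neighbourhood.
 * dilate == 0: each output pixel is the minimum (erosion).
 * Neighbours that fall outside the image are ignored.
 */
static void morph3x3(const unsigned char *src, unsigned char *dst,
                     int w, int h, int dilate)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned char v = src[y * w + x];
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= h || xx < 0 || xx >= w)
                        continue;
                    unsigned char n = src[yy * w + xx];
                    if (dilate ? (n > v) : (n < v))
                        v = n;
                }
            }
            dst[y * w + x] = v;
        }
    }
}

/*
 * Closing in place: dilate by 1 pixel, then erode by 1 pixel.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated.
 */
int close3x3(unsigned char *img, int w, int h)
{
    assert(img != NULL && w > 0 && h > 0);

    unsigned char *tmp = malloc((size_t)w * (size_t)h);
    if (tmp == NULL)
        return -1;

    morph3x3(img, tmp, w, h, 1);  /* dilation: img -> tmp */
    morph3x3(tmp, img, w, h, 0);  /* erosion:  tmp -> img */
    free(tmp);
    return 0;
}
```

A closing fills small gaps and holes in the bright regions while leaving their overall size roughly unchanged. The resulting buffer would still need to be handed to Tesseract through whatever image type your program already uses.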