charlesw / tesseract

A .Net wrapper for tesseract-ocr
Apache License 2.0

Trying to load custom traineddata for unclear fonts #537

Closed. wario2k closed this issue 3 years ago

wario2k commented 3 years ago

This is related to Issue #533. I have been having issues detecting numbers using the eng.traineddata provided at https://github.com/tesseract-ocr/tessdata_best

My input data is divided into cells, as can be seen below. The engine returns accurate results for 70% of the cells, but if there is a negative sign (i.e. "-") in the image, it tends to detect it as an "=" sign. When I pass the image below through the engine, I don't get the number back and just get "=".

PDVM_2_Cell_3X1

Whereas the images below are detected correctly:

PDVM_3_Cell_0X0

PDVM_2_Cell_0X0

PDVM_1_Cell_3X2

This happens with a bunch of other numbers as well, so to resolve the issue I trained my own model on a small data set, just for testing. It worked great when I used it with the tesseract executable, which I downloaded from https://github.com/UB-Mannheim/tesseract/wiki

customModel.zip

But when I try to load this model like so:

using (var engine = new TesseractEngine(@"./tessdata", "quadrants", EngineMode.Default))

I get the following exception:

Error opening data file tessdata/quadrants.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'quadrants'
Tesseract couldn't load any languages!

I also tried using two models:

using (var engine = new TesseractEngine(@"./tessdata", "eng+quadrants", EngineMode.Default))

But I ran into the same problem.

Is there a way for me to use this custom "language"? Any help would be greatly appreciated.

wario2k commented 3 years ago

@charlesw any ideas?

charlesw commented 3 years ago

Sorry, not really; I haven't tried custom languages. I would switch to using an absolute path for the tessdata path, though.

wario2k commented 3 years ago

For anyone struggling with this issue: generating a custom traineddata file using the methods outlined in https://github.com/tesseract-ocr/tesstrain seems to work fine. You do need a Linux-based environment to generate custom models, but once generated, the model can be passed to the Tesseract engine object as shown below:

using (var engine = new TesseractEngine(@"./tessdata", "customModel", EngineMode.Default))
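
A minimal sketch of the absolute-path suggestion from earlier in the thread, combined with a check that each requested .traineddata file actually exists before the engine is created. It uses only System.IO; the class and method names (TessdataPaths, Resolve) are hypothetical helpers for illustration and are not part of the wrapper's API. Whether this would have avoided the original "Error opening data file" exception is an assumption, since the thread does not confirm the root cause.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Tesseract wrapper.
public static class TessdataPaths
{
    // Converts a (possibly relative) tessdata directory into an absolute path
    // and verifies that "<language>.traineddata" exists for every language in
    // a '+'-separated list such as "eng+quadrants".
    public static string Resolve(string tessdataDir, string languages)
    {
        // Relative paths are resolved against the current working directory.
        string fullDir = Path.GetFullPath(tessdataDir);

        foreach (string language in languages.Split('+'))
        {
            string file = Path.Combine(fullDir, language + ".traineddata");
            if (!File.Exists(file))
            {
                // Fail early with the exact path that was checked.
                throw new FileNotFoundException(
                    "Missing traineddata file: " + file, file);
            }
        }

        return fullDir;
    }
}
```

The returned path could then be passed in place of @"./tessdata", for example new TesseractEngine(TessdataPaths.Resolve("./tessdata", "customModel"), "customModel", EngineMode.Default). If the file is missing, the exception message shows the absolute location that was searched, which makes a wrong working directory easier to spot.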