charlesw / tesseract

A .Net wrapper for tesseract-ocr
Apache License 2.0
2.26k stars 741 forks

How to read only character by character? #139

Closed chefjuanpi closed 9 years ago

chefjuanpi commented 9 years ago

Hi

I am trying to use Tesseract to read a car license plate. I crop the image down to the plate code and apply AForge filters to make it easier to read, but the result is weird.

For example, this is my raw plate code image:

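[image: platetextraw0] https://cloud.githubusercontent.com/assets/7308580/5209330/56293276-758b-11e4-9f14-62462bce1954.jpg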

After the filters, I tried processing it with a black background:

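[image: plate0] https://cloud.githubusercontent.com/assets/7308580/5209396/2d880846-758c-11e4-913c-53dd5f3f9831.jpg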

I get better results with a white background:

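[image: plate19] https://cloud.githubusercontent.com/assets/7308580/5209525/bf43e59c-758d-11e4-9e1f-eaf9ddea5373.jpg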

Most of the time the reading is M05 ACE. Sometimes it is H05 AEE or AEC, but it is never the correct code:

M05 ACC

At first I thought about training Tesseract, but I don't understand why it reads the two Cs differently, one as E and one as C. I think Tesseract tries to read a meaningful word, but a plate is a code made of numbers and letters.

Is it possible to configure Tesseract to read character by character only, or something similar?

My code:

TesseractEngine _ocr;
string tessdata = Application.StartupPath + @"\tessdata\";
_ocr = new TesseractEngine(tessdata, "eng", EngineMode.TesseractOnly);

_ocr.SetVariable("tessedit_char_whitelist", "ABCDEFGHJKLMNPQRSTVWXYZ1234567890");

private string Ocr(Bitmap image)
{
    // Dispose both the Pix and the Page once the text has been read.
    using (Pix pixplate = PixConverter.ToPix(image))
    using (var plateText = _ocr.Process(pixplate))
    {
        return plateText.GetText();
    }
}
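
A related setting worth trying is the page segmentation mode. The sketch below assumes the wrapper's `Process` overload that accepts a `PageSegMode`, and asks Tesseract to treat the cropped plate as a single line of text instead of running full layout analysis. `PlateReader` and `ReadPlate` are hypothetical names, and whether this changes the misread characters on these particular images is untested.

```csharp
using System.Drawing;
using Tesseract;

public static class PlateReader
{
    // Hypothetical helper: OCR a cropped plate image as one line of text.
    public static string ReadPlate(TesseractEngine engine, Bitmap image)
    {
        using (Pix pix = PixConverter.ToPix(image))
        // SingleLine skips layout analysis; PageSegMode.SingleChar is the
        // equivalent for an image cropped to a single glyph.
        using (Page page = engine.Process(pix, PageSegMode.SingleLine))
        {
            return page.GetText().Trim();
        }
    }
}
```

The segmentation mode only controls how the image is split into text regions; it does not by itself switch off the dictionary.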

charlesw commented 9 years ago

Sorry for the delay. Yes, Tesseract does consider the whole word and has a built-in dictionary. It gives you different results each time because it uses an adaptive algorithm (see their FAQ).

Anyway, I'd ignore that for now. First I'd disable the dictionary (see their docs, https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html - config file section) and maybe try configuring some new user patterns. Finally, you might try training your own language using the official tools provided by Tesseract. That is likely to improve the accuracy, given that the characters aren't similar to common fonts like Arial.
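
Disabling the dictionary could look roughly like this. `load_system_dawg` and `load_freq_dawg` are Tesseract's parameters for its word lists, and they are read at initialization, so this sketch passes them through the constructor overload that takes initial options, assuming a wrapper version that provides it. The same two settings can instead go in a config file. `PlateEngineFactory` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Linq;
using Tesseract;

public static class PlateEngineFactory
{
    // Init-time options that switch off Tesseract's built-in word lists.
    public static IDictionary<string, object> NoDictionaryOptions()
    {
        return new Dictionary<string, object>
        {
            { "load_system_dawg", false }, // main dictionary
            { "load_freq_dawg", false }    // frequent-words list
        };
    }

    // Assumes the overload (datapath, language, engineMode, configFiles,
    // initialOptions, setOnlyNonDebugVariables) is available.
    public static TesseractEngine Create(string tessdataPath)
    {
        var engine = new TesseractEngine(
            tessdataPath, "eng", EngineMode.TesseractOnly,
            Enumerable.Empty<string>(),   // no extra config files
            NoDictionaryOptions(),
            false);                       // setOnlyNonDebugVariables
        // Same whitelist as in the question.
        engine.SetVariable("tessedit_char_whitelist", "ABCDEFGHJKLMNPQRSTVWXYZ1234567890");
        return engine;
    }
}
```

The engine returned by `Create` can be used in place of `_ocr` from the question. With the word lists off, Tesseract has less reason to pull a plate code toward an English word, though a non-standard font may still need training.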
