charlesw / tesseract

A .Net wrapper for tesseract-ocr
Apache License 2.0
2.29k stars, 745 forks

Set Variable again, russian chars #120

Closed Viton-zizu closed 10 years ago

Viton-zizu commented 10 years ago

This "SetVariable" call does not work. How can I whitelist characters? engine.SetVariable("tessedit_char_whitelist", "АБВГД...etc");

charlesw commented 10 years ago

Try encoding the value as ASCII using Unicode escape sequences. I thought I'd done this automatically, but the code must be in the 3.03 branch.

(Replying to Viton-zizu's message of 11 Sep 2014 09:22.)

charlesw commented 10 years ago

Check out: http://stackoverflow.com/questions/1615559/converting-unicode-strings-to-escaped-ascii-string
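
For illustration, here is a minimal sketch of what "escaped ASCII" could mean in this context, assuming the suggestion is to pass the literal text \u0410 (six ASCII characters) instead of the character itself. The class and method names (UnicodeEscaper, EscapeNonAscii) are hypothetical and not part of the wrapper, and this thread does not confirm that Tesseract accepts a whitelist in this form.

```csharp
using System;
using System.Text;

public static class UnicodeEscaper
{
    // Hypothetical helper (not part of the Tesseract wrapper):
    // replaces every non-ASCII UTF-16 code unit with a \uXXXX escape
    // and leaves ASCII characters untouched.
    public static string EscapeNonAscii(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c > 127)
            {
                // Four lowercase hex digits per code unit; a surrogate
                // pair therefore becomes two consecutive escapes.
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, UnicodeEscaper.EscapeNonAscii("АБВ") returns the 18-character ASCII string \u0410\u0411\u0412. Note that the C# compiler turns the literal "\u0410" in source code into the single character А, so writing the escape directly in a string literal passes exactly the same string as typing the letter.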

Viton-zizu commented 10 years ago

I tried this and it does not work: engine.SetVariable("tessedit_char_whitelist", "\u0410"); where "\u0410" is the Russian letter "А".

AndreyAkinshin commented 10 years ago

I have the same problem with Russian characters. Converting to UTF-8 doesn't help, and version 3.03 doesn't help either.

But I think I know the solution. Check out this StackOverflow question: http://stackoverflow.com/questions/9794029/python-tesseract-ocr-get-digits-only

The line

SetVariable("tessedit_char_whitelist", someChars);

should be run before initializing.

In the previous version of the Tesseract wrapper, the initialization method existed separately from the constructor, so I could do this:

engine = new TesseractEngine(@"./tessdata", "rus", EngineMode.Default);
engine.SetVariable("tessedit_char_whitelist", rusChars);
engine.Init();

But in the current version I can't do this, because initialization was moved into the constructor. Please fix it.

charlesw commented 10 years ago

Okay, I'm going to look into this today.

(Replying to Andrey Akinshin's message of 20 Sep 2014 00:34.)

charlesw commented 10 years ago

This is the same issue as Issue #68. I'll backport the fix from 3.03 and see if that helps.

AndreyAkinshin commented 10 years ago

It works now, thanks. Can you merge it into the master branch and publish it via NuGet?

charlesw commented 10 years ago

Yes, I'll look at doing that tomorrow. I want to do a little more testing first, as there are quite a few changes since the last release.

(Replying to Andrey Akinshin's message of 20 Sep 2014 18:35.)

Viton-zizu commented 10 years ago

Great!