charlesw / tesseract

A .Net wrapper for tesseract-ocr
Apache License 2.0

user_patterns_suffix(Bazzar) question after reading all related issues (edited) #248

Open 9tontruck opened 8 years ago

9tontruck commented 8 years ago

Hi, I am trying to pass a string pattern to a TesseractEngine object when it is initialized. I am using "A .Net wrapper for tesseract-ocr" 3.0.1.0 in C#.

Here is my code:

C# code

using( TesseractEngine engine = new TesseractEngine( 
    @"./tessdata", 
    "eng", 
    EngineMode.Default, 
    "bazzar" ) )   // important: loads the config file tessdata/configs/bazzar
{   
    engine.SetVariable( "tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-" );
    engine.SetVariable( "language_model_penalty_non_freq_dict_word", "1" );
    engine.SetVariable( "language_model_penalty_non_dict_word", "1" );

    string user_patterns_suffix;
    engine.TryGetStringVariable( "user_patterns_suffix", out user_patterns_suffix );
    using( Page page = engine.Process( bitmap, PageSegMode.SingleLine ) )
    {
        ...
    }
}

tessdata/configs/bazzar

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

tessdata/eng.user-patterns

25\w\w\w\d\d

tessdata/eng.user-words

JAN
FEB
MAR
APR
MAY
JUN
JUL
AUG
OCT
SEP
NOV
DEC

TestImage.jpg

25MAR16

Output from tesseract:

25HAR16

I have successfully loaded user-words and user-patterns into the Tesseract engine, but Tesseract does not seem to consult my user-words list: it keeps returning HAR instead of MAR. How can I force the \w\w\w part of the pattern to be matched against the user-words list?

9tontruck commented 8 years ago

Anyone?

charlesw commented 8 years ago

Sorry, I'm not sure; I haven't really used user_patterns_suffix myself. Maybe try Stack Overflow.

Antonf26 commented 8 years ago

9tontruck, did you end up having any success with this? I am trying to get this working as well.

9tontruck commented 8 years ago

I had no luck :( I am still looking for a solution. Did you make any progress?

Romex91 commented 8 years ago

I'm having the same problem. I monitored the files with Process Monitor while running Tesseract. The engine never accesses eng.user-patterns or eng.user-words.

Then I decided to debug the method void Dict::Load(DawgCache *dawg_cache) in dict/dict.cpp.

The variables user_patterns_suffix, user_patterns_file, user_words_suffix and user_words_file have no value when Load() starts, even though I am sure I passed them to TessBaseAPI::SetVariable before Load() was called.

It is certainly a bug in libtesseract. Sorry, I have no time to investigate further.

UPD: I got it working by passing the variables directly to the Init() method. I think Init() resets the variables when it starts, so calling TessBaseAPI::SetVariable beforehand has no effect. The engine loads the user patterns inside Init(), hence calling TessBaseAPI::SetVariable after Init() does not work either.

9tontruck, I don't know what .Net wrapper you use. Browse your .Net API for the Init method and pass variables there. If it's absent, search for another wrapper or use C++.
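For the wrapper discussed in this thread, a minimal sketch of "passing the variables at init" might look like the following. It assumes your version of the wrapper exposes the TesseractEngine constructor overload that takes configFiles, initialOptions and setOnlyNonDebugVariables; check your assembly before relying on it. The option names and values are copied from the bazzar config above, and the image path is only a placeholder. This only changes when the variables are handed to Tesseract; it does not guarantee that the recognition output changes.

```csharp
using System;
using System.Collections.Generic;
using Tesseract;

class UserPatternsInitSketch
{
    static void Main()
    {
        // Variables the dictionary loader reads during initialization.
        // They are passed to the constructor instead of SetVariable(),
        // which would only run after the engine has been initialized.
        var initialOptions = new Dictionary<string, object>
        {
            { "load_system_dawg", "F" },
            { "load_freq_dawg", "F" },
            { "user_words_suffix", "user-words" },
            { "user_patterns_suffix", "user-patterns" }
        };

        // Assumption: this six-argument overload exists in your wrapper version.
        using (var engine = new TesseractEngine(
            @"./tessdata",
            "eng",
            EngineMode.Default,
            new string[0],      // no config files; the options above stand in for "bazzar"
            initialOptions,
            false))             // setOnlyNonDebugVariables
        using (var image = Pix.LoadFromFile("TestImage.jpg")) // placeholder path
        using (var page = engine.Process(image, PageSegMode.SingleLine))
        {
            Console.WriteLine(page.GetText());
        }
    }
}
```

Variables that do not influence dictionary loading, such as tessedit_char_whitelist, can still be set with engine.SetVariable after construction, as in the original code.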

UPD2: I failed to make it work. User words and patterns do not affect the recognition results. Tesseract loads these files, but it makes no difference. :(

IanGrainger commented 7 years ago

:( I'd love to be able to use the bazaar / bazzar config, too :( here's my SO question: http://stackoverflow.com/questions/40127994/how-to-give-tesseract-a-word-list-net-wrapper

apple2373 commented 7 years ago

I have the same problem.