bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

beta.3 tesseract issues #581

Closed j0rdanit0 closed 6 years ago

j0rdanit0 commented 6 years ago

After switching to the new beta.3 version of tesseract, I am seeing some issues that were not happening when I was using beta.1.

On Windows, calling the Init() method does not return 0, and I'm not sure what is wrong since there is no error message. On Ubuntu, calling the Init() method causes the Java process to be killed, with this error in the logs:

read_params_file: parameter not found: enable_new_segsearch

pom.xml:

        <dependency>
            <groupId>org.bytedeco.javacpp-presets</groupId>
            <artifactId>tesseract-platform</artifactId>
            <version>4.0.0-beta.3-1.4.2-SNAPSHOT</version>
        </dependency>

helper method to initialize the TessBaseAPI:

    private static tesseract.TessBaseAPI initializeApi( String languageString )
    {
        tesseract.TessBaseAPI api = new tesseract.TessBaseAPI();
        tesseract.StringGenericVector keys = null;
        tesseract.StringGenericVector values = null;
        try
        {
            keys = new tesseract.StringGenericVector();
            keys.addPut( new tesseract.STRING().addPut( "load_system_dawg" ) );
            keys.addPut( new tesseract.STRING().addPut( "load_freq_dawg" ) );
            keys.addPut( new tesseract.STRING().addPut( "load_punc_dawg" ) );
            keys.addPut( new tesseract.STRING().addPut( "load_number_dawg" ) );
            keys.addPut( new tesseract.STRING().addPut( "load_unambig_dawg" ) );
            keys.addPut( new tesseract.STRING().addPut( "load_bigram_dawg" ) );
            keys.addPut( new tesseract.STRING().addPut( "load_fixed_length_dawgs" ) );

            values = new tesseract.StringGenericVector();
            values.addPut( new tesseract.STRING().addPut( "0" ) );
            values.addPut( new tesseract.STRING().addPut( "0" ) );
            values.addPut( new tesseract.STRING().addPut( "0" ) );
            values.addPut( new tesseract.STRING().addPut( "0" ) );
            values.addPut( new tesseract.STRING().addPut( "0" ) );
            values.addPut( new tesseract.STRING().addPut( "0" ) );
            values.addPut( new tesseract.STRING().addPut( "0" ) );

            logger.info( "Initializing Tesseract with language(s): " + languageString );
            if ( api.Init( Properties.get().getTessdataPath(), languageString, tesseract.OEM_LSTM_ONLY, (ByteBuffer)null, 0, keys, values, false ) != 0 )
            {
                throw new RuntimeException( "Could not initialize tesseract." );
            }
        }
        finally
        {
            if ( keys != null )
            {
                keys.close();
                keys.deallocate();
            }
            if ( values != null )
            {
                values.close();
                values.deallocate();
            }
        }

        api.SetPageSegMode( tesseract.PSM_SINGLE_LINE );

        api.SetVariable( "tessedit_enable_doc_dict", "0" );
        api.SetVariable( "load_system_dawg", "0" );
        api.SetVariable( "load_freq_dawg", "0" );
        api.SetVariable( "load_punc_dawg", "0" );
        api.SetVariable( "load_number_dawg", "0" );
        api.SetVariable( "load_unambig_dawg", "0" );
        api.SetVariable( "load_bigram_dawg", "0" );
        api.SetVariable( "load_fixed_length_dawgs", "0" );
        api.SetVariable( "segment_penalty_garbage", "0" );
        api.SetVariable( "segment_penalty_dict_nonword", "0" );
        api.SetVariable( "segment_penalty_dict_frequent_word", "0" );
        api.SetVariable( "segment_penalty_dict_case_ok", "0" );
        api.SetVariable( "segment_penalty_dict_case_bad", "0" );
        api.SetVariable( "doc_dict_enable", "0" );
        api.SetVariable( "tessedit_enable_doc_dict", "0" );
        api.SetVariable( "language_model_penalty_non_freq_dict_word", "0" );
        api.SetVariable( "language_model_penalty_non_dict_word", "0" );

        return api;
    }

I'm also not exactly sure how all these parameters work - I've got some being defined for the Init() method, via StringGenericVector objects, and I've got others being defined after the Init() method via the SetVariable() method. None of them include the parameter that is listed in the error message: enable_new_segsearch

saudet commented 6 years ago

Does the BasicExample still work well?

j0rdanit0 commented 6 years ago

Same error message. I tweaked the paths to fit my environment:

        BytePointer outText;

        tesseract.TessBaseAPI api = new tesseract.TessBaseAPI();
        // Initialize tesseract-ocr with English, using an explicit tessdata path
        if (api.Init("C:\\dev\\swgoh-service\\tessdata", "eng") != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }

        // Open input image with leptonica library
        lept.PIX image = pixRead( args.length > 0 ? args[0] : "C:\\Users\\JordanS\\Desktop\\ImageMatchingTest\\munky.png");
        api.SetImage(image);
        // Get OCR result
        outText = api.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());

        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);

Log file contents:

Connected to the target VM, address: '127.0.0.1:51990', transport: 'socket'
read_params_file: parameter not found: enable_new_segsearch
Disconnected from the target VM, address: '127.0.0.1:51990', transport: 'socket'

Process finished with exit code 1

j0rdanit0 commented 6 years ago

UPDATE: I started thinking about your comment in my other ticket. It's true that I downloaded all of my .traineddata files from the up-to-date list, except for eng. I remembered that that one in particular came with my initial download of tesseract, not from the list. I updated it and the basic example works. Now, I'm getting this error message in my website code:

Error setting param load_fixed_length_dawgs

saudet commented 6 years ago

Looks like "load_fixed_length_dawgs" got removed, so you can remove it from your application as well: https://github.com/tesseract-ocr/tesseract/commit/18c8f8833f8f5a771c84ed4aba0ba3150964583d
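Since parameters can disappear between Tesseract versions, it may help to apply them through one helper that reports which names were rejected, rather than discovering them one at a time. The following is only a sketch: `ParamApplier` and `applyAll` are hypothetical names, and it assumes the setter returns `false` for an unknown parameter, as `SetVariable` does in the C++ API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical helper, not part of the JavaCPP presets.
final class ParamApplier {
    /**
     * Applies each name/value pair through the given setter and returns the
     * names for which the setter returned false, in iteration order.
     */
    static List<String> applyAll(Map<String, String> params,
                                 BiPredicate<String, String> setter) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!setter.test(e.getKey(), e.getValue())) {
                rejected.add(e.getKey());
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("load_system_dawg", "0");
        params.put("load_fixed_length_dawgs", "0");

        // Stand-in setter for demonstration only: pretends one name is unknown.
        BiPredicate<String, String> fakeSetter =
                (name, value) -> !name.equals("load_fixed_length_dawgs");

        System.out.println("Rejected: " + applyAll(params, fakeSetter));
    }
}
```

With the real API, the setter would presumably be `api::SetVariable`, assuming the binding's `String` overload returns a `boolean`. The returned list can then be logged so that removed parameters are easy to spot after an upgrade.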

j0rdanit0 commented 6 years ago

Awesome, that fixed it. As always, thanks for your help. One last thing: I'm getting this warning over and over:

Warning. Invalid resolution 0 dpi. Using 70 instead.

Even with the warning, I'm getting results that work pretty well, but maybe they would be better if I fixed the issue? I'm not sure what the warning is implying. I am feeding tesseract a BufferedImage, which represents a cropped portion of an image that I receive via HTTP in the form of a MultipartFile object. I'm not explicitly removing any DPI settings, so do you know why they would be missing? Also, I added code from this Stack Overflow article to add DPI settings to the BufferedImage, but that didn't change the result. Do you have any suggestions or insight to alleviate this?

saudet commented 6 years ago

It wants DPI settings in the PIX image. JavaCV isn't converting it, but we can easily enough set it, I'm guessing with this: http://bytedeco.org/javacpp-presets/leptonica/apidocs/org/bytedeco/javacpp/lept.html#pixSetResolution-org.bytedeco.javacpp.lept.PIX-int-int-
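As a rough sketch of that idea, the helper below only overrides the resolution when the image reports none. `ResolutionFixer` and `ensureResolution` are hypothetical names, and the 300 dpi fallback is an assumption rather than a recommendation from this thread. The native call is passed in as a callback so the logic can be tried without the native libraries.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical helper, not part of the JavaCPP presets.
final class ResolutionFixer {
    /**
     * Returns currentDpi if it is positive; otherwise invokes the setter with
     * (fallbackDpi, fallbackDpi) and returns fallbackDpi.
     */
    static int ensureResolution(int currentDpi, int fallbackDpi,
                                IntBinaryOperator setter) {
        if (currentDpi > 0) {
            return currentDpi;
        }
        setter.applyAsInt(fallbackDpi, fallbackDpi);
        return fallbackDpi;
    }

    public static void main(String[] args) {
        // Stand-in for the native image: records what the setter received.
        int[] stored = {0, 0};
        int used = ensureResolution(0, 300, (x, y) -> {
            stored[0] = x;
            stored[1] = y;
            return 0;
        });
        System.out.println("Using " + used + " dpi");
    }
}
```

Against the real bindings, the call might look like `ResolutionFixer.ensureResolution(pixGetXRes(image), 300, (x, y) -> pixSetResolution(image, x, y))`, placed before `api.SetImage(image)`. This assumes the `lept` class exposes `pixGetXRes` and `pixSetResolution` with the same integer signatures as the Leptonica C functions.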

j0rdanit0 commented 6 years ago

Excellent, that was it. Thanks so much!

j0rdanit0 commented 6 years ago

New error when performing Maven install:

Failed to execute goal on project swgoh-service: Could not resolve dependencies for project com.jordan:swgoh-service:jar:3.0.2: Failure to find org.bytedeco.javacpp-presets:leptonica:jar:windows-x86:1.76.0-1.4.2-20180611.143021-191 in http://repository.jboss.org/nexus/content/groups/public/ was cached in the local repository, resolution will not be reattempted until the update interval of jboss-public-repository-group has elapsed or updates are forced

        <dependency>
            <groupId>org.bytedeco.javacpp-presets</groupId>
            <artifactId>tesseract-platform</artifactId>
            <version>4.0.0-beta.3-1.4.2-SNAPSHOT</version>
        </dependency>

saudet commented 6 years ago

Make sure to use "mvn -U ..." to update your cache.

j0rdanit0 commented 6 years ago

facepalm yeah, of course.. thanks, we're all good here lol

saudet commented 6 years ago

And 1.4.2 has now been released, so make sure to use 4.0.0-beta.3-1.4.2!