Store image text in latest Sikuli 1.1.4_SNAPSHOT API

kumarkrish85 commented 5 years ago

I am using the Sikuli 1.1.4_SNAPSHOT SikuliX API jar in my tool. Earlier I was using the 1.1.2 jar, and image text was fetched fine with it. The latest jar fetches the text but adds some extra characters to it.

I am trying to fetch the text (DemoTestCase) in the below image

[image: demo]

I use the code

https://github.com/CognizantQAHub/Cognizant-Intelligent-Test-Scripter/blob/master/Engine/src/main/java/com/cognizant/cognizantits/engine/commands/image/Text.java#L115

find target code

https://github.com/CognizantQAHub/Cognizant-Intelligent-Test-Scripter/blob/master/Engine/src/main/java/com/cognizant/cognizantits/engine/commands/image/ImageCommand.java#L268

I got the below output

Instead of "DemoTestCase", the result contains extra characters, for example "EI DemoTestCase". Please share your inputs to resolve the issue.

When I try to fetch the allcommonaction text from the image (refer to the image above), it returns

[image: another]

kumarkrish85 commented 5 years ago

I did refer to issue https://github.com/RaiMan/SikuliX1/issues/195 and I am using tess4j version 3.5.2.

balmma commented 5 years ago

1.1.4 has a new version of Tesseract and might behave quite a bit differently to 1.1.2.

You might want to play around with the page segmentation mode. Use the following code to adjust this:

TextRecognizer tr = TextRecognizer.start();
tr.setPSM(1);

Value 1 or 12 might give you better results.

kumarkrish85 commented 5 years ago

Thanks @balmma, can you please share the docs?

balmma commented 5 years ago

As you seem to try to detect mainly non dictionary words, deactivating the dictionaries might also help:

tr.setVariable("load_system_dawg", "false")
tr.setVariable("load_freq_dawg", "false")

The following is a great resource if you don't get the expected results from the OCR: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Here you can find some info about OCR usage in SikuliX: https://sikulix-2014.readthedocs.io/en/latest/news.html#revision-of-the-text-and-ocr-feature

kumarkrish85 commented 5 years ago

Thanks @balmma

kumarkrish85 commented 5 years ago

I did try with

TextRecognizer tr = TextRecognizer.start();
tr.setPSM(1);

and updated Tesseract OCR to 3.05, but the issue still exists. I have tried different PSM values: 1, 12 and 8.

Below is the snippet of code I used:

rx.highlight(1);
TextRecognizer recog = TextRecognizer.start();
recog.setPSM(1);
recog.setVariable("load_system_dawg", "false");
recog.setVariable("load_freq_dawg", "false");
List<Match> matches = rx.collectWords();
Match match = matches.get(0);
String text = match.getText();

Expected words are DemoTestCase, allcommonaction and imgtextextracts, but we got the below output

[image: sik]

kumarkrish85 commented 5 years ago

Hi @RaiMan / @balmma, please advise.

RaiMan commented 5 years ago

Have you already tried with rx.text()?

kumarkrish85 commented 5 years ago

Yes, I have tried with rx.text(); please refer to the image in the first comment.

balmma commented 5 years ago

Probably worth a try with the upcoming Tess4J 4.4.0.

@RaiMan Any estimates when the dev branch is going to be merged?

RaiMan commented 5 years ago

@kumarkrish85 What system are you working on?

@balmma the stuff is principally ready and will work for Windows out of the box. I have to add something to the docs for macOS (Tesseract has to be installed with homebrew or Macports) and for Linux systems (did not make any tests, but seems to be much easier to get a working Tesseract 4 than with version 3). So it might get Monday until it is online officially.

kumarkrish85 commented 5 years ago

@RaiMan

I am working on Windows 10 with 8 GB RAM and the Visual C++ Redistributable Pack (>=2013) installed. For background on how I updated the tessdata: I downloaded https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata and copied it into the user directory.

RaiMan commented 5 years ago

@kumarkrish85 Ok, I can prepare a fat (includes all dependencies) sikulixapi.jar for you, with the Tess4J 4.4.0/Tesseract 4.1.0 complete with eng.traineddata to download from my Dropbox.

Please tell, if this would be suitable for you.

kumarkrish85 commented 5 years ago

@RaiMan , yes please share the link to download

RaiMan commented 5 years ago

ok, I will do it asap.

kumarkrish85 commented 5 years ago

Thanks @RaiMan

RaiMan commented 5 years ago

on OneDrive: https://1drv.ms/u/s!Ahzz_Daw4EefhUa1PI4i1XrKw4v3?e=uk48eh

Please test with Region.text() first. If you want to play around with any Tesseract settings, please look at the Tesseract 4.1.0 docs.

Please tell me, when you have successfully downloaded, so I can delete the file again.

Feedback about your tests is highly appreciated.

kumarkrish85 commented 5 years ago

@RaiMan I can't download it from my organization's environment. I will try on my laptop and update you.

RaiMan commented 5 years ago

@kumarkrish85 Sorry for the inconvenience, but currently no other fast option.

kumarkrish85 commented 5 years ago

@RaiMan no issues :) I have tested with your fat jar and Tesseract, and it worked. For a few words, though, the result is slightly off: allcommonaction is read from the region as allcommenactions. I also tried other operations, such as fetching the text to the right and left of the region, and they worked fine. I appreciate your timely help, thanks.

balmma commented 5 years ago

Do you have ClearType active on your system? Turning it off also helps sometimes.

kumarkrish85 commented 5 years ago

Sure, I will check and get back to you. Thanks

balmma commented 5 years ago

Oh, I'm not really a Windows expert. But I would say it must be somewhere in display settings.

Edit: Before an edit the previous comment was:

How and where to check the configuration?

That's why my answer doesn't really make sense :-)

balmma commented 5 years ago

https://lmgtfy.com/?q=windows+10+deactivate+cleartype

balmma commented 5 years ago

Yes you definitely have ClearType enabled on your system (it's not your fault, it's enabled by default and usually desirable). You can clearly see it when you open your screenshot in e.g. Gimp and zoom in (does usually not work in image viewers because they interpolate when zoomed in): You see those orange, yellow and blue artifacts? Those are the rendered subpixels. I'm pretty sure that disabling ClearType will give you much better results.
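The colored artifacts described above can also be checked for programmatically instead of by zooming in. What follows is only a minimal sketch, assuming a screenshot of dark text on a neutral (white or grey) background has already been loaded as a BufferedImage. The class name ColorFringeCheck and the threshold value are illustrative assumptions and not part of SikuliX. Colored UI elements would be counted as well, so the result is a hint rather than proof.

```java
import java.awt.image.BufferedImage;

public class ColorFringeCheck {

    /**
     * Returns the fraction of pixels whose red, green and blue values differ
     * by more than the given threshold. Grey-scale anti-aliased text has
     * (nearly) equal channels, so a noticeable fraction of colored pixels
     * hints at subpixel rendering such as ClearType.
     */
    public static double coloredPixelRatio(BufferedImage img, int threshold) {
        int colored = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Difference between the strongest and the weakest channel
                int spread = Math.max(r, Math.max(g, b)) - Math.min(r, Math.min(g, b));
                if (spread > threshold) {
                    colored++;
                }
            }
        }
        return (double) colored / (img.getWidth() * img.getHeight());
    }
}
```

Calling coloredPixelRatio(screenshot, 30) on a capture of a text-only area before and after turning ClearType off should show the ratio dropping towards zero if subpixel rendering was the cause.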

balmma commented 5 years ago

@RaiMan I'll try to figure out how we can improve reliability of Tesseract in such cases. Some very basic tests indicate that applying a modest Gaussian Blur with a 0.5 px radius before the resizing dramatically decreases the error rate. But I have to verify this with much more samples :-)
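For anyone who wants to experiment with the blur idea, here is a rough sketch using only the JDK's ConvolveOp. It is not the code used in the tests mentioned above: the class name LightBlur is hypothetical, the "0.5 px radius" is interpreted as a Gaussian sigma of 0.5 (an assumption), and the kernel is truncated to 3x3.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;

public class LightBlur {

    /** Builds a normalized 3x3 Gaussian kernel for the given sigma. */
    static Kernel gaussian3x3(double sigma) {
        float[] weights = new float[9];
        float sum = 0f;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                float w = (float) Math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma));
                weights[(dy + 1) * 3 + (dx + 1)] = w;
                sum += w;
            }
        }
        // Normalize so the overall brightness of the image is preserved
        for (int i = 0; i < weights.length; i++) {
            weights[i] /= sum;
        }
        return new Kernel(3, 3, weights);
    }

    /** Returns a blurred copy of src; border pixels are copied unchanged. */
    public static BufferedImage blur(BufferedImage src, double sigma) {
        ConvolveOp op = new ConvolveOp(gaussian3x3(sigma), ConvolveOp.EDGE_NO_OP, null);
        return op.filter(src, null);
    }
}
```

The blurred image would be produced with LightBlur.blur(screenshot, 0.5) before any resizing and OCR.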

kumarkrish85 commented 5 years ago

Thanks for the detailed solution. When will these changes be pushed to the master branch?

balmma commented 5 years ago

For me it would also be interesting whether or not disabling ClearType helps in your case.

kumarkrish85 commented 5 years ago

By disabling ClearType we are able to fetch the text clearly. Refer to the image below.

[image: working]

Can i capture about disabling ClearType in my tool OCR documentation?

balmma commented 5 years ago

Can i capture about disabling ClearType in my tool OCR documentation?

Sorry, don't understand the question :-). Can you rephrase please?

kumarkrish85 commented 5 years ago

I mean :) can I add this configuration to my tool documentation? Or are there any plans to address this in code?

balmma commented 5 years ago

Now I got it :-)

We always disable ClearType on all machines we run SikuliX scripts on. It can also cause problems with finding images (using the find() method, of course only if the image contains text) because the sub pixels are somewhat unpredictable and can even change from screenshot to screenshot of the same screen. It's quite hard to tackle this properly in SikuliX.

Saying that, you can save yourself a lot of trouble just disabling it if applicable.

@RaiMan Probably we should also add this in the SikuliX documentation. We had massive problems with not found images before we figured this out.

balmma commented 5 years ago

@RaiMan Did some tests with more samples. Blurring helps in some cases but makes it even worse in others. Disabling ClearType at the OS level has by far the greatest positive effect.

RaiMan commented 5 years ago

@balmma Thanks for the evaluation and comments.

Switch off ClearType: Probably we should also add this in the SikuliX documentation.

agreed. will do so.

Optimise image before giving it to Tesseract

As far as I have understood the Tesseract docs, the only must is to hand over an image with a resolution between 300 and 400 DPI (which is done in SikuliX). Please give me the blur code. At least we can add it as an option that can be tried if needed.
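To make the resolution requirement concrete: a screenshot has to be enlarged by roughly targetDpi / screenDpi before OCR. The following is only an illustrative sketch with plain Java 2D, not SikuliX's actual resizing code. The class name OcrUpscale, the bicubic interpolation and the 96 DPI screen value used in the usage note are assumptions.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class OcrUpscale {

    /** Scales src by targetDpi / sourceDpi using bicubic interpolation. */
    public static BufferedImage upscale(BufferedImage src, int sourceDpi, int targetDpi) {
        double factor = (double) targetDpi / sourceDpi;
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        // Draw the source stretched to the new size
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }
}
```

With an assumed screen resolution of 96 DPI, OcrUpscale.upscale(screenshot, 96, 300) enlarges the image by a factor of 3.125. Any subpixel color fringes in the screenshot are enlarged along with the text.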

balmma commented 5 years ago

@RaiMan The main problem is that Tesseract is optimized to recognize scanned documents, not text artificially optimized for LCD flat panel monitors :-) What we are doing is just upscaling all those ClearType artifacts, and our screenshots end up something like this: If we want to do something, we have to get rid of those artifacts first, either by disabling ClearType or by some clever preprocessing. I have an idea in mind to achieve this, but need some time to do some more experiments :-)

RaiMan commented 5 years ago

@balmma understood and agreed.

Side note: I have started a branch dev-opencv-4, where I upgrade to OpenCV 4.1.1. I will check whether it is possible to also get the stuff onto macOS with homebrew, which would bring the jar size down by about 30 MB.

RaiMan commented 5 years ago

@kumarkrish85 Did you make your final successful tests with the 1.1.4-jar I prepared for you (Tesseract 4) or with the latest available build (Tesseract 3)?

kumarkrish85 commented 5 years ago

@RaiMan I have used fat jar which you shared (Tesseract 4).

balmma commented 5 years ago

@RaiMan

size down by about 30 MB.

Sounds amazing :-)

RaiMan commented 5 years ago

@kumarkrish85 @balmma I will close this issue, when I have upgraded the official build to Tesseract 4 (will do asap).

kumarkrish85 commented 5 years ago

Thanks @RaiMan

kumarkrish85 commented 4 years ago

Hi @RaiMan, when are the changes planned to be pushed to master? The next patch release of my tool depends on this update.

RaiMan commented 4 years ago

@kumarkrish85 Sorry for the delay, but I had a hard time the last few days clarifying the situation on Linux (Ubuntu 18.04). I am now through with it and will trigger a new build sometime tomorrow containing the latest Tesseract 4 and this additional enhancement (#200).

kumarkrish85 commented 4 years ago

Thanks for the update @RaiMan

RaiMan commented 4 years ago

please try with latest build #382 from today. Should work.

balmma commented 4 years ago

tr.setVariable("load_system_dawg", "false")
tr.setVariable("load_freq_dawg", "false")

I've just figured out that this doesn't work. Those two variables can only be used in the init function of Tesseract, which means they have to be specified in a config file (https://github.com/tesseract-ocr/tesseract/wiki/ControlParams).

To get it working you have to perform the following steps:

Place a file called nodict in appdata/SikulixTesseract/tessdata/configs (appdata is ~/.Sikulix on Linux, C:\Users\<user>\AppData\Roaming\Sikulix on Windows) with the following content:

load_system_dawg     F
load_freq_dawg       F

Use it from Sikulix:

TextRecognizer recog = TextRecognizer.start();
recog.setPSM(12);
recog.setConfigs(Arrays.asList(new String[] {"nodict"}));
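The first step above (creating the nodict file) could also be scripted. This is a minimal sketch using only java.nio. The class name NoDictConfig is hypothetical, and the appdata folder is passed in by the caller rather than detected, because how SikuliX resolves that folder is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class NoDictConfig {

    /**
     * Writes the "nodict" Tesseract config below the given SikuliX appdata
     * folder and returns the path of the created file.
     */
    public static Path write(Path appData) throws IOException {
        Path configs = appData.resolve("SikulixTesseract")
                .resolve("tessdata")
                .resolve("configs");
        Files.createDirectories(configs);
        // Same two lines as in the config content shown above
        List<String> lines = Arrays.asList(
                "load_system_dawg     F",
                "load_freq_dawg       F");
        return Files.write(configs.resolve("nodict"), lines, StandardCharsets.UTF_8);
    }
}
```

On Linux this would be called as NoDictConfig.write(Paths.get(System.getProperty("user.home"), ".Sikulix")), following the appdata location given above.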

@RaiMan Might be worth adding this to the docs.

balmma commented 4 years ago

@RaiMan And probably the digit config from #73 as well.

RaiMan commented 4 years ago

The OCR docs have to be revised anyways. On my list now.

RaiMan / SikuliX1

Store image text in latest Sikuli 1.1.4_SNAPSHOT API #197