Encoding problems with extended characters via Tesseract

vivadavid commented 8 months ago

Describe the bug

When performing OCR via Tesseract on text with extended characters, the result has encoding problems.

Where is the bug

OCR Output

To Reproduce

Steps to reproduce the behavior:

1. Go to Fullscreen Grab.
2. Click on spa with Tesseract.
3. Select text that contains extended characters (á, é, í, ó, ú).

Expected behavior

The extended characters should be encoded and displayed properly. Instead, Spanish characters carrying an accent (á, é, í, ó, ú) come out garbled: instead of carácter estratégico, you get carÃ¡cter estratÃ©gico.
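The garbled output looks like what you get when UTF-8 bytes are read as Windows-1252. Here is a minimal Python sketch of that mechanism. It only illustrates the symptom and is not taken from Text Grab's code; the function names `garble` and `repair` are made up for the example.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo the mistake: recover the original bytes, then decode them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "carácter estratégico"
broken = garble(original)
print(broken)          # carÃ¡cter estratÃ©gico
print(repair(broken))  # carácter estratégico
```

Each accented letter takes two bytes in UTF-8, and each byte is shown as a separate Windows-1252 character, which is why every accent turns into a two-character sequence starting with Ã.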

Screenshots

This is the link to the original web article.

Screenshot of a passage of the web article:

screenshot

Screenshot of the text recognized via Windows OCR:

OCR_Windows

Screenshot of the text recognized via Tesseract:

OCR_Tesseract

Where did you get Text Grab?

GitHub releases
- Exe

Desktop (please complete the following information):

OS: Windows 11.
Version: 23H2.
Text Grab Version: 4.3.0.

vivadavid commented 8 months ago

I take the opportunity to wish you, @TheJoeFin, and everybody a merry Christmas and a happy New Year.

vivadavid commented 8 months ago

Hi, @TheJoeFin,

I can confirm that character encoding works now on version 4.3.1: thanks!

vivadavid commented 8 months ago

Sorry, I've just noticed that if you use the tool Extract Text from Images in Folder, the encoding problems are still there. I've done the following testing:

I've placed two images in a folder and I've run the tool. The result is that Tesseract doesn't work on image1.jpg (there's no recognized text), though it does work on image2.jpg, but with encoding problems.
However, if I drag these two images onto the Edit Text Window, Tesseract works on both images and there are no encoding problems.

images.zip

By the way, I had always run the tool on a folder containing plenty of images, so I had never noticed the following message, because the window scrolls as the text is generated:

Tesseract can only run single threaded, May be slower if processing many images Press Escape to cancel

I suggest displaying this message in a different way, so that the user can actually read it. One possibility would be to show a small pop-up window before proceeding, with a checkbox to not display the message again.

vivadavid commented 8 months ago

Hi, again,

It's weird, but the problems I had with Tesseract on Fullscreen Grab are back: maybe I didn't check properly when I wrote my first message earlier today.

So, to sum up, with my OCR settings on Spa with Tesseract, the situation is as follows:

There are problems with both Fullscreen Grab and Extract Text from Images in Folder.
There are NO problems when I drag some images onto the Edit Text Window.
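One thing that could explain why some code paths are affected and others are not is how each path decodes the bytes coming back from the OCR process. This is only a guess, not something confirmed about Text Grab. The Python sketch below uses a stand-in child process (a small Python one-liner, not Tesseract) that writes UTF-8 bytes, and decodes the same bytes two ways.

```python
import subprocess
import sys

# Stand-in for an OCR process: writes "carácter estratégico" as UTF-8 bytes.
# The escapes keep the command line pure ASCII.
CHILD_CODE = (
    "import sys; "
    "sys.stdout.buffer.write('car\\xe1cter estrat\\xe9gico'.encode('utf-8'))"
)


def run_child() -> bytes:
    """Run the stand-in process and return its raw, undecoded stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        check=True,
    )
    return result.stdout


raw = run_child()
right = raw.decode("utf-8")   # decoding with the encoding the child used
wrong = raw.decode("cp1252")  # decoding with a legacy Windows code page
print(right)
print(wrong)
```

If one code path decodes explicitly as UTF-8 and another falls back to a system code page, the first would give clean text and the second the garbled text, which would match the pattern above.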

TheJoeFin commented 8 months ago

@vivadavid can you make sure you don't have any older versions of Text Grab running, and make sure you are using v4.3.1?

If there is anything different about the test setup you are using, let me know. I was able to repro the bug before; after the fix, the bug went away for me.

vivadavid commented 8 months ago

Hi, @TheJoeFin,

Before I posted my messages the other day, I made sure that I was running the latest version (4.3.1), but I've double-checked, just in case, and it's indeed 4.3.1.

Just to try something else, I've deleted the settings file (user.config). I couldn't find it in the same folder as the executable, but I eventually found it here:

C:\Users\UserName\AppData\Local\Text_Grab

(By the way, if your intention is to make Text Grab completely portable, I'd suggest keeping the settings file in the same folder as the executable, where it is easily accessible in case you want to read it, copy it or delete it.)

Unfortunately, the problems persist after this last action.
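For anyone else who needs to find the settings file, here is a small Python sketch that looks for user.config under the Text_Grab folder in the local application data directory. The function name `find_user_configs` is made up for this example, and the recursive search is an assumption in case the file sits in a subfolder.

```python
import os
from pathlib import Path


def find_user_configs(local_appdata=None):
    """Return every user.config found under <local_appdata>/Text_Grab.

    local_appdata defaults to the LOCALAPPDATA environment variable,
    which on Windows points to C:\\Users\\<name>\\AppData\\Local.
    """
    root = local_appdata or os.environ.get("LOCALAPPDATA")
    if not root:
        return []
    base = Path(root) / "Text_Grab"
    if not base.is_dir():
        return []
    # Recursive search, assuming the file may be nested in subfolders.
    return sorted(base.rglob("user.config"))


for path in find_user_configs():
    print(path)
```

The sketch only lists the files; deleting or backing them up is left to the reader.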

Would you like me to send you my settings file? I'd prefer to use e-mail for that.
Do you have Tesseract 5.3.3 from Mannheim (tesseract-ocr-w64-setup-5.3.3.20231005.exe) installed on your PC?
In your own testing, did you try the Spanish language?

By the way, as usual, I'm running the executable file (NOT the self-contained or the Windows Store versions).

Please let me know if there's something else I can do. My Tesseract installation is brand new and my language packages were downloaded through the installer, so this should be fine.

I hope I haven't done anything weird by mistake which might cause you to waste time on this issue.

Thank you for your time!

TheJoeFin commented 8 months ago

@vivadavid I updated the files in the release. Try those new files and see if that fixes it for you. I think there was a build error: the builds came from different branches, and only some of them had the fixes.

vivadavid commented 8 months ago

Thank you, @TheJoeFin: the Unicode problems seem to be fixed now!

Unicode problems aside, I still see the other issue I mentioned earlier regarding Extract Text from Images in Folder:

On Image1.jpg (see attached ZIP file), no text is recognized (but it's recognized when I drag and drop the image).
On Image2.jpg, the text is recognized, but it's not exactly the same text recognized when I drag and drop the image.

Should I open a separate issue?

Thanks again for fixing the Unicode issue!

TheJoeFin commented 8 months ago

Great! Yeah, open a new issue and include the sample photos if you can.

vivadavid commented 8 months ago

> Great! Yeah, open a new issue and include the sample photos if you can.

I've just done it!

TheJoeFin / Text-Grab

Encoding problems with extended characters via Tesseract #409