Closed vivadavid closed 8 months ago
I'd like to take this opportunity to wish you, @TheJoeFin, and everybody else a merry Christmas and a happy New Year.
Hi, @TheJoeFin,
I can confirm that character encoding works now on version 4.3.1: thanks!
Sorry, I've just noticed that if you use the tool Extract Text from Images in Folder, the encoding problems are still there. I've done the following testing:
By the way, I had always run the tool on a folder containing plenty of images, so I had never noticed the following message, because the window scrolls as the text is generated:
Tesseract can only run single threaded, May be slower if processing many images Press Escape to cancel
I suggest displaying this message in a different way, so that the user can actually read it. One possibility would be to show a small pop-up window before proceeding, with a checkbox to stop the message from being displayed again.
Hi, again,
It's weird, but the problems I had with Tesseract on Fullscreen Grab are back: maybe I didn't check it out properly when I wrote my first message earlier today.
So, to sum up, and with my OCR settings on Spa with Tesseract, the situation is as follows:
@vivadavid can you make sure you don't have any older versions of Text Grab running, and that you are using v4.3.1?
If there is anything different about the test setup you are using, let me know. I was able to repro the bug before; after the fix, the bug went away for me.
Hi, @TheJoeFin,
Before I posted my messages the other day, I made sure that I was running the latest version (4.3.1), but I've double-checked, just in case, and it's indeed 4.3.1.
Just to try something else, I've deleted the settings file (user.config). I couldn't find it in the same folder as the executable, but I eventually found it here:
C:\Users\UserName\AppData\Local\Text_Grab
(By the way, if your intention is to make Text Grab completely portable, I'd suggest keeping the settings file in the same folder as the executable, where it is easily accessible in case you want to read it, copy it or delete it.)
Unfortunately, the problems persist after this last action.
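For anyone else trying to locate the settings file, here is a minimal Python sketch that searches a folder recursively for user.config. It assumes the file lives somewhere under the Text_Grab folder in the local application data directory, as in the path above; the helper name `find_settings_files` is made up for illustration, and the file may sit in nested subfolders, which is why the search is recursive.

```python
import os
from pathlib import Path


def find_settings_files(root: Path, name: str = "user.config") -> list[Path]:
    """Return every file called `name` under `root`, searching recursively.

    Returns an empty list if `root` does not exist or is not a folder.
    """
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(name) if p.is_file())


if __name__ == "__main__":
    # Assumption: LOCALAPPDATA points at C:\Users\<UserName>\AppData\Local.
    local_app_data = os.environ.get("LOCALAPPDATA", "")
    root = Path(local_app_data) / "Text_Grab"
    for path in find_settings_files(root):
        print(path)
```

The sketch only lists the files; deleting or backing them up is left as a manual step.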
By the way, as usual, I'm running the executable file (NOT the self-contained or the Windows Store versions).
Please let me know if there's something else I can do. My Tesseract installation is brand new and my language packages were downloaded through the installer, so this should be fine.
I hope I haven't done anything weird by mistake which might cause you to waste time on this issue.
Thank you for your time!
@vivadavid I updated the files in the release. Try those new files and see if that fixes it for you. I think there was an error in the build: it was done on different branches, and only some of them had the fixes.
Thank you, @TheJoeFin: the Unicode problems seem to be fixed now!
Unicode problems aside, I still see the other issue I mentioned earlier regarding Extract Text from Images in Folder:
Should I open a separate issue?
Thanks again for fixing the Unicode issue!
Great! Yeah, open a new issue and include the sample photos if you can.
I've just done it!
Describe the bug When performing OCR via Tesseract on text with extended characters, the result has encoding problems.
Where is the bug
To Reproduce Steps to reproduce the behavior:
Expected behavior The extended characters should be encoded and displayed properly. Instead, there are problems, for example, with Spanish characters carrying an accent (á, é, í, ó, ú), so that instead of getting carácter estratégico, you get the same words with the accented letters garbled.
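To illustrate the kind of garbling described above, here is a minimal Python sketch. It assumes the cause is the usual one for this symptom, namely UTF-8 bytes being decoded with a single-byte code page such as Windows-1252; the thread does not confirm that this is what happened inside Text Grab, and the helper names `garble` and `repair` are made up for illustration.

```python
def garble(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo garble(): re-encode as Windows-1252 and decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "carácter estratégico"
garbled = garble(original)

print(garbled)                      # carÃ¡cter estratÃ©gico
print(repair(garbled) == original)  # True
```

Each accented letter takes two bytes in UTF-8, so under this assumption every á, é, í, ó or ú turns into two unrelated characters, while unaccented text comes through intact.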
Screenshots This is the link to the original web article.
Screenshot of a passage of the web article:
Screenshot of the text recognized via Windows OCR:
Screenshot of the text recognized via Tesseract:
Where did you get Text Grab?
Desktop (please complete the following information):