SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

SE 3.5.13 & Tesseract 4.1.1 OCR Bug ( -1073741795 error ) #3933

Closed wtester7 closed 4 years ago

wtester7 commented 4 years ago

Hi @niksedk,

I just tested your newly released SE 3.5.13 & Tesseract 4.1.1, but it is impossible to OCR an .mks file. With SE 3.5.11 & Tesseract 4.1.0 there is no problem when OCR'ing...

Something's wrong with Tesseract 4.1.1 in your SE 3.5.13. I have 7zipped the fresh portable SE 3.5.13 with the .mks file "00003.mks" in the root directory so you can test the problem for yourself...

https://www.upload.ee/files/11005579/SE3513.7z.html

Btw, this happens with other mks files too!

Thanks Nik!

niksedk commented 4 years ago

The .mks file works fine here... what's your OS and what's the problem?

wtester7 commented 4 years ago

Win10 x64 Build 1903.

With SE 3.5.11 & Tesseract 4.1.0 there is no problem when OCR'ing...

With SE 3.5.13 & Tesseract 4.1.1 this problem occurs when OCR'ing:

Tesseract_4.1.1_OCR_Error.jpg

vivadavid commented 4 years ago

Hi,

I've tested the MKS file and I get the same error on Win 10 x64 1909. I saw the "Tesseract returned with code" message for the first time with SE 3.5.12, so it looks like it must have something to do with Tesseract 4.1.1.

wtester7 commented 4 years ago

This is a serious problem: I tested a bit more and got the same result, i.e. the Tesseract error.

It also caused a BSOD on my Win10 machine. I had Photoshop open at 70% RAM usage, plus my Vivaldi browser with various tabs open at about 80% RAM usage, leaving about 1 gigabyte of RAM free. When I OCR'ed the .mks file and the error appeared, I got a BSOD a minute later.

This means there is a memory leak that fills up the RAM until the BSOD happens...

niksedk commented 4 years ago

What happens if you run tesseract.exe via the cmd line?

niksedk commented 4 years ago

I've tested on three different computers (one is a clean pc with just Visual Studio) and all are working fine... could perhaps be something with "Visual C++ Redistributable"?

To which folder / how did you install SE?

OmrSi commented 4 years ago

It's also working fine here: Win10 x64, 1909 Build 18363.5592. Maybe try using LSTM only?

xylographe commented 4 years ago

I tried with Tesseract 4.1.0 and 4.1.1 (Original Tesseract only) on Win7. The results are identical.

wtester7 commented 4 years ago

OK guys, I have pinned the error down: xylographe has the right setting!

The following Tesseract 4.1.1 settings in SE 3.5.13 are working properly without errors!:

If I choose "LSTM Only", "LSTM + Tesseract" or "Default", the fatal error is triggered; only "Original Tesseract only" works!!


All Tesseract 4.1.0 OCR modes in SE 3.5.11 are working without any errors!!!


@niksedk I am only using portable programs, and I have all Visual C++ versions installed, so that is not the problem. The Tesseract 4.1.1 modes are the problem; only "Original Tesseract only" mode works fine!

wtester7 commented 4 years ago

As you read above, vivadavid also has the same problem!

@vivadavid can you please test with these settings:

and see if it's also the only mode that is working properly for you?

vivadavid commented 4 years ago

Hi, using the settings suggested by @wtester7, things worked.

None of the other three modes worked.

I'm also using the portable version. As for Visual C++, I have the x86 and x64 versions of C++ 2013 and C++ 2015-2019.

xylographe commented 4 years ago

I have now tested "LSTM Only" and "LSTM + Tesseract" too. I also tried disabling "Italic" and/or "Music symbol". All modes/combinations generate acceptable results, though it would appear (first impression only) that the best result (the fewest errors) is generated by "Tesseract only".

xylographe commented 4 years ago

Just a wild guess. Is there any chance that the problems are caused by a missing osd.traineddata?

wtester7 commented 4 years ago

Nope! I'm very sure something changed between Tesseract 4.1.0 and 4.1.1 that is causing the problems.

Why is it that only "Original Tesseract only" mode works properly and the other 3 modes are defective? How do the other 3 modes differ? Why is it that with Tesseract 4.1.0 all modes in SE 3.5.11 work without any errors!?

wtester7 commented 4 years ago

I have an idea how to further pinpoint the problem. Could @niksedk or another dev compile Tesseract 4.1.1 for SE 3.5.11 so I can test whether it works properly? That way we can see whether the changes made between 3.5.11 and 3.5.13 are causing the problem. If the commits are not at fault, then it's Tesseract 4.1.1.

xylographe commented 4 years ago

In the folder where you have extracted SE 3.5.11 you open the Tesseract410 folder. You rename the tesseract.exe file in that folder to tesseract.exe.410. Copy or link the new Tesseract 4.1.1 executable to that folder, and start SE.
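
The rename-and-copy steps above can also be scripted. This is a minimal sketch, not something from the thread: the function name `swap_tesseract_exe` and the example paths are hypothetical, and it swaps only the executable, exactly as described, without touching the trained data or any other file in the folder.

```python
import shutil
from pathlib import Path


def swap_tesseract_exe(tesseract_dir: Path, new_exe: Path,
                       backup_suffix: str = ".410") -> Path:
    """Back up tesseract.exe in tesseract_dir and put new_exe in its place.

    Returns the path of the backup file (e.g. tesseract.exe.410).
    """
    current = tesseract_dir / "tesseract.exe"
    backup = tesseract_dir / ("tesseract.exe" + backup_suffix)
    if not current.is_file():
        raise FileNotFoundError(f"no tesseract.exe in {tesseract_dir}")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    current.rename(backup)          # keep the old binary so the swap can be undone
    shutil.copy2(new_exe, current)  # copy (rather than link) the other version in
    return backup


# Hypothetical usage; adjust both paths to your own extraction folders:
# swap_tesseract_exe(Path("SE-3.5.11/Tesseract410"),
#                    Path("SE-3.5.13/Tesseract411/tesseract.exe"))
```

To undo the swap, delete the copied tesseract.exe and rename the backup back.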

wtester7 commented 4 years ago

Ok, so it really is the new Tesseract 4.1.1. I have just tested it and in SE 3.5.11 it is the same problem! Only the mode "Original Tesseract Only" is working fine, all other modes ( Default, LSTM Only, LSTM + Tesseract ) are crashing!

I have also tried the other way around, put Tesseract 4.1.0 in SE 3.5.13 and all modes are working fine!

xylographe commented 4 years ago

For now, anyone experiencing problems with Tesseract 4.1.1 can download Tesseract410.tar.gz and replace the new tesseract.exe in SE-3.5.13 (the containing folder is named Tesseract411) with the old tesseract.exe.

vivadavid commented 4 years ago

Thank you, @xylographe. I've just tried your suggestion and it works!

First, I deleted the folder too, but SE didn't prompt me to download the trained data files, as the programme thought that they had already been downloaded; as a result, I couldn't do the OCR.

My next step was to restore the folder with the files downloaded with Tesseract 4.1.1. It worked!

SE still shows Tesseract 4.1.1, while the file tesseract.exe corresponds to Tesseract 4.1.0 and the trained data corresponds to 4.1.1. Is that correct?

niksedk commented 4 years ago

@wtester7 / @vivadavid: what happens if you run tesseract 4.1.1 in the command prompt? See https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

wtester7 commented 4 years ago

cmd: tesseract 00003.mks test.srt --oem 1 -l eng

Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error during processing.


cmd: tesseract 00003.mks test.srt --oem 0 -l eng

Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error during processing.


It doesn't like the .mks file, tried with a .jpg and it works...

niksedk commented 4 years ago

@wtester7: Could you try a .png image with all engine modes in cmd line (via the "oem" parameter) ?

  "--oem 0"    Legacy engine only.
  "--oem 1"    Neural nets LSTM engine only.
  "--oem 2"    Legacy + LSTM engines.
  "--oem 3"    Default, based on what is available.

vivadavid commented 4 years ago

I had never used the command line with Tesseract, but I gave it a try with a PNG file containing English text.

Using oem 1, the output file was empty: tesseract image.png output --oem 1 -l eng

Using oem 0, it worked: tesseract image.png output --oem 0 -l eng

Then I tried with the MKS file.

Using either oem 0 or oem 1, I got the same error message ("Error during processing"):

tesseract 00003.mks output --oem 1 -l eng
tesseract 00003.mks output --oem 0 -l eng

vivadavid commented 4 years ago

Sorry, I've just seen your message, @niksedk

With the PNG file, I tried --oem 2 and --oem 3: in both cases, I got no error message and the file was empty.

tesseract image.png output --oem 2 -l eng
tesseract image.png output --oem 3 -l eng

wtester7 commented 4 years ago

tesseract test.png test.srt --oem 0 -l eng = works, written ocr file!
tesseract test.png test.srt --oem 1 -l eng = empty text file
tesseract test.png test.srt --oem 2 -l eng = empty text file
tesseract test.png test.srt --oem 3 -l eng = empty text file
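
The four runs above can be repeated in one go with a small script. This is only a sketch under a few assumptions: `tesseract` is on the PATH, the helper names (`build_command`, `classify`, `run_all_modes`) and the `out_oem<N>` output names are made up for illustration, and Tesseract writes its plain-text result to `<output base>.txt`.

```python
import subprocess
from pathlib import Path

# Engine modes as listed earlier in this thread.
OEM_MODES = {
    0: "Legacy engine only",
    1: "Neural nets LSTM engine only",
    2: "Legacy + LSTM engines",
    3: "Default, based on what is available",
}


def build_command(image, out_base, oem, lang="eng"):
    """Return the tesseract command line for one engine mode."""
    return ["tesseract", image, out_base, "--oem", str(oem), "-l", lang]


def classify(returncode, text):
    """Summarise one run the way results are reported in this thread."""
    if returncode != 0:
        return f"error (exit code {returncode})"
    return "works" if text.strip() else "empty text file"


def run_all_modes(image, lang="eng"):
    """Run tesseract once per --oem value and classify each result."""
    results = {}
    for oem in OEM_MODES:
        out_base = f"out_oem{oem}"  # hypothetical output name
        out_file = Path(out_base + ".txt")
        out_file.unlink(missing_ok=True)  # avoid reading a stale result
        # Raises FileNotFoundError if tesseract is not on the PATH.
        proc = subprocess.run(build_command(image, out_base, oem, lang),
                              capture_output=True, text=True)
        text = out_file.read_text(encoding="utf-8") if out_file.exists() else ""
        results[oem] = classify(proc.returncode, text)
    return results


# Example (needs a real image and an installed tesseract):
# for oem, result in run_all_modes("test.png").items():
#     print(f"--oem {oem} ({OEM_MODES[oem]}): {result}")
```

Reporting the exit code alongside the empty-output case makes it easier to tell a crash from a run that simply recognised nothing.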

vivadavid commented 4 years ago

Same results as @wtester7.

wtester7 commented 4 years ago

Just an idea to consider: I have a pretty old first-generation 6-core i7-970 CPU that is 10 years old but still rocking ( waiting for Zen3/4 to upgrade ;-)). Maybe some features/changes added in Tesseract 4.1.1 ( since Tesseract 4.1.0 ) are not compatible with old CPUs?

Edit: But then, why is Original Tesseract/Legacy mode working and the other 3 engine modes not?!

vivadavid commented 4 years ago

I found this information on https://github.com/UB-Mannheim/tesseract/wiki:

"We don't provide an installer for Tesseract 4.1.0 because we think that the latest version 5.0.0-alpha is better for most Windows users in many aspects (functionality, speed, stability). Version 4.1 is only needed for people who develop software based on the Tesseract API and who need 100 % API compatibility with version 4.0."

There's something going on with Tesseract 4.1.(x) if they don't recommend using it. The funny thing is that version 4.1.0 didn't cause any trouble, as opposed to 4.1.1

vivadavid commented 4 years ago

Like @wtester7, my CPU is also quite old: a first generation Intel Core i7.

vivadavid commented 4 years ago

Hi!

I've tested the issue on 3.5.14 and it's still there. Judging by the changelog, this was expected, but I wanted to try anyway.

Will it be fixed in a future release of Subtitle Edit or do we need to wait for the next version of Tesseract? In the meantime, it might make sense to go back to Tesseract 4.1.0, which didn't show this problem.

niksedk commented 4 years ago

I think they added some CPU optimizations in 4.1.1 without checking for support, perhaps AVX (it also runs much faster than 4.1.0).

So you can use 3.02 or do the hack of copying the 4.1.0 files to the 4.1.1 folder in SE - see https://github.com/SubtitleEdit/subtitleedit/issues/3933#issuecomment-576618240

Compiling tesseract is extremely difficult.
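
The error code in the issue title appears consistent with the CPU-optimization guess above. Read as an unsigned 32-bit NTSTATUS value, -1073741795 is 0xC000001D, which Windows defines as STATUS_ILLEGAL_INSTRUCTION. The sketch below only does that conversion; the helper names and the small lookup table are illustrative. That the illegal instruction is an AVX one is an inference, not something confirmed in this thread.

```python
def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS value."""
    return exit_code & 0xFFFFFFFF


# A few well-known NTSTATUS values; deliberately not an exhaustive table.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}


def describe(exit_code: int) -> str:
    """Format an exit code as hex plus its NTSTATUS name, if known."""
    status = to_ntstatus(exit_code)
    name = KNOWN_STATUS.get(status, "unknown")
    return f"0x{status:08X} ({name})"


print(describe(-1073741795))  # 0xC000001D (STATUS_ILLEGAL_INSTRUCTION)
```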

vivadavid commented 4 years ago

Thank you, @niksedk . I think they have been working on Tesseract 5 for quite some time now, which will hopefully make things easier.

niksedk commented 4 years ago

Does the Tesseract 5 alpha work for you? ( used the one from here: https://github.com/UB-Mannheim/tesseract/wiki ) https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.14/SubtitleEditBeta.zip

vivadavid commented 4 years ago

I got an error message @niksedk .

2020-03-09 - 19 24 13 -

niksedk commented 4 years ago

@vivadavid / @wtester7: Can you run "tesseract.exe" via command line from this package: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v5.0.0-alpha.20200223.exe ( for more info see https://github.com/UB-Mannheim/tesseract/wiki )

vivadavid commented 4 years ago

Hi, @niksedk

I've tried with every value for --oem and I keep getting an error message.

2020-03-14 - 12 06 32 - Tesseract 5 (error message)

niksedk commented 4 years ago

@vivadavid: Was that via tesseract command line and the Mannheim version? Like tesseract a.png out -l eng ?

My guess is that some CPU optimization is enabled that your CPU does not have...

vivadavid commented 4 years ago

@niksedk , I installed the programme from the link you gave me and typed commands like this:

tesseract image.png output --oem 0 -l eng

You may be right about the CPU optimization, because my laptop is very old: 1st generation Intel Core.

niksedk commented 4 years ago

@vivadavid: thx for testing :)

vivadavid commented 4 years ago

You are welcome :-) , @niksedk

Thank you for making such a great programme!

niksedk commented 4 years ago

You're welcome too :)

If you really want this fixed, try creating an issue here: https://github.com/UB-Mannheim/tesseract/issues

EDIT: Remember to include info about OS and perhaps CPU too! Cpu info tool: https://www.cpuid.com/softwares/cpu-z.html

vivadavid commented 4 years ago

Good idea, @niksedk

I've just done it: https://github.com/UB-Mannheim/tesseract/issues/31

niksedk commented 4 years ago

@vivadavid: could you attach a screenshot from http://download.cpuid.com/cpu-z/cpu-z_1.91-en.zip ?

Should be like: image

My CPU has AVX
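
To compare CPUs without eyeballing screenshots, the "Instructions" line from CPU-Z can be checked for AVX with a few lines of code. This is a rough sketch: the function name is hypothetical, and the two instruction strings are illustrative examples typed by hand, not copied from the screenshots in this thread. For background, AVX first appeared in Intel's second-generation Core processors (Sandy Bridge), so a first-generation Core i7 would not be expected to list it.

```python
import re


def missing_features(instructions, required=("AVX",)):
    """Return the required instruction sets absent from a CPU-Z style list.

    Tokens are compared exactly, so "AVX2" alone does not count as "AVX".
    """
    present = {tok.upper() for tok in re.split(r"[,\s]+", instructions) if tok}
    return [feat for feat in required if feat.upper() not in present]


# Illustrative strings only (assumed, not taken from the thread):
older_cpu = "MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES"
newer_cpu = "MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3"

print(missing_features(older_cpu))  # ['AVX']
print(missing_features(newer_cpu))  # []
```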

niksedk commented 4 years ago

@wtester7: What does your CPU info look like?

vivadavid commented 4 years ago

There you go, @niksedk . Thank you for taking the interest.

2020-03-15 - 15 19 47 - CPU-Z

stweil commented 4 years ago

Tesseract from UB Mannheim should work on any Intel or AMD x86 compatible CPU. Anything else would be a bug.

stweil commented 4 years ago

@vivadavid, did you keep the files from the installation together, or did you move tesseract.exe to a different directory?

vivadavid commented 4 years ago

@stweil , I didn't move anything.

wtester7 commented 4 years ago

CPU-Z

wtester7 commented 4 years ago

@stweil my path is: E:\Program Files\PortableProgs\Video\Subtitle Edit\Tesseract411\tesseract.exe

But I don't think it matters, because older OCR versions don't have that problem...