SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

3.5.9 Bug and very slow performance #3431

Closed wtester7 closed 5 years ago

wtester7 commented 5 years ago

Hello,

I have discovered a nasty BUG in the portable Subtitle Edit 3.5.10 Version and I am really disappointed with Tesseract 4, it's just too slow!!! I'm using a first generation Quad Core 2,67 Ghz and Win7 x64.

I have done various benchmarks with a BluRay 1117 lines PGS and the OCR with Tesseract 4:

Subtitle 3.5.10 portable:

Tesseract 3.02 ( 4 errors ) = 07 Minutes : 20 Seconds
Tesseract 4, Engine Mode "Default" ( 4 errors ) = 15 Minutes : 35 Seconds
Tesseract 4, Engine Mode "Original Tesseract only" ( 14 errors ) = 12 Minutes : 32 Seconds
Tesseract 4, Engine Mode "LSTM only" ( 2 errors ) = 15 Minutes : 22 Seconds
Tesseract 4, Engine Mode "Tesseract + LSTM" ( 5 errors ) = 16 Minutes : 46 Seconds

Subtitle 3.5.6 portable:

Tesseract ( 5 errors ) = 01 Minute : 50 Seconds !!!


Bug in Subtitle 3.5.10 portable:

Bug: Prompt for unknown Words Popup - Change all Button - replaces original eng_OCRFixReplaceList.xml ( 136 KB ) into a eng_OCRFixReplaceList.xml ( 2 KB ) - everything from the original eng_OCRFixReplaceList.xml is lost!

WORKS: Prompt for unknown Words Popup - Add to names/noise list Button - adding words to en_names.xml.
WORKS: Prompt for unknown Words Popup - Add to user dictionary Button - adding words to en_US_user.xml.
WORKS: Prompt for unknown Words Popup - USE ALWAYS - adds words to eng_OCRFixReplaceList_User.xml

Bugs in Subtitle 3.5.6 portable:

Bug: Prompt for unknown Words Popup - Change all Button - replaces original eng_OCRFixReplaceList.xml ( 136 KB ) into a eng_OCRFixReplaceList.xml ( 2 KB ) - everything from the original eng_OCRFixReplaceList.xml is lost!
Bug: Prompt for unknown Words Popup - Add to names/noise list Button - doesn't add words from the unknown Words Popup into en_names.xml, but adding from the unknown words table list does add them to en_names.xml
Bug: Prompt for unknown Words Popup - Add to user dictionary Button - doesn't add words from the unknown Words Popup into en_US_user.xml, but adding from the unknown words table list does add them to en_US_user.xml

WORKS: Prompt for unknown Words Popup - USE ALWAYS - adds words to eng_OCRFixReplaceList_User.xml


As you can see, the Tesseract in Subtitle 3.5.6 portable is the fastest: OCR takes only 01 Minute : 50 Seconds with 5 errors!!! What is going on with Tesseract 4??? It's 8 times slower with more errors!!!

It would be really great if you can provide a new version with the Tesseract from 3.5.6 with all mentioned bugs fixed. Well most mentioned bugs are fixed in version 3.5.10, the only remaining bug is the:

Prompt for unknown Words Popup - Change all Button - replaces original eng_OCRFixReplaceList.xml ( 136 KB ) into a eng_OCRFixReplaceList.xml ( 2 KB ) - everything from the original eng_OCRFixReplaceList.xml is lost!

I would really appreciate it, thanks!

wtester7 commented 5 years ago

Your 443 lines DVD.sub file with all OCR options enabled ( except prompt unknown words disabled ):

With 3.5.6 in 1 Minute 34 Seconds finished
With Beta in 2 Minutes 4 Seconds finished

wtester7 commented 5 years ago

I think I know where the problem lies.

In SE Version 3.5.7 nik linked SubtitleEdit.exe with Tesseract 4. In SE Version 3.5.8 nik linked SubtitleEdit.exe back to Tesseract 3.02 but enabled an option to download Tesseract 4 and use it. This means there is the support written in code -> SubtitleEdit.exe to use both Tesseract 3.02 and Tesseract 4. This support was not implemented and written in 3.5.6 SubtitleEdit.exe ( you can't choose what Tesseract Version to use for OCR ).

Maybe this support ( code ) since 3.5.7 is the cause for the massive performance slow down for Tesseract 3.02!

Ding-adong commented 5 years ago

Your 443 lines DVD.sub file with all OCR options enabled ( except prompt unknown words disabled ):

With 3.5.6 in 1 Minute 34 Seconds finished
With Beta in 2 Minutes 4 Seconds finished

Your beta test took 54 seconds longer than mine. Your computer's 'dying'.

Tried looking at the changes in the code and there are too many for me to help.

wtester7 commented 5 years ago

Your 443 lines DVD.sub file with all OCR options enabled ( except prompt unknown words disabled ):

With 3.5.6 in 1 Minute 34 Seconds finished
With Beta in 2 Minutes 4 Seconds finished

Your beta test took 54 seconds longer than mine. Your computer's 'dying'.

Tried looking at the changes in the code and there are too many for me to help.

Trust me, my computer is not dying :) You forgot that I have a 10 year old 1st Generation Quad Core i7 @ 2,67 Ghz and it's very healthy and stable. In the last 10 years I almost never had any bluescreen or shutdown or CPU failure. I can run a torture test in Prime95 without any problems.

In Passmark I have 4,920 points, you have a 6-Core ( I don't know which model ) that would be in the 15,000 points range: https://www.cpubenchmark.net/high_end_cpus.html This means your CPU is about 3 times faster, hence the 54 seconds, it's really simple math...

My old CPU doesn't matter, fact is that SubtitleEdit.exe of 3.5.6 is 2 - 2,5 times faster ( at least for me and most probably for other people too ) than SubtitleEdit.exe of 3.5.7 and other versions till Beta, because of certain code changes... possibly this one https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-468794906

sneaker2 commented 5 years ago

I'm surprised everyone gets so vastly different benchmark results. I tried DVD.idx/.sub on my i5-2500K (AVX, no AVX2) with 3.5.9 beta 103:

Tesseract 3.02: 28 seconds
Tesseract 4 with "Neural Nets LSTM only": 3 minutes 55 seconds

Over 8 times slower ...

I read the following on a different project:

I want to stress there are two independent reasons Tesseract 4 can be slow compared to 3:

  1. Multiprocessing parallel instances of Tesseract 4 without setting OMP_THREAD_LIMIT=1 (leads to too many processes fighting over CPU time)
  2. Running Tesseract 4 on a processor without AVX2

Source: https://github.com/the-paperless-project/paperless/issues/438#issuecomment-462189569

I replaced tesseract 4 with the binary taken from VietOCR 5.4.3 and it only took 1 minute 5 seconds.
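For reason 1 in the quote, here is a minimal Python sketch of how a front end could start each Tesseract 4 process with `OMP_THREAD_LIMIT=1`. The helper names and the image/output paths are made up for illustration, and whether this helps on a given machine would need to be measured. `--oem 1` is the Tesseract 4 command-line switch for "LSTM only".

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def tesseract_command(image_path, output_base, exe="tesseract", lang="eng", oem=1):
    """Build the argument list for one run; oem=1 means 'LSTM only'."""
    return [exe, image_path, output_base, "-l", lang, "--oem", str(oem)]


def run_ocr(image_path, output_base, **kwargs):
    """Run one Tesseract process with the thread limit applied.

    Assumes a tesseract executable is on PATH (or passed via exe=...).
    """
    cmd = tesseract_command(image_path, output_base, **kwargs)
    return subprocess.run(cmd, env=single_thread_env(), check=True)
```

With several such processes running side by side, each one then stays on a single thread instead of all of them competing for every core.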

wtester7 commented 5 years ago

Great catch sneaker2!

sneaker2 wrote: I'm surprised everyone gets so vastly different benchmark results.

Everyone has vastly different benchmark results because of different CPU speed. Faster CPU = faster OCR finish , slower CPU = slower OCR finish , it's really that simple... This means comparing different speed with other users here is useless, what matters is that you personally compare your results with different Subtitle Edit Versions and different Tesseract Versions like you did.

But you are right about the Tesseract 4 build from VietOCR-5.4.3. I have compared it with latest Subtitle Edit Beta's Tesseract 4 and VietOCR-5.4.3 Tesseract 4 is really faster: Benchmark done with my BluRay PGS mks file ( 00274.zip )

All OCR options enabled ( minus Prompt for unknown words disabled )

Subtitle Edit Beta's Tesseract 4: Engine Mode "Default" = 15 Minutes : 35 Seconds
VietOCR 5.4.3 Tesseract 4: Engine Mode "Default" = 12 Minutes : 18 Seconds

That is 3 Minutes 17 Seconds faster or 26%. This result is for me.

Your result was even more impressive: 3 minutes 55 seconds vs 1 minute 5 seconds, that's 2 Minutes 50 Seconds difference, nearly 4 times faster!
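The arithmetic behind these comparisons can be checked with a few lines of Python. It uses only the timings quoted in this thread; the function names are made up for illustration. Note that "26%" matches the speed ratio (1.27x), while the run time itself drops by about 21%.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M Minutes S Seconds' timing to seconds."""
    return minutes * 60 + seconds


def speedup(slow_s, fast_s):
    """How many times faster the fast run is than the slow run."""
    return slow_s / fast_s


def time_saved_percent(slow_s, fast_s):
    """Share of the slow run's time that the fast run saves."""
    return 100.0 * (slow_s - fast_s) / slow_s


se_beta = to_seconds(15, 35)    # Subtitle Edit Beta's Tesseract 4
vietocr = to_seconds(12, 18)    # VietOCR 5.4.3's Tesseract 4

print(se_beta - vietocr)                               # 197 (3 min 17 s)
print(round(speedup(se_beta, vietocr), 2))             # 1.27
print(round(time_saved_percent(se_beta, vietocr), 1))  # 21.1

# sneaker2's numbers: 3 min 55 s vs 1 min 5 s
print(round(speedup(to_seconds(3, 55), to_seconds(1, 5)), 2))  # 3.62
```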

What we know so far:

niksedk commented 5 years ago

About the Tesseract 4 performance from VietOCR - it's probably because VietOCR uses the latest source, where SE uses 4.0 final. I just tested the latest Tesseract source too and OCR'ed a vobsub in 2 mins 16 secs via latest source and got 2 mins 39 secs from T4 4.0.

I think SE 3.5.7 uses a thread pool and SE 3.5.6 uses a single background thread, but there have been a lot of other source changes too... more retries, resizing to get better results etc.

@wtester7: Did you set spell dictionary to "None" when doing these tests? SE counts unknown words and tries resizing/different parameters with Tesseract if unknown words > 0.
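The retry behaviour described above can be pictured roughly like this. It is only an illustrative Python sketch, not SE's actual code: the function names, the scale values and the `recognize(image, scale)` callback are all assumptions. It shows why every line with an unknown word costs extra Tesseract passes, and why dictionary "None" needs only one.

```python
def count_unknown_words(text, dictionary):
    """Count words that are not in the spell check dictionary."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return sum(1 for w in words if w and w not in dictionary)


def ocr_with_retries(image, recognize, dictionary, scales=(1.0, 1.5, 2.0)):
    """Return (best_text, passes_used).

    recognize(image, scale) -> text is a stand-in for one Tesseract call.
    With dictionary=None there is nothing to count, so only one pass runs.
    """
    best_text = recognize(image, scales[0])
    passes = 1
    if dictionary is None:
        return best_text, passes
    best_unknown = count_unknown_words(best_text, dictionary)
    for scale in scales[1:]:
        if best_unknown == 0:
            break  # nothing left to improve
        candidate = recognize(image, scale)  # retry with a resized image
        passes += 1
        unknown = count_unknown_words(candidate, dictionary)
        if unknown < best_unknown:
            best_text, best_unknown = candidate, unknown
    return best_text, passes


def fake_recognize(image, scale):
    """Fake OCR engine for the demo: only gets the word right at 2x size."""
    return "hello world" if scale >= 2.0 else "helIo world"


print(ocr_with_retries("line.png", fake_recognize, None))
print(ocr_with_retries("line.png", fake_recognize, {"hello", "world"}))
```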

wtester7 commented 5 years ago

@wtester7: Did you set spell dictionary to "None" when doing these tests? SE counts unknown words and tries resizing/different parameters with Tesseract if unknown words > 0.

@niksedk I am sorry for doubting you, you're right about the re-tries, it really is the dictionary option. I guess Tesseract always spell-checks against the dictionaries when a dictionary language is enabled. It also means the bigger the dictionaries, the slower the speed, right?

Tesseract 3.02 Test with ( 00274.zip ):

SE 3.5.6 - All options on ( - Prompt for unknown words ) - Dictionary: None = 1 Minute 28 Seconds
SE 3.5.6 - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 1 Minute 24 Seconds

SE Beta - All options on ( - Prompt for unknown words ) - Dictionary: None = 59 Seconds
SE Beta - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 59 Seconds

Your Beta is even more optimized and faster than Ver. 3.5.6! So all is good, no need to look for performance bugs ;-)


Still VietOCR 5.4.3's Tesseract 4 is really much faster than yours:

Tesseract 4 Tests, Mode = Default with ( 00274.zip ):

Tesseract 4 from SE Beta:
SE Beta - All options on ( - Prompt for unknown words ) - Dictionary: None = 6 Minutes 1 Second
SE Beta - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 5 Minutes 59 Seconds

Tesseract 4 from VietOCR 5.4.3:
SE Beta - All options on ( - Prompt for unknown words ) - Dictionary: None = 3 Minutes 29 Seconds
SE Beta - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 3 Minutes 28 Seconds


Conclusion: I will stick to Subtitle Edit Beta but still use Tesseract 3.02. Tesseract 4 with Engine Mode "LSTM only" is really great for accuracy but it's really slow... Btw, can "LSTM only" detect Italic?


Thank you for your work Nik :-)

niksedk commented 5 years ago

@wtester7: Thx for testing again :)

About T4: SE will include the faster version 4.1.0 when it's out in a final version (it's 4.1.0 RC1 atm). And no, I'm afraid it does not support italic formatting - but feel free to make a wish in their issue list ;)

By the way, how did you get a "tesseract.exe" from VietOCR 5.4.3 - I only got a dll file... and also, what version does this "tesseract.exe" report?

sneaker2 commented 5 years ago

tesseract --version:

   tesseract 4.0.0
   leptonica-1.77.0 (Dec 27 2018, 14:56:24) [MSC v.1900 DLL Release x86]
   libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

wtester7 commented 5 years ago

Here you go: https://sourceforge.net/projects/vietocr/files/vietocr/5.4.3/VietOCR-5.4.3.zip/download

niksedk commented 5 years ago

thx :) OK, when 4.1.0 final is out I guess I'll have to test against tesseract from VietOCR. I wonder how it's compiled!? (I just used "vcpkg")

niksedk commented 5 years ago

T4 rc1 --version gives

       tesseract 4.1.0-rc1

wtester7 commented 5 years ago

thx :) OK, when 4.1.0 final is out I guess I'll have to test against tesseract from VietOCR. I wonder how it's compiled!? (I just used "vcpkg")

@niksedk Maybe you should look what @sneaker2 posted:

I read the following on a different project:

I want to stress there are two independent reasons Tesseract 4 can be slow compared to 3:

  1. Multiprocessing parallel instances of Tesseract 4 without setting OMP_THREAD_LIMIT=1 (leads to too many processes fighting over CPU time)
  2. Running Tesseract 4 on a processor without AVX2 the-paperless-project/paperless#438 (comment)

Or you can ask the dev of VietOCR directly here: https://github.com/nguyenq

Ding-adong commented 5 years ago

@niksedk @wtester7: Did you set spell dictionary to "None" when doing these tests? SE counts unknown words and tries resizing/different parameters with Tesseract if unknown words > 0.

I did the experiment with the above advice and tess 4 was faster than all other previous experiments by 10 seconds. Previously tess 4 was slower by 10 seconds thus a swing/turnaround/saving of 20 seconds. I wish I knew about this before. How about a manual?

Ding-adong commented 5 years ago

How precisely did you lot use vietocr?

Not sure if I did it right as it was 24 seconds slower than previous record. Everyone's claiming it's faster!!!

wtester7 commented 5 years ago

@Ding-adong see my benchmarks with options here: https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-469018111

Download VietOCR 5.4.3 here: https://sourceforge.net/projects/vietocr/files/vietocr/5.4.3/VietOCR-5.4.3.zip/download

Copy the content of the tesseract-ocr folder from VietOCR 5.4.3 to the Subtitle Edit Beta/Tesseract4 folder.

In the OCR options disable all options & set Dictionary to "None". The only enabled options are Italic + Music Symbols.

Compare your results from VietOCR 5.4.3's Tesseract 4 with the original Tesseract 4 from Beta.

P.S I have deleted the vie.traineddata from VietOCR 5.4.3 in the tessdata folder because I don't need the Vietnamese language.

Ding-adong commented 5 years ago

Tess 3.02 - 30 seconds - 202 errors - spellchecker needed to correct numerous LL inside words instead of ll. Other errors needed personal corrections.

Tess 4 original - 34 seconds - 212 errors - spellchecker needed to correct some (less than above) LL inside words instead of ll. Other errors needed personal corrections, less than above. Post OCR quicker by 2 ish minutes compared to above.

Tess 4 LSTM - 65 seconds - 260 errors - mostly or = instead of -. | instead of I - spellchecker needed to correct = to - and no LL. Other errors needed personal corrections.

Tess 4 Tess + LSTM - 65 seconds - 233 errors - mostly or = instead of -. | instead of I - spellchecker needed to correct = to - and no LL. Other errors needed personal corrections.

LSTM is good at recognising LL and showed the correct ll. The l in my vob's font looks like an l with a small curved dash at the bottom, so tess sees it as LL but LSTM sees it as ll. However, LSTM and tess both see instead of - and | instead of I, but both are easily fixed by fix common errors. Acceptable to me. LSTM also sees = instead of - and randomly picks = or . The = is a major let down. How can one dash become 2 dashes ( = )? Plainly ridiculous.

@niksedk I'd added 5 lines to the ocrreplacelist.

<Word from="He'Ll" to="He'll" />
<Word from="he'Ll" to="he'll" />
<Word from="We'Ll" to="We'll" />
<Word from="we'Ll" to="we'll" />
<Beginning from="= " to="- " />

Tested Tess 4 Tess + LSTM again and was a lot quicker than previous tests post ocr, as fix common errors corrected the = into -.
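To show what those five rules do to an OCR'ed line, here is a small Python sketch. It is not how SE applies its replace list internally: the `<Rules>` wrapper element and the function names are made up for this example, and the whole-word matching is simplified to exact tokens split on spaces (so a word followed by a comma would not match).

```python
import xml.etree.ElementTree as ET

# The five rules from above, wrapped in a hypothetical root element
# so the fragment can be parsed on its own.
RULES_XML = """
<Rules>
  <Word from="He'Ll" to="He'll" />
  <Word from="he'Ll" to="he'll" />
  <Word from="We'Ll" to="We'll" />
  <Word from="we'Ll" to="we'll" />
  <Beginning from="= " to="- " />
</Rules>
"""


def load_rules(xml_text):
    """Return (whole-word map, list of line-beginning replacements)."""
    root = ET.fromstring(xml_text.strip())
    words = {e.get("from"): e.get("to") for e in root.iter("Word")}
    beginnings = [(e.get("from"), e.get("to")) for e in root.iter("Beginning")]
    return words, beginnings


def fix_line(line, words, beginnings):
    """Apply the first matching line-beginning rule, then whole-word rules."""
    for old, new in beginnings:
        if line.startswith(old):
            line = new + line[len(old):]
            break
    return " ".join(words.get(token, token) for token in line.split(" "))


WORDS, BEGINNINGS = load_rules(RULES_XML)
print(fix_line("= He'Ll be back", WORDS, BEGINNINGS))  # - He'll be back
```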

With Vietocr.

Tess 4 original - error - Tess returned with code 1.

Tess 4 LSTM - 31 seconds - 242 errors - no= at all, strange. New errors such as "L instead of "I. Post OCR was quicker.

Tess 4 Tess + LSTM - error - Tess returned with code 1.

Can someone please explain the code 1 error.

wtester7 commented 5 years ago

@Ding-adong

VietOCR 5.4.3's Tesseract 4 is not fully compatible with SubtitleEdit.exe from Beta.

But Engine Mode "Default" and "LSTM only" is working.

Try to compare speed with VietOCR 5.4.3's Tesseract 4 with:

vs. Original Tesseract 4 from Beta with same settings...

Please let me know your speed result, VietOCR 5.4.3's Tesseract 4 should be much faster than the Original Tesseract 4 from Beta, based on my benchmark result and from @sneaker2

Btw. don't worry about the errors, you have disabled the dictionary/spellchecking and Fix OCR errors etc. These options should be only used for benchmark testing ;-)

wtester7 commented 5 years ago

Ding-adong wrote:

Tess 4 LSTM ( Original ) - 65 seconds

Tess 4 LSTM ( VietOCR 5.4.3 ) - 31 seconds

You see VietOCR 5.4.3's Tesseract is indeed much faster, for you it's 2 times faster, for sneaker2 it is nearly 4 times faster and for me ( I did the test with Engine mode "Default" ) it is also nearly 2 times faster...

wtester7 commented 5 years ago

Btw @niksedk can Tesseract 4 Engine Mode "Default" detect italic, or only Engine Mode "Original Tesseract only"?

niksedk commented 5 years ago

It's really not saying much testing VietOCR Tesseract as we don't know the exact build / commit / date... Tesseract 4.1.0 from today is attached. tesseract4.0.1-rc1.zip

I think "Default" mode will use LSTM (neural networks) if available, otherwise fall back to something else, so it's probably correct to choose "original"...

wtester7 commented 5 years ago

@niksedk Thx! I will test it :-)

niksedk commented 5 years ago

You can compile Tesseract via instructions here: https://github.com/tesseract-ocr/tesseract/wiki/Compiling

I use these commands

   vcpkg remove tesseract:x86-windows-static
   vcpkg install tesseract:x86-windows-static  --head

wtester7 commented 5 years ago

My result with ( 00274.zip ):

Tesseract 4.0.1 RC1 from https://github.com/SubtitleEdit/subtitleedit/files/2923526/tesseract4.0.1-rc1.zip:

SE Beta - All options off, Dictionary: None = 3 Minutes 22 Seconds

It's the fastest so far, a little bit faster ( 7 Seconds ) than the one from VietOCR 5.4.3. I guess this case is finally closed, thx @niksedk :-)

Ding-adong commented 5 years ago

Removed the vietocr eng.traineddata ( 4,017kb ) and put the SE eng.traineddata ( 22,917kb ) back.

Tess 4 original - 34 seconds - 216 errors - spellchecker needed to correct some (less than above) LL inside words instead of ll. Other errors needed personal corrections, less than above. Slower post ocr with personal corrections due to LL

Tess 4 LSTM - 32 seconds - 236 errors - = made a comeback thus SE eng.traineddata is producing the errors not LSTM as vietocr doesn't produce = errors. Other errors needed personal corrections. Quicker overall when <Beginning from="= " to="- " /> in ocrreplacelist.

Tess 4 Tess + LSTM - 30 seconds - 222 errors - = made a comeback thus SE eng.traineddata is producing the errors not LSTM as vietocr doesn't produce = errors. Other errors needed personal corrections. Faster overall when <Beginning from="= " to="- " /> in ocrreplacelist and less personal corrections.

What is clear is that the vietocr tesseract.exe is better than the SE version. I watched the cpu usage. SE tess.exe used all the cores at 100% then goes to 0% for a couple of seconds, 2 to 5, then 100% then 0%, thus taking 60ish seconds. Vietocr tess.exe used all the cores at 80 to 100% throughout and doesn't stop at 0% for a couple of seconds.

I removed all the vietocr files and only copied the vietocr tesseract.exe to replace SE's. Kept SE eng.traineddata and so on. There is no need to copy the whole vietocr folder and all the files. LSTM is better at recognising l than without, so LSTM is a must.

The winner of my grand experiments is Tess 4 Tess + LSTM. Measured from start to finish, not just ocring, it is quicker.

wtester7 commented 5 years ago

You can compile Tesseract via instructions here: https://github.com/tesseract-ocr/tesseract/wiki/Compiling

I use these commands

   vcpkg remove tesseract:x86-windows-static
   vcpkg install tesseract:x86-windows-static  --head

Btw @niksedk why don't you use 64 Bit? Isn't 32 Bit much slower???

Ding-adong commented 5 years ago

tesseract4.0.1-rc1.zip result is the same as Tess 4 Tess + LSTM above. Tesseract.exe filesize is smaller than vietocr's version that's all.

wtester7 commented 5 years ago

tesseract4.0.1-rc1.zip result is the same as Tess 4 Tess + LSTM above. Tesseract.exe filesize is smaller than vietocr's version that's all.

Things are getting confusing @Ding-adong , I don't exactly know what you have written ;-)

For me Tesseract 4.0.1 RC1 is 7 Seconds faster than VietOCR's Tesseract 4...

Ding-adong commented 5 years ago

@niksedk Where did you source your tess file from? There are many versions going around.

wtester7 commented 5 years ago

Also it seems he is using 32 Bit Version instead of the 64 Bit Version, I would like to know why... I don't understand why people nowadays are still using 32 Bit... :-(

See comment here: https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-469052263

Most definitely that's the reason for the performance loss...

Ding-adong commented 5 years ago

I thought SE migrated over to 64 Bit a couple of years ago.

wtester7 commented 5 years ago

And is that why https://github.com/SubtitleEdit/subtitleedit/files/2923526/tesseract4.0.1-rc1.zip was compiled in 64 Bit ( is it really 64 Bit?!?! ), and why it is also faster than the old Tesseract4 from Beta?
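One way to settle the "is it really 64 Bit" question without guessing is to read the Machine field from the exe's PE header. A minimal Python sketch follows. The function names are made up, and only the two common machine codes are mapped; anything else is returned as a hex value.

```python
import struct

# COFF machine codes for the two common Windows targets.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)"}


def pe_machine_from_bytes(data):
    """Return the target architecture recorded in a PE image's header."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ header")
    # Offset 0x3C holds the file offset of the 'PE\0\0' signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The Machine field is the first 2 bytes after the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, hex(machine))


def pe_machine(path):
    """Same check for a file on disk, e.g. a tesseract.exe."""
    with open(path, "rb") as f:
        return pe_machine_from_bytes(f.read())
```

Calling `pe_machine` on the tesseract.exe from each zip would show whether it is a 32-bit or 64-bit build.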

niksedk commented 5 years ago

SE runs 32-bit on 32-bit operating systems, and 64-bit on 64-bit operating systems... it's compiled to "ANY CPU" which is then converted to native code when running (or ngen).

The above "tesseract4.0.1-rc1.zip" (should really have been "tesseract4.1.0-rc1.zip") is 32-bit... I've started a 64-bit build, but it takes a while to compile on my old machine ;)

Ding-adong commented 5 years ago

New tess 3405kb - old tess 2190kb. Obviously some improvements were made.

niksedk commented 5 years ago

New fresh T4 64-bit build: tesseract4.1.0-x64-rc1.zip

wtester7 commented 5 years ago

So :

vcpkg install tesseract:x64-windows-static --head

is no performance gain vs:

vcpkg install tesseract:x86-windows-static --head

?

Edit: Thx will try the T4 64-bit build tesseract4.1.0-x64-rc1.zip

Ding-adong commented 5 years ago

64 build same result as Tess 4 Tess + LSTM. All is good.

wtester7 commented 5 years ago

Nope:

Your 64 build tesseract4.1.0-x64-rc1.zip is slower than the 32 Bit build from: Tesseract 4.1.0 RC1

Tesseract 4.1.0 32 Bit = 3 Minutes 22 Seconds
Your Tesseract 4.1.0 64 Bit = 3 Minutes 48 Seconds

Btw, I am using Win7 x64 ;)

niksedk commented 5 years ago

OK, thx for the info :)

Help file updated a tiny bit: https://www.nikse.dk/SubtitleEdit/Help#importvobsub

Added: If you think Tesseract is too slow, you could set the spell check dictionary to "None" for better performance and then fix errors e.g. via "Fix common errors" plus spell check afterwards.

wtester7 commented 5 years ago

@niksedk Did you check my last post? https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-469057631

If so, did you find out why your 64 Build is slower than your 32 Build, although I'm using Win 7 x64?

Maybe you used:

vcpkg install tesseract:x86-windows-static --head

instead of

vcpkg install tesseract:x64-windows-static --head

and it is an emulation?