Closed wtester7 closed 5 years ago
Your 443 lines DVD.sub file with all OCR options enabled ( except prompt unknown words disabled ):
With 3.5.6 in 1 Minute 34 Seconds finished
With Beta in 2 Minutes 4 Seconds finished
I think I know where the problem lies.
In SE Version 3.5.7 nik linked SubtitleEdit.exe with Tesseract 4. In SE Version 3.5.8 nik linked SubtitleEdit.exe back to Tesseract 3.02 but enabled an option to download Tesseract 4 and use it. This means there is the support written in code -> SubtitleEdit.exe to use both Tesseract 3.02 and Tesseract 4. This support was not implemented and written in 3.5.6 SubtitleEdit.exe ( you can't choose what Tesseract Version to use for OCR ).
Maybe this support ( code ) since 3.5.7 is the cause for the massive performance slow down for Tesseract 3.02!
Your 443 lines DVD.sub file with all OCR options enabled ( except prompt unknown words disabled ):
With 3.5.6 in 1 Minute 34 Seconds finished
With Beta in 2 Minutes 4 Seconds finished
Your beta test took 54 seconds longer than mine. Your computer's 'dying'.
Tried looking at the changes in codes and there's too many for me to help.
Trust me, my computer is not dying :) You forgot that I have a 10 year old 1st Generation Quad Core i7 @ 2,67 Ghz and it's very healthy and stable. In the last 10 years I almost never had any bluescreen or shutdown or CPU failure. I can run a torture test in Prime95 without any problems.
In Passmark I have 4,920 points, you have a 6-Core ( I don't know which model ) that would be in the 15,000 points range. https://www.cpubenchmark.net/high_end_cpus.html This means your CPU is about 3 times faster, thus 54 seconds faster, it's really simple math...
My old CPU doesn't matter, the fact is that SubtitleEdit.exe of 3.5.6 is 2x - 2,5x faster ( at least for me and most probably for other people too ) than SubtitleEdit.exe of 3.5.7 and the other versions up to the Beta, because of certain code changes... possibly this one https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-468794906
I'm surprised everyone gets so vastly different benchmark results. I tried DVD.idx/.sub on my i5-2500K (AVX, no AVX2) with 3.5.9 beta 103:
Tesseract 3.02: 28 seconds
Tesseract 4 with "Neural Nets LSTM only": 3 minutes 55 seconds
Over 8 times slower ...
I read the following on a different project:
I want to stress there are two independent reasons Tesseract 4 can be slow compared to 3:
- Multiprocessing parallel instances of Tesseract 4 without setting OMP_THREAD_LIMIT=1 (leads to too many processes fighting over CPU time)
- Running Tesseract 4 on a processor without AVX2
https://github.com/the-paperless-project/paperless/issues/438#issuecomment-462189569
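The first point in the quote could look roughly like this in a wrapper script. This is only a sketch, assuming `tesseract` is on the PATH and is started once per image; the helper names `build_env`, `build_command` and `ocr_image` are made up for illustration and are not part of Subtitle Edit or paperless. `OMP_THREAD_LIMIT` is the standard OpenMP environment variable mentioned in the quote.

```python
import os
import subprocess


def build_env(thread_limit=1):
    """Return a copy of the current environment with OpenMP threads capped."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def build_command(image_path, lang="eng", oem=1):
    """Build a tesseract command line that prints recognised text to stdout.

    --oem 1 selects the LSTM engine only; --oem 0 is the legacy engine.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--oem", str(oem)]


def ocr_image(image_path, lang="eng", oem=1):
    """Run one tesseract process limited to a single OpenMP thread."""
    result = subprocess.run(
        build_command(image_path, lang, oem),
        env=build_env(1),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With the limit set, several such processes can run side by side (one per CPU core) without each of them spawning its own set of OpenMP threads.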
I replaced Tesseract 4 with the binary taken from VietOCR 5.4.3 and it only took 1 minute 5 seconds.
Great catch sneaker2!
sneaker2 wrote: I'm surprised everyone gets so vastly different benchmark results.
Everyone has vastly different benchmark results because of different CPU speed. Faster CPU = faster OCR finish , slower CPU = slower OCR finish , it's really that simple... This means comparing different speed with other users here is useless, what matters is that you personally compare your results with different Subtitle Edit Versions and different Tesseract Versions like you did.
But you are right about the Tesseract 4 build from VietOCR-5.4.3. I have compared it with latest Subtitle Edit Beta's Tesseract 4 and VietOCR-5.4.3 Tesseract 4 is really faster:
Benchmark done with my BluRay PGS mks file ( 00274.zip )
All OCR options enabled ( minus Prompt for unknown words disabled )
Subtitle Edit Beta's Tesseract 4: Engine Mode "Default" = 15 Minutes : 35 Seconds
VietOCR 5.4.3 Tesseract 4: Engine Mode "Default" = 12 Minutes : 18 Seconds
That is 3 Minutes 17 Seconds faster or 26%. This result is for me.
Your result was even more impressive: 3 minutes 55 seconds vs 1 minute 5 seconds, that's 2 Minutes 50 Seconds difference, about 3.6 times faster!
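For anyone who wants to redo this arithmetic with their own timings, here is a small sketch. The function names `to_seconds` and `compare` are made up for this example; the input strings are the timings quoted above.

```python
import re


def to_seconds(text):
    """Convert a duration like '15 Minutes : 35 Seconds' to seconds."""
    minutes = re.search(r"(\d+)\s*min", text, re.IGNORECASE)
    seconds = re.search(r"(\d+)\s*sec", text, re.IGNORECASE)
    total = 0
    if minutes:
        total += int(minutes.group(1)) * 60
    if seconds:
        total += int(seconds.group(1))
    return total


def compare(slow, fast):
    """Return (seconds saved, speed-up factor, percent less time)."""
    s, f = to_seconds(slow), to_seconds(fast)
    return s - f, s / f, 100.0 * (s - f) / s


# BluRay PGS test: 197 seconds saved, about 1.27x the speed, about 21% less time.
print(compare("15 Minutes : 35 Seconds", "12 Minutes : 18 Seconds"))
# DVD.sub test: 170 seconds saved, about 3.6x the speed.
print(compare("3 minutes 55 seconds", "1 minute 5 seconds"))
```

Note that "percent faster" and "percent less time" are different numbers for the same pair of timings, so it helps to say which one is meant.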
What we know so far:
About the Tesseract 4 performance from VietOCR - it's probably because VietOCR uses the latest source, whereas SE uses 4.0 final. I just tested the latest Tesseract source too and OCR'ed a vobsub in 2 mins 16 secs via latest source and got 2 mins 39 secs from T4 4.0.
I think SE 3.5.7 uses a thread pool and SE 3.5.6 uses a single background thread, but there have been a lot of other source changes too... more retries, resizing to get better results etc.
@wtester7: Did you set spell dictionary to "None" when doing these tests? SE counts unknown words and tries resizing/different parameters with Tesseract if unknown words > 0.
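To illustrate why the dictionary setting affects speed, here is a rough sketch of such a retry loop. It is NOT Subtitle Edit's actual code: the function names, the scale values and the `run_ocr` callback are all assumptions made up for this example. The only thing taken from the comment above is the idea that extra OCR passes happen when unknown words > 0.

```python
def count_unknown_words(text, dictionary):
    """Count words not found in the dictionary; None disables the check."""
    if dictionary is None:
        return 0
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return sum(1 for w in words if w and w not in dictionary)


def ocr_with_retries(image, run_ocr, dictionary, scales=(1.0, 1.5, 2.0)):
    """Try each scale until a pass has no unknown words; keep the best pass.

    run_ocr(image, scale) is a hypothetical callback that returns text.
    Returns (text, number of OCR passes used).
    """
    best_text, best_unknown, passes = None, None, 0
    for scale in scales:
        text = run_ocr(image, scale)
        passes += 1
        unknown = count_unknown_words(text, dictionary)
        if best_unknown is None or unknown < best_unknown:
            best_text, best_unknown = text, unknown
        if unknown == 0:
            break  # nothing left to improve, stop retrying
    return best_text, passes
```

With `dictionary=None` every image gets exactly one pass, which would explain why "All options on" and "All options off" take about the same time once the dictionary is set to "None".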
@niksedk I am sorry for doubting you, you're right about the retries, it really is the dictionary option. I guess the OCR always spell-checks against the dictionaries when a dictionary for the language is enabled. It also means the bigger the dictionaries, the slower the speed, right?
Tesseract 3.02 Test with ( 00274.zip ):
SE 3.5.6 - All options on ( - Prompt for unknown words ) - Dictionary: None = 1 Minute 28 Seconds
SE 3.5.6 - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 1 Minute 24 Seconds
SE Beta - All options on ( - Prompt for unknown words ) - Dictionary: None = 59 Seconds
SE Beta - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 59 Seconds
Your Beta is even more optimized and faster than Ver. 3.5.6! So all is good, no need to look for performance bugs ;-)
Still VietOCR 5.4.3's Tesseract 4 is really much faster than yours:
Tesseract 4 Tests, Mode = Default with ( 00274.zip ):
Tesseract 4 from SE Beta:
SE Beta - All options on ( - Prompt for unknown words ) - Dictionary: None = 6 Minutes 1 Second
SE Beta - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 5 Minutes 59 Seconds
Tesseract 4 from VietOCR 5.4.3:
SE Beta - All options on ( - Prompt for unknown words ) - Dictionary: None = 3 Minutes 29 Seconds
SE Beta - All options off ( only Italic + Music Symbol on ) - Dictionary: None = 3 Minutes 28 Seconds
Conclusion: I will stick to Subtitle Edit Beta but still use Tesseract 3.02. Tesseract 4 with Engine Mode "LSTM only" is really great for accuracy but it's really slow... Btw, can "LSTM only" detect Italic?
Thank you for your work Nik :-)
@wtester7: Thx for testing again :)
About T4: SE will include the faster version 4.1.0 when it's out in a final version (it's 4.1.0 RC1 atm). And no, I'm afraid it does not support italic formatting - but feel free to make a wish in their issue list ;)
By the way, how did you get a "tesseract.exe" from VietOCR 5.4.3 - I only got a DLL file... and also, what version does this "tesseract.exe" report?
tesseract --version:
tesseract 4.0.0
leptonica-1.77.0 (Dec 27 2018, 14:56:24) [MSC v.1900 DLL Release x86]
libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
thx :) OK, when 4.1.0 final is out I guess I'll have to test against tesseract from VietOCR. I wonder how it's compiled!? (I just used "vcpkg")
T4 rc1 --version gives
tesseract 4.1.0-rc1
@niksedk Maybe you should look what @sneaker2 posted:
I read the following on a different project:
I want to stress there are two independent reasons Tesseract 4 can be slow compared to 3:
- Multiprocessing parallel instances of Tesseract 4 without setting OMP_THREAD_LIMIT=1 (leads to too many processes fighting over CPU time)
- Running Tesseract 4 on a processor without AVX2 the-paperless-project/paperless#438 (comment)
Or you can ask the dev of VietOCR directly here: https://github.com/nguyenq
@niksedk @wtester7: Did you set spell dictionary to "None" when doing these tests? SE counts unknown words and tries resizing/different parameters with Tesseract if unknown words > 0.
I did the experiment with the above advice and tess 4 was faster than all other previous experiments by 10 seconds. Previously tess 4 was slower by 10 seconds thus a swing/turnaround/saving of 20 seconds. I wish I knew about this before. How about a manual?
How precisely did you lot use vietocr?
Not sure if I did it right as it was 24 seconds slower than previous record. Everyone's claiming it's faster!!!
@Ding-adong see my benchmarks with options here: https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-469018111
Download VietOCR 5.4.3 here: https://sourceforge.net/projects/vietocr/files/vietocr/5.4.3/VietOCR-5.4.3.zip/download
Copy the content of the tesseract-ocr folder from VietOCR 5.4.3 to the Subtitle Edit Beta/Tesseract4 folder.
In the OCR options disable all options & set Dictionary to "None". The only enabled options are Italic + Music Symbols.
Compare your results from VietOCR 5.4.3's Tesseract 4 with the original Tesseract 4 from Beta.
P.S I have deleted the vie.traineddata from VietOCR 5.4.3 in the tessdata folder because I don't need the Vietnamese language.
Tess 3.02 - 30 seconds - 202 errors - spellchecker needed to correct numerous "LL" inside words instead of "ll". Other errors needed personal corrections.
Tess 4 original - 34 seconds - 212 errors - spellchecker needed to correct some (less than above) "LL" inside words instead of "ll". Other errors needed personal corrections, less than above. Post OCR quicker by 2 ish minutes compared to above.
Tess 4 LSTM - 65 seconds - 260 errors - mostly "—" or "=" instead of "-", and "|" instead of "I" - spellchecker needed to correct "=" to "-", and no "LL". Other errors needed personal corrections.
Tess 4 Tess + LSTM - 65 seconds - 233 errors - mostly "—" or "=" instead of "-", and "|" instead of "I" - spellchecker needed to correct "=" to "-", and no "LL". Other errors needed personal corrections.
LSTM is good at recognising "LL" and showed the correct "ll". The font in my vob: "l" looks like "l" with a small curve dash at the bottom so tess sees it as "LL" but LSTM sees it as "ll". However LSTM and tess see "—" instead of "-" and "|" instead of "I" but both are easily fixed by fix common errors. Acceptable to me. LSTM also sees "=" instead of "-" and randomly picks "=" or "—". The "=" is a major let down. How can one dash become 2 dashes "=", plainly ridiculous.
@niksedk I'd added 5 lines to the ocrreplacelist.
<Word from="He'Ll" to="He'll" />
<Word from="he'Ll" to="he'll" />
<Word from="We'Ll" to="We'll" />
<Word from="we'Ll" to="we'll" />
<Beginning from="= " to="- " />
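To show what these five entries do to a line of OCR text, here is a small sketch that applies them outside of Subtitle Edit. It is an illustration only: the `<Rules>` wrapper element and the functions `load_rules` and `fix_line` are made up for this example, and the real replace list file has its own structure and matching logic. Only the `<Word>` and `<Beginning>` entries come from the post above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical wrapper element; only the <Word> and <Beginning> entries
# are taken from the post above.
RULES_XML = """
<Rules>
  <Word from="He'Ll" to="He'll" />
  <Word from="he'Ll" to="he'll" />
  <Word from="We'Ll" to="We'll" />
  <Word from="we'Ll" to="we'll" />
  <Beginning from="= " to="- " />
</Rules>
"""


def load_rules(xml_text):
    """Return (whole-word rules, line-beginning rules) as lists of pairs."""
    root = ET.fromstring(xml_text)
    words = [(e.get("from"), e.get("to")) for e in root.iter("Word")]
    begins = [(e.get("from"), e.get("to")) for e in root.iter("Beginning")]
    return words, begins


def fix_line(line, words, begins):
    """Apply line-beginning rules first, then whole-word replacements."""
    for old, new in begins:
        if line.startswith(old):
            line = new + line[len(old):]
            break
    for old, new in words:
        # Lookarounds keep the match to whole words even with an apostrophe.
        pattern = r"(?<!\w)" + re.escape(old) + r"(?!\w)"
        line = re.sub(pattern, lambda _m: new, line)
    return line


words, begins = load_rules(RULES_XML)
print(fix_line("= He'Ll be back", words, begins))
```

The `<Beginning>` rule only touches "= " at the start of a line, so an equals sign in the middle of a line is left alone.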
Tested Tess 4 Tess + LSTM again and was a lot quicker than previous tests post ocr, as fix common errors corrected the "=" into "-".
With Vietocr.
Tess 4 original - error - Tess returned with code 1.
Tess 4 LSTM - 31 seconds - 242 errors - no "=" at all, strange. New errors such as '"L' instead of '"I'. Post OCR was quicker.
Tess 4 Tess + LSTM - error - Tess returned with code 1.
Can someone please explain the code 1 error.
@Ding-adong
VietOCR 5.4.3's Tesseract 4 is not fully compatible with SubtitleEdit.exe from Beta.
But Engine Mode "Default" and "LSTM only" is working.
Try to compare speed with VietOCR 5.4.3's Tesseract 4 with:
vs. Original Tesseract 4 from Beta with same settings...
Please let me know your speed result, VietOCR 5.4.3's Tesseract 4 should be much faster than the Original Tesseract 4 from Beta, based on my benchmark result and from @sneaker2
Btw. don't worry about the errors, you have disabled the dictionary/spellchecking and Fix OCR errors etc. These options should be only used for benchmark testing ;-)
Ding-adong wrote:
Tess 4 LSTM ( Original ) - 65 seconds
Tess 4 LSTM ( VietOCR 5.4.3 ) - 31 seconds
You see VietOCR 5.4.3's Tesseract is indeed much faster: for you it's about 2x faster, for sneaker2 it is nearly 4x faster and for me ( I did the test with Engine mode "Default" ) it is also nearly 2x faster...
Btw @niksedk can Tesseract 4 Engine Mode "Default" detect italic, or only Engine Mode "Original Tesseract only"?
It's really not saying much testing VietOCR Tesseract as we don't know the exact build / commit / date... Tesseract 4.1.0 from today is attached. tesseract4.0.1-rc1.zip
I think "Default" mode will use LSTM (neural networks) if available, otherwise fall back to something else, so it's probably correct to choose "original"...
@niksedk Thx! I will test it :-)
You can compile Tesseract via instructions here: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
I use these commands
vcpkg remove tesseract:x86-windows-static
vcpkg install tesseract:x86-windows-static --head
My result with ( 00274.zip ):
Tesseract 4.0.1 RC1 from https://github.com/SubtitleEdit/subtitleedit/files/2923526/tesseract4.0.1-rc1.zip:
SE Beta - All options off, Dictionary: None = 3 Minutes 22 Seconds
It's the fastest so far, a little bit faster ( 7 Seconds ) than the one from VietOCR 5.4.3. I guess this case is finally closed, thx @niksedk :-)
Removed vietocr eng.traineddata ( 4,017kb ) and put SE eng.traineddata ( 22,917kb ) back.
Tess 4 original - 34 seconds - 216 errors - spellchecker needed to correct some (less than above) LL inside words instead of ll. Other errors needed personal corrections, less than above. Slower post ocr with personal corrections due to LL
Tess 4 LSTM - 32 seconds - 236 errors - "=" made a comeback thus SE eng.traineddata is producing the errors not LSTM as vietocr doesn't produce "=" errors. Other errors needed personal corrections. Quicker overall when <Beginning from="= " to="- " /> in ocrreplacelist.
Tess 4 Tess + LSTM - 30 seconds - 222 errors - "=" made a comeback thus SE eng.traineddata is producing the errors not LSTM as vietocr doesn't produce "=" errors. Other errors needed personal corrections. Faster overall when <Beginning from="= " to="- " /> in ocrreplacelist and less personal corrections.
What is clear is that vietocr tesseract.exe is better than the SE version. I watched the cpu usage. SE tess.exe used all the cores at 100% then goes to 0% for a couple of seconds, 2 to 5, then 100% then 0%, thus taking 60ish seconds. Vietocr tess.exe used all the cores at 80 to 100% throughout and doesn't stop at 0% for a couple of seconds.
I removed all the vietocr files and only copied vietocr tesseract.exe to replace SE's. Kept SE eng.traineddata and so on. There is no need to copy all of vietocr folder and all the files. LSTM is better at recognising "l" than without, so LSTM is a must.
The winner of my grand experiments is Tess 4 Tess + LSTM. Measured from start to finish, not just ocring, it is quicker.
Btw @niksedk why don't you use 64 Bit? Isn't 32 Bit much slower???
tesseract4.0.1-rc1.zip result is the same as Tess 4 Tess + LSTM above. Tesseract.exe filesize is smaller than vietocr's version that's all.
Things are getting confusing @Ding-adong , I don't exactly know what you have written ;-)
For me Tesseract 4.0.1 RC1 is 7 Seconds faster than VietOCR's Tesseract 4...
@niksedk Where did you source your tess file from? There are many versions going around.
Also it seems he is using 32 Bit Version instead of the 64 Bit Version, I would like to know why... I don't understand why people nowadays are still using 32 Bit... :-(
See comment here: https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-469052263
Most definitely that's why the performance loss...
I thought SE migrated over to 64 Bit a couple of years ago.
And is that why https://github.com/SubtitleEdit/subtitleedit/files/2923526/tesseract4.0.1-rc1.zip was compiled in 64 Bit ( is it really 64 Bit?!?! ), and why it is also faster than the old Tesseract 4 from Beta?
SE runs 32-bit on 32-bit operating systems, and 64-bit on 64-bit operating systems... it's compiled to "ANY CPU" which is then converted to native code when running (or ngen).
The above "tesseract4.0.1-rc1.zip" (should really have been "tesseract4.1.0-rc1.zip") is 32-bit... I've started a 64-bit build, but it takes a while to compile on my old machine ;)
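If anyone wants to check for themselves whether a given tesseract.exe is a 32-bit or 64-bit build, the Machine field in the file's PE header tells you. A minimal sketch; the function name `pe_bitness` is made up for this example, while the header offsets and the two machine codes follow the Windows PE format.

```python
import struct

MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x64)"}


def pe_bitness(path):
    """Read the Machine field of a Windows PE file and describe it."""
    with open(path, "rb") as f:
        header = f.read(64)
        if len(header) < 64 or header[:2] != b"MZ":
            raise ValueError("not a DOS/PE executable")
        # Offset 0x3C holds the file offset of the PE signature.
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        f.seek(pe_offset)
        if f.read(4) != b"PE\x00\x00":
            raise ValueError("PE signature not found")
        # The 2-byte Machine field follows the signature directly.
        (machine,) = struct.unpack("<H", f.read(2))
    return MACHINE_NAMES.get(machine, "unknown machine type 0x%04X" % machine)
```

Call it with the path to the executable in question, for example `pe_bitness(r"Tesseract4\tesseract.exe")` (that path is only an example).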
New tess 3405kb - old tess 2190kb. Obviously some improvements were made.
New fresh T4 64-bit build: tesseract4.1.0-x64-rc1.zip
So :
vcpkg install tesseract:x64-windows-static --head
is no performance gain vs:
vcpkg install tesseract:x86-windows-static --head
?
Edit: Thx will try the T4 64-bit build tesseract4.1.0-x64-rc1.zip
64 build same result as Tess 4 Tess + LSTM. All is good.
Nope:
Your 64 build tesseract4.1.0-x64-rc1.zip is slower than the 32 Bit build from: Tesseract 4.1.0 RC1
Tesseract 4.1.0 32 Bit = 3 Minutes 22 Seconds
Your Tesseract 4.1.0 64 Bit = 3 Minutes 48 Seconds
Btw, I am using Win7 x64 ;)
OK, thx for the info :)
Help file updated a tiny bit: https://www.nikse.dk/SubtitleEdit/Help#importvobsub
Added: "If you think Tesseract is too slow, you could set the spell check dictionary to "None" for better performance and then fix errors e.g. via "Fix common errors" plus spell check afterwards."
@niksedk Did you check my last post? https://github.com/SubtitleEdit/subtitleedit/issues/3431#issuecomment-469057631
If so, did you find out why your 64 Build is slower than your 32 Build, although I'm using Win 7 x64?
Maybe you used:
vcpkg install tesseract:x86-windows-static --head
instead of
vcpkg install tesseract:x64-windows-static --head
and it is an emulation?
Hello,
I have discovered a nasty BUG in the portable Subtitle Edit 3.5.10 Version and I am really disappointed with Tesseract 4, it's just too slow!!! I'm using a first generation Quad Core 2,67 Ghz and Win7 x64.
I have done various benchmarks with a BluRay 1117 lines PGS and the OCR with Tesseract 4:
Subtitle 3.5.10 portable:
Tesseract 3.02 ( 4 errors ) = 07 Minutes : 20 Seconds
Tesseract 4 , Engine Mode "Default" ( 4 errors ) = 15 Minutes : 35 Seconds
Tesseract 4 , Engine Mode "Original Tesseract only" ( 14 errors ) = 12 Minutes : 32 Seconds
Tesseract 4 , Engine Mode "LSTM only" ( 2 errors ) = 15 Minutes : 22 Seconds
Tesseract 4 , Engine Mode "Tesseract + LSTM" ( 5 errors ) = 16 Minutes : 46 Seconds
Subtitle 3.5.6 portable:
Tesseract ( 5 errors ) = 01 Minute : 50 Seconds !!!
Bug in Subtitle 3.5.10 portable:
Bug: Prompt for unknown Words Popup - Change all Button - replaces original eng_OCRFixReplaceList.xml ( 136 KB ) into a eng_OCRFixReplaceList.xml ( 2 KB ) - everything from the original eng_OCRFixReplaceList.xml is lost!
WORKS: Prompt for unknown Words Popup - Add to names/noise list Button - adding words to en_names.xml.
WORKS: Prompt for unknown Words Popup - Add to user dictionary Button - adding words to en_US_user.xml.
WORKS: Prompt for unknown Words Popup - USE ALWAYS - adds words to eng_OCRFixReplaceList_User.xml
Bugs in Subtitle 3.5.6 portable:
Bug: Prompt for unknown Words Popup - Change all Button - replaces original eng_OCRFixReplaceList.xml ( 136 KB ) into a eng_OCRFixReplaceList.xml ( 2 KB ) - everything from the original eng_OCRFixReplaceList.xml is lost!
Bug: Prompt for unknown Words Popup - Add to names/noise list Button - doesn't add words from the unknown Words Popup into en_names.xml, but from the unknown words table list it adds into en_names.xml
Bug: Prompt for unknown Words Popup - Add to user dictionary Button - doesn't add words from the unknown Words Popup into en_US_user.xml, but from the unknown words table list it adds into en_US_user.xml
WORKS: Prompt for unknown Words Popup - USE ALWAYS - adds words to eng_OCRFixReplaceList_User.xml
As you can see the Tesseract in Subtitle 3.5.6 portable is the fastest, OCR takes only 01 Minute : 50 Seconds with 5 errors!!! What is going on with Tesseract 4??? It's about 8 times slower and not more accurate!!!
It would be really great if you can provide a new version with the Tesseract from 3.5.6 with all mentioned bugs fixed. Well most mentioned bugs are fixed in version 3.5.10, the only remaining bug is the:
Prompt for unknown Words Popup - Change all Button - replaces original eng_OCRFixReplaceList.xml ( 136 KB ) into a eng_OCRFixReplaceList.xml ( 2 KB ) - everything from the original eng_OCRFixReplaceList.xml is lost!
I would really appreciate it, thanks!