Closed lxs602 closed 2 years ago
Tesseract packages installed:
dpkg -l | grep tesseract
ii gimagereader 3.2.3-2 amd64
ii libtesseract-dev 4.00~git2288-10f4998a-2 amd64
ii libtesseract4 4.00~git2288-10f4998a-2 amd64
ii tesseract-ocr 4.00~git2288-10f4998a-2 amd64
ii tesseract-ocr-eng 4.00~git24-0e00fe6-1.2 all
ii tesseract-ocr-ita 4.00~git24-0e00fe6-1.2 all
ii tesseract-ocr-osd 4.00~git24-0e00fe6-1.2 all
EDIT: I'm not sure why some of the lines above are strikethrough formatted.
Release 4.00~git2288-10f4998a-2 was published on 2018-04-20.
Perhaps the current stable release is required?
To avoid the strike through, put a line with exactly three backticks (```) before and after your text.
Interestingly, if I start at the very last line, tesseract seems to work.
Also, when this is done, I can then get it to read the second-to-last line.
It seems to work from bottom-to-top this way.
EDIT: I hope this makes sense... I will try to rewrite it if not. See the image below: https://pasteboard.co/IInTR8iw.png
I have just installed the latest ppa (4.1.0+git4239) from ppa:alex-p/tesseract-ocr, with no change.
dpkg -l | grep tesseract:
ii libtesseract-dev:amd64 4.1.0+git4239-6343f0ab-1ppa1~bionic1 amd64
ii libtesseract4:amd64 4.1.0+git4239-6343f0ab-1ppa1~bionic1 amd64
ii tesseract-ocr 4.1.0+git4239-6343f0ab-1ppa1~bionic1 amd64
ii tesseract-ocr-eng 1:4.0.0+git39-6572757-1ppa1~bionic1 all
ii tesseract-ocr-ita 1:4.0.0+git39-6572757-1ppa1~bionic1 all
ii tesseract-ocr-osd 1:4.0.0+git39-6572757-1ppa1~bionic1 all
Is there a way to enable tesseract debug to see the errors, when using it with SubtitleEdit?
I had the same problem. I tried a few things and still couldn't get it to work. I ended up using SE on Windows lol.
It works on wine-staging if you install .NET 4.6.2 using winetricks (https://wiki.winehq.org/Winetricks), but I would rather run it natively than under Wine if I can.
EDIT: Also had to set wine to Windows 2003 using winecfg.
It must be what command SE is passing to Tesseract. Is it possible to enable debug output with SE/Tesseract?
If you OCR only the last line it works (see post 4 above).
I will try compiling Tesseract rather than using the PPA and see what happens.
@lxs602
You might try SE-3.5.11-issue3851.7z from this Dropbox folder. It writes the invoked Tesseract command and some other info to a log file (subtitle-edit.log). Perhaps it can help you to figure out what's going wrong.
Hi, this crashes instead of generating orange blank lines. EDIT: I found the log file... on the Desktop.
What does "this crashes" mean? Does it throw an exception? Is there an exception message?
@lxs602
I've updated SE-3.5.11-issue3851.7z. The new patch is more robust. Log file lines are numbered. If you sort the log file when OCR has finished, the lines associated with the same image will be in sequential order.
Updated SE-3.5.11-issue3851.7z again. Two bugs have been fixed.
@xylographe what did you change in this build?
@xylographe, I'm sorry, my question was stupid. For debug output I should do (?):
MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all" mono SubtitleEdit.exe > SubtitleEdit.log
Thank you for helping though. The patched version crashes and does not give an exception error. Maybe it does not matter, as I can use the command above for debug?
But just so you know, the patched version gives this error using MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all":
[ERROR] FATAL UNHANDLED EXCEPTION: System.InvalidOperationException: Process must exit before requested information can be determined.
  at System.Diagnostics.Process.EnsureState (System.Diagnostics.Process+State state) [0x0008d] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at System.Diagnostics.Process.get_ExitCode () [0x00000] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at (wrapper remoting-invoke-with-check) System.Diagnostics.Process.get_ExitCode()
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractRunner.Run (System.String languageCode, System.String psmMode, System.String engineMode, System.String imageFileName, System.Boolean run302) [0x002c2] in <288f00de052a4d5c9bffcee795eebce7>:0
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractThreadRunner.DoOcr (System.Object j) [0x00034] in <288f00de052a4d5c9bffcee795eebce7>:0
  at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context (System.Object state) <0x7fdaf2f4cf90 + 0x0004b> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f44a80 + 0x0014d> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f449c0 + 0x00041> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () <0x7fdaf2f4cf00 + 0x00046> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ThreadPoolWorkQueue.Dispatch () [0x00074] in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback () <0x7fdaf2f4cdd0 + 0x00018> in <285579f54af44a2ca048dad6be20e190>:0
[ERROR] FATAL UNHANDLED EXCEPTION: System.InvalidOperationException: Process must exit before requested information can be determined.
  at System.Diagnostics.Process.EnsureState (System.Diagnostics.Process+State state) [0x0008d] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at System.Diagnostics.Process.get_ExitCode () [0x00000] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at (wrapper remoting-invoke-with-check) System.Diagnostics.Process.get_ExitCode()
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractRunner.Run (System.String languageCode, System.String psmMode, System.String engineMode, System.String imageFileName, System.Boolean run302) [0x002c2] in <288f00de052a4d5c9bffcee795eebce7>:0
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractThreadRunner.DoOcr (System.Object j) [0x00034] in <288f00de052a4d5c9bffcee795eebce7>:0
  at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context (System.Object state) <0x7fdaf2f4cf90 + 0x0004b> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f44a80 + 0x0014d> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f449c0 + 0x00041> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () <0x7fdaf2f4cf00 + 0x00046> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ThreadPoolWorkQueue.Dispatch () [0x00074] in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback () <0x7fdaf2f4cdd0 + 0x00018> in <285579f54af44a2ca048dad6be20e190>:0
Using the unpatched SubtitleEdit, here are the full debug logs:
MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all" mono SubtitleEdit.exe > SubtitleEdit.log
Blank lines using OCR: https://gofile.io/?c=yQDc1T
Successful OCR (but only if starting at the last line): https://gofile.io/?c=p4HnSu
I am not sure what the problem is. I did notice:
Mono: [0x7fe7bed36700] try unpark worker
Mono: [0x7fe7bed36700] try unpark worker, success? no
Mono: [0x7fe7bed36700] try create worker
Mono: [0x7fe7beb35700] worker starting
Mono: [0x7fe7beb35700] worker executing
Mono: [0x7fe7beb35700] worker running in domain 0x556d3561a080 (outstanding requests 0)
Mono: [0x7fe7bed36700] try create worker, created 0x7fe7beb35700, now = 8091 count = 2
Mono: [0x7fe7bed36700] request worker, created
Mono: AOT: FOUND method System.Random:Next (int) [0x7fe7de294880 - 0x7fe7de294910 0x7fe7de77b296]
Mono: AOT: FOUND method System.Random:Sample () [0x7fe7de294420 - 0x7fe7de294450 0x7fe7de77b26f]
Mono: AOT: FOUND method System.Threading.QueueUserWorkItemCallback:System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () [0x7fe7de34df00 - 0x7fe7de34df80 0x7fe7de77febd]
Mono: [0x7fe7beb35700] worker parking
Mono: AOT: FOUND method System.Threading.Timer/Scheduler:TimerCB (object) [0x7fe7de355a40 - 0x7fe7de355b50 0x7fe7de7801ed]
Mono: AOT NOT FOUND: (wrapper remoting-invoke-with-check) System.Threading.Timer:Dispose ().
Mono: AOT: FOUND method System.Threading.Timer:Dispose () [0x7fe7de354a10 - 0x7fe7de354a50 0x7fe7de780140]
Mono: event_create: creating Event handle
Mono: mono_w32handle_new: create Event handle 0x556d3560a538
Mono: mono_w32handle_ref_core: ref Event handle 0x556d3560a538, ref: 1 -> 2
Mono: event_handle_create: created Event handle 0x556d3560a538
Mono: mono_w32handle_unref_core: unref Event handle 0x556d3560a538, ref: 2 -> 1 destroy: false
Mono: AOT: FOUND method System.Threading.ExecutionContext:IsFlowSuppressed () [0x7fe7de346660 - 0x7fe7de3466f0 0x7fe7de77fc20]
Mono: AOT: FOUND method System.Threading.ExecutionContext:Capture () [0x7fe7de3466f0 - 0x7fe7de346730 0x7fe7de77fc24]
Mono: AOT: FOUND method System.Runtime.InteropServices.GCHandle:Alloc (object) [0x7fe7de4892c0 - 0x7fe7de489300 0x7fe7de7896f7]
Mono: AOT NOT FOUND: (wrapper managed-to-native) System.Threading.ThreadPool:NotifyWorkItemComplete ().
Mono: AOT: FOUND method System.Runtime.InteropServices.GCHandle:op_Explicit (intptr) [0x7fe7de4893c0 - 0x7fe7de489450 0x7fe7de7896fb]
Mono: AOT NOT FOUND: (wrapper managed-to-native) System.Runtime.InteropServices.GCHandle:CheckCurrentDomain (int).
Mono: [0x7fe7bed36700] hill climbing, change max number of threads 4
Mono: [0x7fe7bed36700] worker parking
Mono: AOT: FOUND method System.Threading.ExecutionContext:Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object) [0x7fe7de345930 - 0x7fe7de3459c0 0x7fe7de77fbdc]
Mono: AOT: FOUND method System.Threading.ExecutionContext:OnAsyncLocalContextChanged (System.Threading.ExecutionContext,System.Threading.ExecutionContext) [0x7fe7de6e2080 - 0x7fe7de6e2431 0x7fe7de77fb9b]
Mono: AOT: FOUND method System.Delegate:DynamicInvoke (object[]) [0x7fe7de313b00 - 0x7fe7de313b40 0x7fe7de77e659]
Mono: AOT: FOUND method System.MulticastDelegate:DynamicInvokeImpl (object[]) [0x7fe7de318c60 - 0x7fe7de318d10 0x7fe7de77e917]
Mono: mono_w32handle_ref_core: ref Event handle 0x556d3560a538, ref: 1 -> 2
Mono: ves_icall_System_Threading_Events_SetEvent_internal: setting Event handle 0x556d3560a538
Mono: mono_w32handle_unref_core: unref Event handle 0x556d3560a538, ref: 2 -> 1 destroy: false
Mono: AOT: FOUND method System.Runtime.InteropServices.GCHandle:Free () [0x7fe7de489350 - 0x7fe7de4893b0 0x7fe7de7896f9]
Mono: AOT NOT FOUND: (wrapper managed-to-native) System.Runtime.InteropServices.GCHandle:FreeHandle (int).
Mono: DllImport searching in: 'libX11.so.6' ('libX11.so.6').
Mono: Searching for 'Xutf8ResetIC'.
Mono: Probing 'Xutf8ResetIC'.
Mono: Found as 'Xutf8ResetIC'.
Mono: DllImport searching in: 'libX11.so.6' ('libX11.so.6').
Mono: Searching for 'XUnsetICFocus'.
Mono: Probing 'XUnsetICFocus'.
Mono: Found as 'XUnsetICFocus'.
Thank you, @lxs602
The first DEBUG output (running the patched SE) is rather informative. It contains a stack trace that should be read in reverse order (from bottom to top). Note that ImageJob is a class that contains all information to convert an image to text via tesseract(1).
The interesting part of the stack trace starts with
at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractThreadRunner.DoOcr (System.Object j)
Meaning that a previously scheduled ImageJob (ThreadPool.QueueUserWorkItem(DoOcr, job) in TesseractThreadRunner.cs) is being started.
at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractRunner.Run (System.String languageCode, System.String psmMode, System.String engineMode, System.String imageFileName, System.Boolean run302)
The Run() method of a new TesseractRunner instance is invoked. This is where the actual conversion will be performed.
at System.Diagnostics.Process.get_ExitCode ()
Tesseract(1) has been run (its invocation should be logged in subtitle-edit.log). We have reached line 90 (TesseractRunner.cs) right after process.WaitForExit(8000) finished. Line 90 logs the Process.HasExited and Process.ExitCode properties, which should tell us if running tesseract(1) succeeded or failed.
at System.Diagnostics.Process.EnsureState (System.Diagnostics.Process+State state)
Check that the Process has exited, otherwise there won't be an ExitCode.
FATAL UNHANDLED EXCEPTION: System.InvalidOperationException: Process must exit before requested information can be determined.
Apparently, the process did not yet terminate, therefore accessing the ExitCode property throws an exception, but because it is UNHANDLED, no message box is shown. Under Windows this non-UI exception would be appended to the Windows event log, and the system default handler would report the exception to the user before terminating the application.
First conclusion, I made a mistake. Instead of
Log(id, $"hasexited=|{process.HasExited}| exitcode=|{process.ExitCode}|");
I should have written
if (process.HasExited)
{
Log(id, $"hasexited=|{process.HasExited}| exitcode=|{process.ExitCode}|");
}
else
{
Log(id, $"hasexited=|{process.HasExited}|");
}
Second (much more important) conclusion, the OCR failure (empty text lines) is most likely caused by tesseract(1) taking more than 8000 milliseconds to finish. A possible quick solution would be to increase the maximum wait time (15 seconds perhaps).
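The same guard can be tried from a shell. This is only a minimal sketch, not SE's code: it assumes GNU coreutils `timeout` is installed, the helper name `run_with_limit` is hypothetical, and the 15-second figure in the example is just the value suggested above. `timeout` exits with status 124 when the limit is hit, so an exit code is reported only when the command really finished.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a time limit and report an exit
# code only if the command finished on its own. GNU coreutils `timeout`
# returns 124 when it had to kill the command (a command that itself exits
# with 124 would be misreported; acceptable for a sketch).
run_with_limit() {
    limit=$1; shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hasexited=|False|"
    else
        echo "hasexited=|True| exitcode=|$status|"
    fi
}

# Example (assumes tesseract is installed and some-image.png exists):
# run_with_limit 15 tesseract some-image.png out -l eng --oem 3 hocr
```

If the second conclusion is right, a run that prints `hasexited=|False|` with a short limit should print an exit code once the limit is raised.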
I updated SE-3.5.11-issue3851.7z again. Its source code is in the xg/issue3851 branch.
The log is written to the subtitle-edit.log file (located on the Desktop). Please check whether increasing the Process.WaitForExit() time delay solves the Tesseract OCR problem.
Should you want to inspect subtitle-edit.log, remember to sort the file first (e.g. sort subtitle-edit.log >subtitle-edit.sorted.log).
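As a rough sketch of why sorting helps: judging by the excerpts in this thread, each log line starts with a zero-padded image number followed by a zero-padded line counter, so a plain byte-wise sort puts all lines for one image next to each other. The helper names below are hypothetical, and the two-field layout is an assumption based on those excerpts.

```shell
#!/bin/sh
# Sort the log byte-wise so lines sharing the same leading image number
# end up adjacent and in counter order.
sort_se_log() {
    LC_ALL=C sort "$1"
}

# Show only the lines that belong to one image number, e.g. 0001.
lines_for_image() {
    sort_se_log "$1" | grep -- "^$2 "
}

# Example:
# lines_for_image subtitle-edit.log 0001
```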
Hi, no fix, unfortunately... logs are below. Error seems to be:
0077 000461 [parsing out-file] "<span class='ocr_line'" or "<span class='ocr_header'" not found
Subtitle-edit.log https://gofile.io/?c=vnhNdu
Mono debug: https://gofile.io/?c=xutyXY
From SE log:
0001 000001 [Tesseract OCR] image-file=|/tmp/52a9e267-5701-4a04-947e-e2ff904d77b8.png|
0001 000002 cwd=|| file=|tesseract| args=|"/tmp/52a9e267-5701-4a04-947e-e2ff904d77b8.png" "/tmp/d02e91a6-a86e-46e2-8d88-9d1c7588dfe4" -l eng --oem 3 hocr|
Start of first ImageJob. From MONO log:
Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/52a9e267-5701-4a04-947e-e2ff904d77b8.png" "/tmp/d02e91a6-a86e-46e2-8d88-9d1c7588dfe4" -l eng --oem 3 hocr]
Confirmed by mono. From SE log:
0001 000115 hasexited=|False|
Same problem as before: tesseract did not exit yet (after 30 seconds). Apparently, tesseract is blocking. As I don't know exactly why, I made several changes in TesseractRunner.Run(), that may solve this problem. I also noticed too many (57) invocations of tesseract running in parallel.
Because the updated SE-3.5.11-issue3851.7z is rather slow, do not convert a complete subtitle, only the last 7-10 lines. In the worst case scenario that could still take 5-6 minutes.
Excellent, it worked. OCR was rapid, so no need for a 30 second timeout.
Mono log: https://gofile.io/?c=fIOeNw
Subtitle-edit log: https://gofile.io/?c=wVZOwq
The only thing now though is when using 'Prompt for unknown words', spell check has lost some options... it no longer allows Adding to Name List / Skip One / etc. https://gofile.io/?c=QxRswU
In this update tesseract concurrency has been restored, albeit with a limit of 11 concurrent invocations to prevent unpleasant surprises. When testing, please, disable ‘Prompt for unknown words’.
‘Spell check’ dialogues (word-only/whole-text) when running on Windows:
Hi, Tesseract exited with a timeout error, unfortunately. Testing was with 'Prompt for Unknown Words' disabled.
I noticed that SubtitleEdit was very slow after upgrading to mono 6.6.0.161. I have reverted to mono 6.4.0-198, and then 6.6.0, which were not slow.
Mono 6.4.0-198 Mono log: https://gofile.io/?c=E5S6tO
SubtitleEdit log: https://gofile.io/?c=P8eo1J
Mono 6.6.0 Mono log: https://gofile.io/?c=eUZluO
SubtitleEdit log: https://gofile.io/?c=WcwKkA
Mono 6.6.0-161 Mono log: https://gofile.io/?c=tZaYHU
@lxs602 how did you revert back to Mono 6.6.0? I'm currently stuck at 6.7 because the official mono repo doesn't offer older versions.
Excellent! We have now narrowed it down to two possible causes. I'm hoping this update will yield the desired result. If not, there is only one option left.
@NickZ
You can install specific snapshots from the Mono repository. On Ubuntu/Debian, you would edit /etc/apt/sources.list.d/mono-<name-may-vary>.list
and change the name of your distribution, and specify the directory for the version in the repo, e.g. from:
deb http://download.mono-project.com/repo/ubuntu bionic main
to:
deb http://download.mono-project.com/repo/ubuntu bionic/snapshots/6.6.0 main
Then delete (and purge) all currently installed mono files, run apt-get update, and reinstall. I followed this guide here: https://stackoverflow.com/questions/33763177/install-older-version-of-mono.
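The edit to the repository line could also be scripted. A minimal sketch, assuming the line has the same shape as the example above; `pin_mono_snapshot` is a hypothetical helper that only rewrites text on stdin, so nothing is changed on disk until you redirect the output yourself.

```shell
#!/bin/sh
# Rewrite a Mono repository line so that it points at a snapshot directory.
# $1 = snapshot version (e.g. 6.6.0); sources.list lines are read from stdin.
# An existing /snapshots/... suffix is replaced rather than duplicated.
pin_mono_snapshot() {
    sed -E "s#^(deb .*download\.mono-project\.com/repo/ubuntu) ([a-z-]+)(/snapshots/[^ ]+)? main#\1 \2/snapshots/$1 main#"
}

# Example (the file name varies between systems):
# pin_mono_snapshot 6.6.0 < mono.list > mono.list.new
```

Lines that do not match the pattern pass through unchanged, so the output can be compared against the original before replacing it.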
@xylographe
Out of interest, what are they?
@lxs602 A change of strategy, as we're almost at the finish. I have uploaded three archives at once:
Try SE-3.5.11-issue3851C.7z first. If it works, it will become the final solution (after removing logging, of course).
Try SE-3.5.11-issue3851B.7z only if SE-3.5.11-issue3851C.7z didn't work. The only difference is the way tesseract's standard error is processed (in [B] it's sent to /dev/null).
Try SE-3.5.11-issue3851A.7z only if SE-3.5.11-issue3851B.7z didn't work. The difference is a Process.Refresh(), which shouldn't be necessary, but you never know.
The cause of the Tesseract failure on Ubuntu is that tesseract gets blocked when several instances are run concurrently. This might indicate a deadlock while acquiring resources. The short-term solution is to start tesseract instances sequentially.
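To reproduce the difference outside SE, the images can be fed to tesseract with a cap on how many instances run at once. This is a sketch under assumptions, not what SE does internally: `ocr_batch` and the `OCR_CMD` override are hypothetical, `xargs -P` is a GNU/BSD extension, and file names are assumed to contain no whitespace (SE's temporary files are GUID-named). The tesseract arguments are the ones shown in the SE log.

```shell
#!/bin/sh
# Run one OCR command per image, with at most $1 running at a time
# (1 = strictly sequential). OCR_CMD is a hypothetical override so the
# sketch can be exercised without tesseract; it must be exported.
ocr_batch() {
    max_jobs=$1; shift
    printf '%s\n' "$@" |
        xargs -n 1 -P "$max_jobs" \
            sh -c '"${OCR_CMD:-tesseract}" "$1" "${1%.png}" -l eng --oem 3 hocr' _
}

# Example: sequential, then up to 11 at once (the limit mentioned earlier).
# ocr_batch 1 /tmp/*.png
# ocr_batch 11 /tmp/*.png
```

If the blocking only appears with the higher limit, that would point at concurrency rather than at the individual images.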
Hi, 'C' was successful, but very slow. There was no noticeable difference between 'A' and 'B'.
Log files for each of the three versions are below: https://gofile.io/?c=HzJxXr
I'm not sure there was much difference in speed from when tesseract was running in single instance though... it may even have seemed a little slower.
Do you still have the previous version to test against, from the comment below?: https://github.com/SubtitleEdit/subtitleedit/issues/3851#issuecomment-562847141
Do you still have the previous version to test against, …
Sort of. :) It is either SE-3.5.11-issue3851D.7z or SE-3.5.11-issue3851E.7z.
I assume you are asking because it was the fastest of the lot:
(D or E) 208 (A) 759 (B) 283 (C) 1816 [averages in milliseconds].
The difference between B and C is understandable: C processes tesseract's stderr (socket), while B discards (/dev/null) all output from tesseract. The difference between A and B is inexplicable: the tiny modification in the code happens after the end time has been determined. In both A and B the start time is set immediately before Process.Start(), the end time is set immediately after Process.WaitForExit(), and the code in between is exactly the same.
Hi, I tested all five versions using the same mkv file. There was less noticeable difference between them this time without other programs running in the background. https://gofile.io/?c=NUE0It
I then tried another, longer, mkv file, which generated an 'out of memory' error in all five versions (the log files are large): https://gofile.io/?c=BCCHcs
Is there any further way to adjust how Tesseract is invoked, so that multiple instances (A/B) complete OCR more quickly than a single instance (D/E)?
If there are any other debug techniques I can do to help, let me know. I am aware of monodevelop, and others, though not being a programmer I have not used them.
Running tesseract successfully: (from Mono log)
Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/e9e9ba1a-a075-44a8-8cca-0e97e7c259fe.png" "/tmp/66ff7f55-2455-4252-9ac9-39434b046751" -l eng --oem 3 hocr]
Mono: mono_w32handle_new: create Process handle 0x56381d73e548
Mono: mono_w32handle_ref_core: ref Process handle 0x56381d73e548, ref: 1 -> 2
Mono: mono_w32handle_ref_core: ref Process handle 0x56381d73e548, ref: 2 -> 3
Mono: mono_w32handle_unref_core: unref Process handle 0x56381d73e548, ref: 3 -> 2 destroy: false
Mono: process_create: returning handle 0x56381d73e548 for pid 9958
A new process has been created, pid 9958 is the fork() return value in the parent.
Mono: process_wait (0x56381d73e548, 29733): PID: 9958
Mono: process_wait (0x56381d73e548, 29733): waiting on semaphore for 29733 ms...
Mono: process_wait (0x56381d73e548, 29733): Waited successfully
Mono: process_wait (0x56381d73e548, 29733): Setting pid 9958 signalled, exit status 0
Tesseract has finished successfully.
Mono: mono_w32handle_unref_core: unref Process handle 0x56381d73e548, ref: 1 -> 0 destroy: true
Mono: w32handle_destroy: destroy Process handle 0x56381d73e548
Mono: process_close
Mono: processes_cleanup
Mono: processes_cleanup done
At this point all resources used for this process should have been released.
When things go wrong:
Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/f20151bd-5d65-4935-bc89-97f952c7857b.png" "/tmp/6a473ef4-578d-4c7e-b318-345877a5d385" -l eng --oem 3 hocr]
Mono: process_create: returning handle (nil) for pid -1
The fork() return value is minus one with errno set to ENOMEM (or perhaps EAGAIN). This could happen, for example, if too many zombie children are waiting to be reaped. However, it seems rather unlikely that such a far-reaching bug would have gone unnoticed for such a long time. But, whatever the cause, I'm afraid there's nothing SE can do to avoid it. Nonetheless, I'll try to limit OCR memory use in SE as much as possible in SE-3.5.11-issue3851F.7z.
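To see how often this happens in a given Mono log, the two message shapes quoted above can simply be counted. A small sketch; the function names are hypothetical and the match strings are copied from the log excerpts in this thread.

```shell
#!/bin/sh
# Count tesseract launches recorded in a Mono debug log.
count_spawns() {
    grep -cF 'process_create: Exec prog [/usr/bin/tesseract]' "$1"
}

# Count launches where process creation failed (handle (nil), pid -1).
count_failed_spawns() {
    grep -cF 'process_create: returning handle (nil) for pid -1' "$1"
}

# Example:
# echo "$(count_failed_spawns SubtitleEdit.log) of $(count_spawns SubtitleEdit.log) launches failed"
```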
I created a large (9545 images) VobSub and processed it with SE-3.5.11-issue3851F. It took an awfully long time on Windows (almost 500ms per image, compared to less than 200ms on Linux), but eventually it did finish without running out of memory, or other exceptions.
This time try running mono without MONO_LOG_LEVEL and MONO_LOG_MASK, and after SE has started, immediately select "Import from Matroska file".
@lxs602 have you been able to successfully choose "Tesseract only (can do italics)"? I have not been able to on Linux. Which version of Tesseract are you using? The one from the PPA?
@NickZ, I recently reinstalled Ubuntu after breaking some settings, and upgraded to Ubuntu 19.10. I had Tesseract from the PPA on Ubuntu 18.04, but I am now using the same version as from the PPA, but present in the normal repository on 19.10.
Can you choose "Tesseract 4" instead of "Binary image compare", under OCR method?
What Linux distribution are you using?
A screenshot of the SubtitleEdit running on my computer is below: https://gofile.io/?c=FJIhmF
@xylographe, I have tried the new version which ran fine. I suppose that is all for invoking Tesseract and for managing memory? https://gofile.io/?c=NjUHrU
There was a small error on trying to close SE. It then closes on the second attempt without an error message. I ran SE again with mono debugging to capture the error (see also the screenshots): https://gofile.io/?c=VXEf3v
@lxs602 The error (exception) is caused by a missing assembly (System.Web.Services).
From a previous Mono log:
Mono: [...] looking for System.Web.Services, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Mono: Assembly Loader probing location: '/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll'.
Mono: Image addref System.Web.Services[0x562fa4d16610] (asmctx DEFAULT) -> /usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll[0x562fd962f0c0]: 2
Mono: Prepared to set up assembly 'System.Web.Services' (/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll)
Mono: Assembly System.Web.Services[0x562fa4d16610] added to domain SubtitleEdit.exe, ref_count=1
Mono: Assembly Loader loaded assembly from location: '/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll'.
The System.Web.Services assembly was found in the GAC.
From the last Mono log:
Mono: Assembly Loader probing location: '/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll'.
Mono: The following assembly [...] could not be loaded:
Assembly: System.Web.Services (assemblyref_index=3)
Version: 4.0.0.0
Public Key: b03f5f7f11d50a3a
The assembly was not found in the Global Assembly Cache, a path listed in the MONO_PATH environment variable, or in the location of the executing assembly.
File /usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll no longer exists. Perhaps it was removed while upgrading/downgrading Mono?
I have uploaded SE-3.5.11-issue3851H.7z. I expect this (minus the logging, of course) to be the final version. In case this version doesn't work on Ubuntu, I provided fallback version SE-3.5.11-issue3851G.
Merry Christmas, Waldi
Hi,
/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll no longer exists. Perhaps, it was removed while upgrading/downgrading Mono?
Sorry, I hadn't installed libmono-system-web-services4.0-cil when I installed Ubuntu again. It was installed as a dependency of mono-devel.
I tested both new versions, using G and then H with Film2, for comparison. https://gofile.io/?c=glZheG
Both G and H performed well.
I then used H with Film1 while the system was under heavy load.
On H, Tesseract struggled a bit under a loaded system, and gave a few timeout errors. Perhaps you would want to prompt an error dialogue, to say (something like): 'OCR complete, tesseract timed out on lines x,y and z, try running as a higher priority or switching off other applications'?
With the 'Prompt for Unknown Words' option, I found that pressing 'Edit whole text', and then 'Edit word only', brought back all the normal options.
Thank you for helping out and for your time.
Happy new year. L
New version without logging in SE-3.5.11-issue3851.7z.
@xylographe thanks for the update; this seems to be working a lot better now. I finally fixed the problem with running on "Tesseract only", which was a problem with Tesseract itself. However, when selecting "Tesseract only (can detect italics)", it seems to end up skipping a lot of lines. Rerunning it on those lines fixes those, so I think that concurrency may be to blame? Is there a way to turn off concurrency when Tesseract only is selected?
Hi, I just tried each of the four Engine Modes. I had left it with 'Default' so far.
Character recognition worked the same on 'Neural Nets LSTM only' as with 'Default, based on what is available'. However, only blank orange lines were produced on 'Original Tesseract Only' and on 'Tesseract + LSTM'.
This leads me to believe that 'Default' is using only 'Neural Nets LSTM', especially since both gave the message 'Invalid Resolution 0 dpi'. Conversely, the original Tesseract engine seems not to be working, either by itself or with LSTM.
Again out of interest, what are Neural Nets and LSTM? When I searched for them I retrieved information on Deep Learning. Is this an alternative implementation of Tesseract, or something else?
I have attached a few logs again using H. https://gofile.io/?c=5Wp72J
@NickZ, why don't you upload some error logs?
@NickZ SE-3.5.11-issue3851 never invokes tesseract more than once.
@lxs602 To use Original Tesseract you need traineddata from the ‘legacy’ set. See this comment and this comment.
@lxs602 I've examined the logs. Because you are using traineddata from the tessdata-fast set (Ubuntu default set) only LSTM can work. Hence, Default must choose LSTM as nothing else is available. When forced to use Tesseract only (or Tesseract+LSTM), no output file is generated.
Did you get the 'Invalid Resolution 0 dpi' warning for each image?
I know next to nothing about neural networks, but apparently Neural Networks and Deep Learning provides a good general introduction. More specific info about neural net OCR is available in Using Neural Networks for Optical Character Recognition.
Hi,
I found a few articles suitable for lay readers, such as below.
In summary, 'LSTM and Neural Networks' is the new engine, and 'Original Tesseract' is the legacy engine.
https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/
https://open.live.bbc.co.uk/mediaselector/6/redir/version/2.0/mediaset/audio-nondrm-download/proto/https/vpid/p07sy4sz.mp3
I got 'Invalid Resolution 0 dpi' only on lines 1 and 2.
I have uploaded the file for the video below if it helps. https://gofile.io/?c=3XNUOV
@lxs602 I've only ever been able to use LSTM via Mono (or the "based on what's available" setting, which as you said looks like it's LSTM). Reviewing this thread, it sounded like earlier you were actually using Tesseract though.
Is that the case, or were you using only LTSM the entire time?
If you did get it working, would you be able to summarize the steps you took? I've tried installing the dependencies/tesseract/etc. and no joy.
(Also, re: the OCR going backwards/jumping around, I wonder if some of the improvements in the neural network translations are because it tries to use the context of other lines to improve its recognition.)
@cecoates, Hi, I was using 'Default' for all the comments above (unless specified otherwise), which appears to have been using only LSTM.
I have just tried Tesseract legacy, using the traineddata files from the tessdata repository (which it apparently needs), and it seems to work well.
I used the steps below, with help from the comment above, for the English language:
cd /usr/share/tesseract-ocr/4.00/tessdata/
sudo mkdir fastdata_backup
sudo mv *.traineddata fastdata_backup
sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata
I must be doing something wrong/missing an important step. Now even just neural net isn't working for me.
I installed these packages: https://www.nikse.dk/SubtitleEdit/Help#linux
Then downloaded the portable and beta Subtitle Edit: https://github.com/SubtitleEdit/subtitleedit/releases
Followed the steps you mentioned:
cd /usr/share/tesseract-ocr/4.00/tessdata/
sudo mkdir fastdata_backup
sudo mv *.traineddata fastdata_backup
sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata
Extracted the Subtitle Edit zips, and used mono to run them:
mono SubtitleEdit.exe
Import an MKV/PGS/.sup file.
Change OCR method to Tesseract 4.
Language: English.
Original Tesseract Only.
Start OCR.
Then I get hit with the error/popup
Tesseract returned with code 1
And it pops up twice.
Then I see nothing but blank orange lines.
However, originally for me the Neural Nets method worked, but now that causes "Tesseract returned with code 1" errors too:
Attaching log from using:
MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all" mono SubtitleEdit.exe > SubtitleEdit.log
Have I missed an obvious step somewhere?
(Also, love the program overall and it works perfect for me in Windows, I just prefer Linux.)
@cecoates From your Mono log file,
Mono: process_wait (0x563ca53d26f8, 7990): PID: 4200
Mono: process_wait (0x563ca53d26f8, 7990): waiting on semaphore for 7990 ms...
Mono: process_wait (0x563ca53d26f8, 7990): Waited successfully
Mono: process_wait (0x563ca53d26f8, 7990): Setting pid 4200 signalled, exit status 1
Apparently, tesseract has terminated abnormally (signalled). In fact, this happens every time tesseract is run by SE — each "Waited successfully" is followed by "signalled, exit status 1".
Try running tesseract from the shell (bash/zsh) like this,
/usr/bin/tesseract some-image.png tess-result -l eng --oem 0 hocr
Where some-image.png is the input file you need to provide (if necessary, you can use Export > BDN xml/png in the OCR/Import dialogue), and the result (when tesseract terminates normally) will be written to tess-result.hocr. Engine mode 0 is the legacy engine (original tesseract in SE).
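A small wrapper makes the result of that manual test easier to read. This is a hedged sketch: `try_ocr` and `hocr_has_text` are hypothetical names, and the check for `<span class='ocr_line'` mirrors the string SE's log reported as missing earlier in this thread.

```shell
#!/bin/sh
# Succeeds if the hOCR file contains at least one recognised text line.
hocr_has_text() {
    grep -q "<span class='ocr_line'" "$1"
}

# Run tesseract as shown above and report both the exit status and whether
# any text line was produced. $1 = image, $2 = engine mode (default 0).
try_ocr() {
    image=$1; oem=${2:-0}
    status=0
    tesseract "$image" tess-result -l eng --oem "$oem" hocr || status=$?
    echo "tesseract exit status: $status"
    [ "$status" -eq 0 ] && hocr_has_text tess-result.hocr
}

# Example (assumes tesseract is installed and some-image.png exists):
# try_ocr some-image.png 0 && echo "legacy engine produced text"
```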
@xylographe off topic, but have you been able to run a debugger for SE on linux? I've been trying to get a debugger to work for this snap packaging thing that I've been working on and none of the extensions that support remote mono debugging (like VSMonoDebugger) for visual studio seem to work
Never mind, I figured it out.
For reference:
I used Visual Studio, VSMonoDebugger, and a VM running Ubuntu.
First, use the deploy option on VSMonoDebugger to put it onto the Ubuntu VM (configure the IP address first); this also generates the required mdb files.
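Then, on the Ubuntu VM, run mono --debugger-agent=address=0.0.0.0:11000,transport=dt_socket,server=y --debug=mdb-optimizations SubtitleEdit.exe (the runtime options must come before the assembly name, otherwise they are passed to SubtitleEdit.exe instead of to mono).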
Then, use the attach option on VSMonoDebugger.
@lxs602 I believe that using this snap package that I created will resolve the issues that you're having with Tesseract: #3952 Can you test it out and report back?
@cecoates From your Mono log file,
Mono: process_wait (0x563ca53d26f8, 7990): PID: 4200
Mono: process_wait (0x563ca53d26f8, 7990): waiting on semaphore for 7990 ms...
Mono: process_wait (0x563ca53d26f8, 7990): Waited successfully
Mono: process_wait (0x563ca53d26f8, 7990): Setting pid 4200 signalled, exit status 1
Apparently, tesseract has terminated abnormally (signalled). In fact, this happens every time tesseract is run by SE — each "Waited successfully" is followed by "signalled, exit status 1".
Try running tesseract from the shell (bash/zsh) like this,
/usr/bin/tesseract some-image.png tess-result -l eng --oem 0 hocr
Where some-image.png is the input file you need to provide (if necessary, you can use Export > BDN xml/png in the OCR/Import dialogue), and the result (when tesseract terminates normally) will be written to tess-result.hocr. Engine mode 0 is the legacy engine (original tesseract in SE).
Argh. I did two things wrong.
1) I didn't set the "TESSDATA_PREFIX environment variable".
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
2) For some reason, when I used wget the download got cut off at ~60 kB.
I manually downloaded:
https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata
Then
cd ~/Downloads && sudo mv *.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
Now it's working. Thanks @xylographe and @lxs602
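A ~60 kB result is consistent with wget having saved GitHub's web page for the file rather than the file itself: `/blob/` URLs point at the HTML viewer, while the matching `/raw/` URLs serve the actual content. A minimal sketch with hypothetical helper names; it only converts the URL and checks whether a downloaded file looks like HTML.

```shell
#!/bin/sh
# Turn a GitHub "blob" (viewer) URL into the matching "raw" (content) URL.
blob_to_raw() {
    printf '%s\n' "$1" |
        sed 's#^\(https://github\.com/[^/]*/[^/]*\)/blob/#\1/raw/#'
}

# Succeeds if the first bytes of a file look like an HTML page, which a
# real .traineddata file should not.
looks_like_html() {
    head -c 512 "$1" | LC_ALL=C grep -qiE '<!doctype html|<html'
}

# Example:
# url=$(blob_to_raw https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata)
# wget "$url" && looks_like_html eng.traineddata && echo "got a web page, not the model"
```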
Because I'm trying to wrap my head around it, this is... Tesseract 4 using the Tesseract 3 data?
AFAIK, /usr/share/tesseract-ocr/4.00/tessdata is the standard location on Ubuntu. SE will also (when running on Linux) use that directory, if it exists. Hence, setting TESSDATA_PREFIX should not be necessary. Note that Tesseract will use the value of TESSDATA_PREFIX, but SE does not!
Corrupted *.traineddata files, OTOH, are probably a good reason for a tesseract crash. :)
Hi,
I have tried using Subtitle Edit 3.5.11 and 3.5.11 Beta, on Ubuntu 18.04 amd64.
When using Tesseract 4 on a Matroska file, no text is detected and only blank orange lines are produced.
I can upload debug files, or patch, or supply the video files I used if that helps.
Thanks.