SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

Tesseract OCR fails on Ubuntu 18.04 #3851

Closed lxs602 closed 2 years ago

lxs602 commented 4 years ago

Hi,

I have tried using Subtitleedit 3.5.11 and 3.5.11 Beta, on Ubuntu 18.04 amd64.

When using Tesseract 4 on a Matroska file, no text is detected and only blank orange lines are produced.

I can upload debug files, or patch, or supply the video files I used if that helps.

Thanks.

lxs602 commented 4 years ago

Tesseract packages installed:

dpkg -l | grep tesseract

ii gimagereader 3.2.3-2 amd64
ii libtesseract-dev 4.00~git2288-10f4998a-2 amd64
ii libtesseract4 4.00~git2288-10f4998a-2 amd64
ii tesseract-ocr 4.00~git2288-10f4998a-2 amd64
ii tesseract-ocr-eng 4.00~git24-0e00fe6-1.2 all
ii tesseract-ocr-ita 4.00~git24-0e00fe6-1.2 all
ii tesseract-ocr-osd 4.00~git24-0e00fe6-1.2 all

EDIT: I'm not sure why some of the lines above are strikethrough formatted.

xylographe commented 4 years ago

Release 4.00~git2288-10f4998a-2 was published on 2018-04-20. Perhaps the current stable release is required?

To avoid the strike through, put a line with exactly three backticks (```) before and after your text.

lxs602 commented 4 years ago

Interestingly, if I start at the very last line, tesseract seems to work.

Also, when this is done, I can then get it to read the second-to-last line.

It seems to work from bottom-to-top this way.

EDIT: I hope this makes sense... I will try to rewrite it if not. See the image below: https://pasteboard.co/IInTR8iw.png

lxs602 commented 4 years ago

I have just installed the latest ppa (4.1.0+git4239) from ppa:alex-p/tesseract-ocr, with no change.

dpkg -l | grep tesseract:

ii libtesseract-dev:amd64 4.1.0+git4239-6343f0ab-1ppa1~bionic1 amd64
ii libtesseract4:amd64 4.1.0+git4239-6343f0ab-1ppa1~bionic1 amd64
ii tesseract-ocr 4.1.0+git4239-6343f0ab-1ppa1~bionic1 amd64
ii tesseract-ocr-eng 1:4.0.0+git39-6572757-1ppa1~bionic1 all
ii tesseract-ocr-ita 1:4.0.0+git39-6572757-1ppa1~bionic1 all
ii tesseract-ocr-osd 1:4.0.0+git39-6572757-1ppa1~bionic1 all

lxs602 commented 4 years ago

Is there a way to enable tesseract debug to see the errors, when using it with SubtitleEdit?

dausruddin commented 4 years ago

I had the same problem. Tried a few things and still couldn't get it to work. I ended up using SE on Windows lol.

lxs602 commented 4 years ago

It works on wine-staging if you install .NET Framework 4.6.2 using winetricks (https://wiki.winehq.org/Winetricks), but I would rather use it natively than on wine if I can.

EDIT: Also had to set wine to Windows 2003 using winecfg.

lxs602 commented 4 years ago

It must be what command SE is passing to Tesseract. Is it possible to enable debug output with SE/Tesseract?

If you OCR only the last line it works (see post 4 above).

I will try compiling Tesseract rather than using the PPA and see what happens.

xylographe commented 4 years ago

@lxs602 You might try SE-3.5.11-issue3851.7z from this DropBox folder. It writes the invoked Tesseract command and some other info to a log file (subtitle-edit.log). Perhaps, it can help you to figure out what's going wrong.

lxs602 commented 4 years ago

Hi, this crashes instead of generating orange blank lines. EDIT: I found the log file... on the Desktop.

xylographe commented 4 years ago

What does "this crashes" mean? Does it throw an exception? Is there an exception message?

xylographe commented 4 years ago

@lxs602 I've updated SE-3.5.11-issue3851.7z. New patch is more robust. Log file lines are numbered. If you sort the log file when OCR has finished, then the lines associated with the same image will be in sequential order.

xylographe commented 4 years ago

Updated SE-3.5.11-issue3851.7z again. Two bugs have been fixed.

NickZ commented 4 years ago

@xylographe what did you change in this build?

lxs602 commented 4 years ago

@xylographe, I'm sorry, my question was stupid. For debug output I should do (?):

MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all" mono SubtitleEdit.exe > SubtitleEdit.log

Thank you for helping though. The patched version crashes and does not give an exception error. Maybe it does not matter, as I can use the command above for debug?
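Note that the `>` redirect in the command above captures only standard output. Unhandled-exception traces are normally written to standard error, so they may not end up in the file. A minimal shell sketch that sends both streams to one log file (the helper name `run_logged` is made up for illustration, it is not part of Mono or SubtitleEdit):

```shell
# Run a command and write both stdout and stderr to the same log file.
# run_logged is a hypothetical helper name used only in this sketch.
run_logged() {
    log_file=$1
    shift
    "$@" >"$log_file" 2>&1
}

# Example (not executed here):
#   run_logged SubtitleEdit.log env MONO_LOG_LEVEL=debug MONO_LOG_MASK=all mono SubtitleEdit.exe
```

The helper returns the exit status of the wrapped command, so it can still be checked afterwards with `$?`.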

But just so you know, the patched version gives this error using MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all":

[ERROR] FATAL UNHANDLED EXCEPTION: System.InvalidOperationException: Process must exit before requested information can be determined.
  at System.Diagnostics.Process.EnsureState (System.Diagnostics.Process+State state) [0x0008d] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at System.Diagnostics.Process.get_ExitCode () [0x00000] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at (wrapper remoting-invoke-with-check) System.Diagnostics.Process.get_ExitCode()
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractRunner.Run (System.String languageCode, System.String psmMode, System.String engineMode, System.String imageFileName, System.Boolean run302) [0x002c2] in <288f00de052a4d5c9bffcee795eebce7>:0
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractThreadRunner.DoOcr (System.Object j) [0x00034] in <288f00de052a4d5c9bffcee795eebce7>:0
  at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context (System.Object state) <0x7fdaf2f4cf90 + 0x0004b> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f44a80 + 0x0014d> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f449c0 + 0x00041> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () <0x7fdaf2f4cf00 + 0x00046> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ThreadPoolWorkQueue.Dispatch () [0x00074] in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback () <0x7fdaf2f4cdd0 + 0x00018> in <285579f54af44a2ca048dad6be20e190>:0
[ERROR] FATAL UNHANDLED EXCEPTION: System.InvalidOperationException: Process must exit before requested information can be determined.
  at System.Diagnostics.Process.EnsureState (System.Diagnostics.Process+State state) [0x0008d] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at System.Diagnostics.Process.get_ExitCode () [0x00000] in <2703bbaa0a6e4686b6033c2dddb1a363>:0
  at (wrapper remoting-invoke-with-check) System.Diagnostics.Process.get_ExitCode()
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractRunner.Run (System.String languageCode, System.String psmMode, System.String engineMode, System.String imageFileName, System.Boolean run302) [0x002c2] in <288f00de052a4d5c9bffcee795eebce7>:0
  at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractThreadRunner.DoOcr (System.Object j) [0x00034] in <288f00de052a4d5c9bffcee795eebce7>:0
  at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context (System.Object state) <0x7fdaf2f4cf90 + 0x0004b> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f44a80 + 0x0014d> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) <0x7fdaf2f449c0 + 0x00041> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () <0x7fdaf2f4cf00 + 0x00046> in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading.ThreadPoolWorkQueue.Dispatch () [0x00074] in <285579f54af44a2ca048dad6be20e190>:0
  at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback () <0x7fdaf2f4cdd0 + 0x00018> in <285579f54af44a2ca048dad6be20e190>:0

lxs602 commented 4 years ago

Using the unpatched SubtitleEdit, here are the full debug logs:

MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all" mono SubtitleEdit.exe > SubtitleEdit.log

Blank lines using OCR: https://gofile.io/?c=yQDc1T

Successful OCR (but only if starting at the last line): https://gofile.io/?c=p4HnSu

lxs602 commented 4 years ago

I am not sure what the problem is. I did notice:

Mono: [0x7fe7bed36700] try unpark worker
Mono: [0x7fe7bed36700] try unpark worker, success? no
Mono: [0x7fe7bed36700] try create worker
Mono: [0x7fe7beb35700] worker starting
Mono: [0x7fe7beb35700] worker executing
Mono: [0x7fe7beb35700] worker running in domain 0x556d3561a080 (outstanding requests 0)
Mono: [0x7fe7bed36700] try create worker, created 0x7fe7beb35700, now = 8091 count = 2
Mono: [0x7fe7bed36700] request worker, created
Mono: AOT: FOUND method System.Random:Next (int) [0x7fe7de294880 - 0x7fe7de294910 0x7fe7de77b296]
Mono: AOT: FOUND method System.Random:Sample () [0x7fe7de294420 - 0x7fe7de294450 0x7fe7de77b26f]
Mono: AOT: FOUND method System.Threading.QueueUserWorkItemCallback:System.Threading.IThreadPoolWorkItem.ExecuteWorkItem () [0x7fe7de34df00 - 0x7fe7de34df80 0x7fe7de77febd]
Mono: [0x7fe7beb35700] worker parking
Mono: AOT: FOUND method System.Threading.Timer/Scheduler:TimerCB (object) [0x7fe7de355a40 - 0x7fe7de355b50 0x7fe7de7801ed]
Mono: AOT NOT FOUND: (wrapper remoting-invoke-with-check) System.Threading.Timer:Dispose ().
Mono: AOT: FOUND method System.Threading.Timer:Dispose () [0x7fe7de354a10 - 0x7fe7de354a50 0x7fe7de780140]
Mono: event_create: creating Event handle
Mono: mono_w32handle_new: create Event handle 0x556d3560a538
Mono: mono_w32handle_ref_core: ref Event handle 0x556d3560a538, ref: 1 -> 2
Mono: event_handle_create: created Event handle 0x556d3560a538
Mono: mono_w32handle_unref_core: unref Event handle 0x556d3560a538, ref: 2 -> 1 destroy: false
Mono: AOT: FOUND method System.Threading.ExecutionContext:IsFlowSuppressed () [0x7fe7de346660 - 0x7fe7de3466f0 0x7fe7de77fc20]
Mono: AOT: FOUND method System.Threading.ExecutionContext:Capture () [0x7fe7de3466f0 - 0x7fe7de346730 0x7fe7de77fc24]
Mono: AOT: FOUND method System.Runtime.InteropServices.GCHandle:Alloc (object) [0x7fe7de4892c0 - 0x7fe7de489300 0x7fe7de7896f7]
Mono: AOT NOT FOUND: (wrapper managed-to-native) System.Threading.ThreadPool:NotifyWorkItemComplete ().
Mono: AOT: FOUND method System.Runtime.InteropServices.GCHandle:op_Explicit (intptr) [0x7fe7de4893c0 - 0x7fe7de489450 0x7fe7de7896fb]
Mono: AOT NOT FOUND: (wrapper managed-to-native) System.Runtime.InteropServices.GCHandle:CheckCurrentDomain (int).
Mono: [0x7fe7bed36700] hill climbing, change max number of threads 4
Mono: [0x7fe7bed36700] worker parking
Mono: AOT: FOUND method System.Threading.ExecutionContext:Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object) [0x7fe7de345930 - 0x7fe7de3459c0 0x7fe7de77fbdc]
Mono: AOT: FOUND method System.Threading.ExecutionContext:OnAsyncLocalContextChanged (System.Threading.ExecutionContext,System.Threading.ExecutionContext) [0x7fe7de6e2080 - 0x7fe7de6e2431 0x7fe7de77fb9b]
Mono: AOT: FOUND method System.Delegate:DynamicInvoke (object[]) [0x7fe7de313b00 - 0x7fe7de313b40 0x7fe7de77e659]
Mono: AOT: FOUND method System.MulticastDelegate:DynamicInvokeImpl (object[]) [0x7fe7de318c60 - 0x7fe7de318d10 0x7fe7de77e917]
Mono: mono_w32handle_ref_core: ref Event handle 0x556d3560a538, ref: 1 -> 2
Mono: ves_icall_System_Threading_Events_SetEvent_internal: setting Event handle 0x556d3560a538
Mono: mono_w32handle_unref_core: unref Event handle 0x556d3560a538, ref: 2 -> 1 destroy: false
Mono: AOT: FOUND method System.Runtime.InteropServices.GCHandle:Free () [0x7fe7de489350 - 0x7fe7de4893b0 0x7fe7de7896f9]
Mono: AOT NOT FOUND: (wrapper managed-to-native) System.Runtime.InteropServices.GCHandle:FreeHandle (int).
Mono: DllImport searching in: 'libX11.so.6' ('libX11.so.6').
Mono: Searching for 'Xutf8ResetIC'.
Mono: Probing 'Xutf8ResetIC'.
Mono: Found as 'Xutf8ResetIC'.
Mono: DllImport searching in: 'libX11.so.6' ('libX11.so.6').
Mono: Searching for 'XUnsetICFocus'.
Mono: Probing 'XUnsetICFocus'.
Mono: Found as 'XUnsetICFocus'.

xylographe commented 4 years ago

Thank you, @lxs602

The first DEBUG output (running the patched SE) is rather informative. It contains a stack trace that should be read in reverse order (from bottom to top). Note that ImageJob is a class that contains all information to convert an image to text via tesseract(1).

The interesting part of the stack trace starts with

at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractThreadRunner.DoOcr (System.Object j)

Meaning that a previously scheduled ImageJob (ThreadPool.QueueUserWorkItem(DoOcr, job) in TesseractThreadRunner.cs) is being started.

at Nikse.SubtitleEdit.Logic.Ocr.Tesseract.TesseractRunner.Run (System.String languageCode, System.String psmMode, System.String engineMode, System.String imageFileName, System.Boolean run302)

The Run() method of a new TesseractRunner instance is invoked. This is where the actual conversion will be performed.

at System.Diagnostics.Process.get_ExitCode ()

Tesseract(1) has been run (its invocation should be logged in subtitle-edit.log). We have reached line 90 (TesseractRunner.cs) right after process.WaitForExit(8000) finished. Line 90 logs the Process.HasExited and Process.ExitCode properties, which should tell us if running tesseract(1) succeeded or failed.

at System.Diagnostics.Process.EnsureState (System.Diagnostics.Process+State state)

Check that the Process has exited, otherwise there won't be an ExitCode.

FATAL UNHANDLED EXCEPTION: System.InvalidOperationException: Process must exit before requested information can be determined.

Apparently, the process did not yet terminate, therefore accessing the ExitCode property throws an exception, but because it is UNHANDLED, no message box is shown. Under Windows this non-UI exception would be appended to the Windows event log, and the system default handler would report the exception to the user before terminating the application.

First conclusion, I made a mistake. Instead of

Log(id, $"hasexited=|{process.HasExited}| exitcode=|{process.ExitCode}|");

I should have written

if (process.HasExited)
{
    Log(id, $"hasexited=|{process.HasExited}| exitcode=|{process.ExitCode}|");
}
else
{
    Log(id, $"hasexited=|{process.HasExited}|");
}
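
The same guard can be tried from a shell when running tesseract by hand. This is only a sketch: it assumes GNU coreutils `timeout` is available, and `wait_and_report` is a hypothetical helper name. Unlike Process.WaitForExit(), `timeout` also terminates the command when the limit expires.

```shell
# Run a command with a time limit and print a line in the style of the patched log.
# GNU `timeout` exits with status 124 when the limit is hit, so a command that
# itself returns 124 would be misreported by this sketch.
wait_and_report() {
    limit_seconds=$1
    shift
    status=0
    timeout "$limit_seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # Still running when the limit expired: there is no exit code to report.
        echo "hasexited=|False|"
    else
        echo "hasexited=|True| exitcode=|$status|"
    fi
}
```

For example, `wait_and_report 8 tesseract in.png out -l eng --oem 3 hocr` would mirror the 8000 ms wait discussed here; the file names are placeholders.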

Second (much more important) conclusion, the OCR failure (empty text lines) is most likely caused by tesseract(1) taking more than 8000 milliseconds to finish. A possible quick solution would be to increase the maximum wait time (15 seconds perhaps).

I updated SE-3.5.11-issue3851.7z again. Its source code is in the xg/issue3851 branch.

Please, check if increasing the Process.WaitForExit() time delay solves the Tesseract OCR problem. Should you want to inspect subtitle-edit.log, remember to sort the file first (e.g. sort subtitle-edit.log >subtitle-edit.sorted.log).
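
Why a plain sort is enough: judging by the log excerpts in this thread, every line starts with two zero-padded numbers (apparently an image id and a running line counter; that reading is an assumption). Zero-padded fields compare correctly as text, so a byte-order sort groups the lines per image. A small sketch with made-up log lines:

```shell
# Sort log lines read from stdin by their leading zero-padded fields.
# LC_ALL=C forces byte-order comparison so the result does not depend on the locale.
# sort_log is a hypothetical helper name.
sort_log() {
    LC_ALL=C sort
}

# Interleaved sample lines (invented for this sketch):
printf '%s\n' \
    '0002 000003 [Tesseract OCR] second image' \
    '0001 000002 first image, second entry' \
    '0001 000001 first image, first entry' | sort_log
```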

lxs602 commented 4 years ago

Hi, no fix, unfortunately... logs are below. Error seems to be:

0077 000461 [parsing out-file] "<span class='ocr_line'" or "<span class='ocr_header'" not found

Subtitle-edit.log https://gofile.io/?c=vnhNdu

Mono debug: https://gofile.io/?c=xutyXY
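
The parser message suggests SE looks for `<span class='ocr_line'` or `<span class='ocr_header'` in the file tesseract writes. A rough way to run the same check by hand on an output file (`has_hocr_lines` is a hypothetical helper; the markers are copied from the log line above, and the idea that the output lands in `<outputbase>.hocr` is my assumption about Tesseract 4):

```shell
# Succeed if the given file contains one of the hOCR markers named in the SE log message.
has_hocr_lines() {
    grep -qE "<span class='ocr_(line|header)'" "$1"
}

# Example (not executed here):
#   tesseract image.png /tmp/out -l eng --oem 3 hocr
#   has_hocr_lines /tmp/out.hocr && echo "markers found" || echo "markers missing"
```

An empty or missing output file fails this check in the same way, which matches the situation where tesseract has not finished writing yet.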

xylographe commented 4 years ago

From SE log:

0001 000001 [Tesseract OCR] image-file=|/tmp/52a9e267-5701-4a04-947e-e2ff904d77b8.png|
0001 000002 cwd=|| file=|tesseract| args=|"/tmp/52a9e267-5701-4a04-947e-e2ff904d77b8.png" "/tmp/d02e91a6-a86e-46e2-8d88-9d1c7588dfe4" -l eng --oem 3 hocr|

Start of first ImageJob. From MONO log:

Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/52a9e267-5701-4a04-947e-e2ff904d77b8.png" "/tmp/d02e91a6-a86e-46e2-8d88-9d1c7588dfe4" -l eng --oem 3 hocr]

Confirmed by mono. From SE log:

0001 000115 hasexited=|False|

Same problem as before: tesseract did not exit yet (after 30 seconds). Apparently, tesseract is blocking. As I don't know exactly why, I made several changes in TesseractRunner.Run(), that may solve this problem. I also noticed too many (57) invocations of tesseract running in parallel.

Because the updated SE-3.5.11-issue3851.7z is rather slow, do not convert a complete subtitle, only the last 7-10 lines. In the worst case scenario that could still take 5-6 minutes.

lxs602 commented 4 years ago

Excellent, it worked. OCR was rapid, so no need for a 30 second timeout.

Mono log: https://gofile.io/?c=fIOeNw

Subtitle-edit log: https://gofile.io/?c=wVZOwq

The only thing now though is when using 'Prompt for unknown words', spell check has lost some options... it no longer allows Adding to Name List / Skip One / etc. https://gofile.io/?c=QxRswU

xylographe commented 4 years ago

In this update tesseract concurrency has been restored, albeit with a limit of 11 concurrent invocations to prevent unpleasant surprises. When testing, please, disable ‘Prompt for unknown words’.

‘Spell check’ dialogues (word-only/whole-text) when running on Windows:

SpellCheckWord

SpellCheckText

lxs602 commented 4 years ago

Hi, Tesseract exited with a timeout error, unfortunately. Testing was with 'Prompt for Unknown Words' disabled.

I noticed that SubtitleEdit was very slow after upgrading to mono 6.6.0.161. I have reverted to mono 6.4.0-198, and then 6.6.0, which were not slow.

Mono 6.4.0-198 Mono log: https://gofile.io/?c=E5S6tO

SubtitleEdit log: https://gofile.io/?c=P8eo1J

Mono 6.6.0 Mono log: https://gofile.io/?c=eUZluO

SubtitleEdit log: https://gofile.io/?c=WcwKkA

Mono 6.6.0-161 Mono log: https://gofile.io/?c=tZaYHU

NickZ commented 4 years ago

@lxs602 how did you revert back to Mono 6.6.0? I'm currently stuck at 6.7 because the official mono repo doesn't offer older versions.

xylographe commented 4 years ago

Excellent! We have now narrowed it down to two possible causes. I'm hoping this update will yield the desired result. If not, there is only one option left.

lxs602 commented 4 years ago

@NickZ

You can install specific snapshots from the Mono repository. On Ubuntu/Debian, you would edit /etc/apt/sources.list.d/mono-<_name-may-vary_>.list, change the name of your distribution, and specify the directory for the version in the repo, e.g. from:

deb http://download.mono-project.com/repo/ubuntu bionic main

to:

deb http://download.mono-project.com/repo/ubuntu bionic/snapshots/6.6.0 main

Then delete (and purge) all current installed mono files, apt-get update, and reinstall. I followed this guide here: https://stackoverflow.com/questions/33763177/install-older-version-of-mono.
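
The edit to the sources line can be sketched as a small filter. This is only an illustration of the rewrite described above: `mono_snapshot_line` is a hypothetical helper, and it assumes the line ends in ` main` and has the `/repo/ubuntu <distribution>` shape shown in the example.

```shell
# Rewrite a Mono apt source line read from stdin so that it points at a snapshot.
# $1 is the snapshot version, e.g. 6.6.0. Lines that do not match pass through unchanged.
mono_snapshot_line() {
    version=$1
    sed -E 's#^(deb .*/repo/ubuntu [^ ]+) main$#\1/snapshots/'"$version"' main#'
}

# Example (prints the rewritten line, changes nothing on disk):
echo 'deb http://download.mono-project.com/repo/ubuntu bionic main' | mono_snapshot_line 6.6.0
```

To apply it for real you would run the filter over a backup copy of the list file and write the result back with root privileges, then continue with the purge, `apt-get update` and reinstall.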

@xylographe

Out of interest, what are they?

xylographe commented 4 years ago

@lxs602 A change of strategy, as we're almost at the finish. I have uploaded three archives at once:

Try SE-3.5.11-issue3851C.7z first. If it works, it will become the final solution (after removing logging, of course).

Try SE-3.5.11-issue3851B.7z only if SE-3.5.11-issue3851C.7z didn't work. The only difference is the way tesseract's standard error is processed (in [B] it's sent to /dev/null).

Try SE-3.5.11-issue3851A.7z only if SE-3.5.11-issue3851B.7z didn't work. The difference is a Process.Refresh(), which shouldn't be necessary, but you never know.

The cause of Tesseract failure on Ubuntu is that tesseract gets blocked when several instances are being run concurrently. This might indicate a deadlock while acquiring resources. The short-term solution is to start tesseract instances sequentially.
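
For anyone who wants to reproduce the sequential versus concurrent behaviour outside SE, a rough shell sketch follows. It is not how SE schedules its jobs; `run_limited` is a hypothetical helper built on `xargs -P` (a GNU/BSD extension), with 1 meaning sequential and a larger number meaning a capped pool such as the limit of 11 mentioned earlier in this thread.

```shell
# Run a command once per input line, with at most $1 copies running at the same time.
# Each input line is appended to the command as its last argument.
run_limited() {
    max_jobs=$1
    shift
    # -P caps the number of simultaneous processes; -I{} runs the command once per line.
    xargs -P "$max_jobs" -I{} "$@" {}
}

# Harmless demonstration with echo instead of tesseract:
printf '%s\n' one two three | run_limited 1 echo
```

With real images it could look like `ls /tmp/*.png | run_limited 11 sh -c 'tesseract "$1" "${1%.png}" -l eng --oem 3 hocr' _`. The tesseract arguments are taken from the SE log in this thread, while deriving the output name from the image name is my simplification.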

lxs602 commented 4 years ago

Hi, 'C' was successful, but very slow. There was no noticeable difference between 'A' and 'B'.

Log files for each of the three versions are below: https://gofile.io/?c=HzJxXr

I'm not sure there was much difference in speed from when tesseract was running in single instance though... it may even have seemed a little slower.

Do you still have the previous version to test against, from the comment below?: https://github.com/SubtitleEdit/subtitleedit/issues/3851#issuecomment-562847141

xylographe commented 4 years ago

Do you still have the previous version to test against, …

Sort of. :) It is either SE-3.5.11-issue3851D.7z or SE-3.5.11-issue3851E.7z. I assume you are asking because it was the fastest of the lot: (D or E) 208, (A) 759, (B) 283, (C) 1816 [averages in milliseconds].

The difference between B and C is understandable. C processes tesseract's stderr (socket), B discards (/dev/null) all output from tesseract. The difference between A and B is inexplicable, the tiny modification in the code happens after the end time has been determined: in both A and B the start time is set immediately before Process.Start(), the end time is set immediately after Process.WaitForExit(), the code in between is exactly the same.

lxs602 commented 4 years ago

Hi, I tested all five versions using the same mkv file. There was less noticeable difference between them this time without other programs running in the background. https://gofile.io/?c=NUE0It

I then tried another, longer, mkv file, which generated an 'out of memory' error in all five versions (the log files are large): https://gofile.io/?c=BCCHcs

Is there any further way to adjust the invoking of Tesseract, so that multiple instances (A/B) complete OCR more quickly than in single instance (D/E)?

If there are any other debug techniques I can do to help, let me know. I am aware of monodevelop, and others, though not being a programmer I have not used them.

xylographe commented 4 years ago

Running tesseract successfully: (from Mono log)

Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/e9e9ba1a-a075-44a8-8cca-0e97e7c259fe.png" "/tmp/66ff7f55-2455-4252-9ac9-39434b046751" -l eng --oem 3 hocr]
Mono: mono_w32handle_new: create Process handle 0x56381d73e548
Mono: mono_w32handle_ref_core: ref Process handle 0x56381d73e548, ref: 1 -> 2
Mono: mono_w32handle_ref_core: ref Process handle 0x56381d73e548, ref: 2 -> 3
Mono: mono_w32handle_unref_core: unref Process handle 0x56381d73e548, ref: 3 -> 2 destroy: false
Mono: process_create: returning handle 0x56381d73e548 for pid 9958

A new process has been created, pid 9958 is the fork() return value in the parent.

Mono: process_wait (0x56381d73e548, 29733): PID: 9958
Mono: process_wait (0x56381d73e548, 29733): waiting on semaphore for 29733 ms...
Mono: process_wait (0x56381d73e548, 29733): Waited successfully
Mono: process_wait (0x56381d73e548, 29733): Setting pid 9958 signalled, exit status 0

Tesseract has finished successfully.

Mono: mono_w32handle_unref_core: unref Process handle 0x56381d73e548, ref: 1 -> 0 destroy: true
Mono: w32handle_destroy: destroy Process handle 0x56381d73e548
Mono: process_close
Mono: processes_cleanup
Mono: processes_cleanup done

At this point all resources used for this process should have been released.

When things go wrong:

Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/f20151bd-5d65-4935-bc89-97f952c7857b.png" "/tmp/6a473ef4-578d-4c7e-b318-345877a5d385" -l eng --oem 3 hocr]
Mono: process_create: returning handle (nil) for pid -1

The fork() return value is minus one with errno set to ENOMEM (or perhaps EAGAIN). This could, for example, happen if too many zombie children are waiting to be reaped. However, it seems rather unlikely that such a far-reaching bug would have stayed unnoticed for such a long time. But, whatever the cause, I'm afraid there's nothing SE can do to avoid it. Nonetheless, I'll try to limit OCR memory use in SE as much as possible in SE-3.5.11-issue3851F.7z.
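
One way to test the zombie hypothesis while SE is running is to count processes in the zombie state. The sketch below is Linux-only and reads /proc directly; `count_zombies` is a hypothetical helper, and a non-zero count only hints at unreaped children, it does not prove they belong to SE.

```shell
# Count processes whose /proc/<pid>/status reports state Z (zombie).
# Prints 0 when /proc is unavailable or no zombies exist.
count_zombies() {
    n=0
    for f in /proc/[0-9]*/status; do
        # Processes can vanish between the glob expansion and the read; skip those.
        [ -r "$f" ] || continue
        if grep -q '^State:[[:space:]]*Z' "$f" 2>/dev/null; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

count_zombies
```

Running it a few times during a long OCR job would show whether the number keeps growing.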

I created a large (9545 images) VobSub and processed it with SE-3.5.11-issue3851F. It took an awfully long time on Windows (almost 500ms per image, compared to less than 200ms on Linux), but eventually it did finish without running out of memory, or other exceptions.

This time try running mono without MONO_LOG_LEVEL and MONO_LOG_MASK, and after SE has started, immediately select "Import from Matroska file".

NickZ commented 4 years ago

@lxs602 have you been able to successfully choose "Tesseract only (can do italics)"? I have not been able to on Linux. Which version of Tesseract are you using? The one from the PPA?

lxs602 commented 4 years ago

@NickZ, I recently reinstalled Ubuntu after breaking some settings, and upgraded to Ubuntu 19.10. I had Tesseract from the PPA on Ubuntu 18.04, but I am now using the same version as from the PPA, but present in the normal repository on 19.10.

Can you choose "Tesseract 4" instead of "Binary image compare", under OCR method?

What Linux distribution are you using?

A screenshot of the SubtitleEdit running on my computer is below: https://gofile.io/?c=FJIhmF

@xylographe, I have tried the new version which ran fine. I suppose that is all for invoking Tesseract and for managing memory? https://gofile.io/?c=NjUHrU

There was a small error on trying to close SE. It then closes on the second attempt without an error message. I ran SE again with mono debugging to capture the error (see also the screenshots): https://gofile.io/?c=VXEf3v

xylographe commented 4 years ago

@lxs602 The error (exception) is caused by a missing assembly (System.Web.Services).

From a previous Mono log:

Mono: [...] looking for System.Web.Services, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a
Mono: Assembly Loader probing location: '/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll'.
Mono: Image addref System.Web.Services[0x562fa4d16610] (asmctx DEFAULT) -> /usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll[0x562fd962f0c0]: 2
Mono: Prepared to set up assembly 'System.Web.Services' (/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll)
Mono: Assembly System.Web.Services[0x562fa4d16610] added to domain SubtitleEdit.exe, ref_count=1
Mono: Assembly Loader loaded assembly from location: '/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll'.

The System.Web.Services assembly was found in the GAC.

From the last Mono log:

Mono: Assembly Loader probing location: '/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll'.
Mono: The following assembly [...] could not be loaded:
     Assembly:   System.Web.Services    (assemblyref_index=3)
     Version:    4.0.0.0
     Public Key: b03f5f7f11d50a3a
The assembly was not found in the Global Assembly Cache, a path listed in the MONO_PATH environment variable, or in the location of the executing assembly.

File /usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll no longer exists. Perhaps, it was removed while upgrading/downgrading Mono?
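
A quick way to check for the file by hand is to rebuild the GAC path from the three values shown in the log. This is a sketch under the assumption that the GAC lives under /usr/lib/mono/gac, as in the log lines above; both helper names are hypothetical.

```shell
# Build the GAC path for an assembly from its name, version and public key token.
gac_dll_path() {
    name=$1
    version=$2
    token=$3
    printf '/usr/lib/mono/gac/%s/%s__%s/%s.dll\n' "$name" "$version" "$token" "$name"
}

# Succeed if the assembly file exists at that path.
assembly_installed() {
    [ -f "$(gac_dll_path "$@")" ]
}

gac_dll_path System.Web.Services 4.0.0.0 b03f5f7f11d50a3a
```

`assembly_installed System.Web.Services 4.0.0.0 b03f5f7f11d50a3a` then answers the question directly on the affected machine.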

I have uploaded SE-3.5.11-issue3851H.7z. I expect this (minus the logging, of course) to be the final version. In case this version doesn't work on Ubuntu, I provided fallback version SE-3.5.11-issue3851G.

Merry Christmas, Waldi

lxs602 commented 4 years ago

Hi,

/usr/lib/mono/gac/System.Web.Services/4.0.0.0__b03f5f7f11d50a3a/System.Web.Services.dll no longer exists. Perhaps, it was removed while upgrading/downgrading Mono?

Sorry, I hadn't installed libmono-system-web-services4.0-cil when I installed Ubuntu again. It was installed as a dependency of mono-devel.

I tested both new versions, using G and then H with Film2, for comparison. https://gofile.io/?c=glZheG

Both G and H performed well.

I then used H with Film1 while the system was under heavy load.

On H, Tesseract struggled a bit under a loaded system, and gave a few timeout errors. Perhaps you would want to prompt an error dialogue, to say (something like): 'OCR complete, tesseract timed out on lines x,y and z, try running as a higher priority or switching off other applications'?

With the 'Prompt for Unknown Words' option, I found that pressing 'Edit whole text', and then 'Edit word only', brought back all the normal options.

Thank you for helping out and for your time.

Happy new year. L

xylographe commented 4 years ago

New version without logging in SE-3.5.11-issue3851.7z.

NickZ commented 4 years ago

@xylographe thanks for the update; this seems to be working a lot better now. I finally fixed the problem with running on "Tesseract only", which was a problem with Tesseract itself. However, when selecting "Tesseract only (can detect italics)", it seems to end up skipping a lot of lines. Rerunning it on those lines fixes those, so I think that concurrency may be to blame? Is there a way to turn off concurrency when Tesseract only is selected?

lxs602 commented 4 years ago

Hi, I just tried each of the four Engine Modes. I had left it with 'Default' so far.

Character recognition worked the same on 'Neural Nets LSTM only' as with 'Default, based on what is available'. However, only blank orange lines were produced on 'Original Tesseract Only' and on 'Tesseract + LSTM'.

This leads me to believe that 'Default' is using only 'Neural Nets LSTM', since both gave the message 'Invalid Resolution 0 dpi'. Conversely, the original Tesseract engine seems not to be working, by itself or with LSTM.

Again out of interest, what are Neural Nets and LSTM? When I searched for them I retrieved information on Deep Learning. Is this an alternative implementation of Tesseract, or something else?

I have attached a few logs again using H. https://gofile.io/?c=5Wp72J

@NickZ, why don't you upload some error logs?

xylographe commented 4 years ago

@NickZ SE-3.5.11-issue3851 never invokes tesseract more than once.

@lxs602 To use Original Tesseract you need traineddata from the ‘legacy’ set. See this comment and this comment.

xylographe commented 4 years ago

@lxs602 I've examined the logs. Because you are using traineddata from the tessdata-fast set (Ubuntu default set) only LSTM can work. Hence, Default must choose LSTM as nothing else is available. When forced to use Tesseract only (or Tesseract+LSTM), no output file is generated.
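
For reference, the four engine modes correspond to tesseract's `--oem` values 0 to 3, as far as I know from `tesseract --help-oem`; the SE log earlier in this thread shows `--oem 3` being passed for 'Default'. A small lookup sketch (the short mode names and the function name are made up for illustration):

```shell
# Map a short engine-mode name to the numeric value passed to tesseract via --oem.
oem_for_mode() {
    case $1 in
        legacy)      echo 0 ;;  # Original Tesseract only (needs legacy traineddata)
        lstm)        echo 1 ;;  # Neural nets LSTM only
        legacy+lstm) echo 2 ;;  # Tesseract + LSTM (needs both kinds of data)
        default)     echo 3 ;;  # Default, based on what is available
        *)
            echo "unknown engine mode: $1" >&2
            return 1
            ;;
    esac
}

oem_for_mode default
```

With LSTM-only traineddata installed, only values 1 and 3 can be expected to produce output, which matches the behaviour described above.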

Did you get the 'Invalid Resolution 0 dpi' warning for each image?

I know next to nothing about neural networks, but apparently Neural Networks and Deep Learning provides a good general introduction. More specific info about neural net OCR is available in Using Neural Networks for Optical Character Recognition.

lxs602 commented 4 years ago

Hi,

I found a few articles suitable for lay readers, such as below.

In summary, 'LSTM and Neural Networks' is the new engine, and 'Original Tesseract' is the legacy engine.

https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/
https://open.live.bbc.co.uk/mediaselector/6/redir/version/2.0/mediaset/audio-nondrm-download/proto/https/vpid/p07sy4sz.mp3

I got 'Invalid Resolution 0 dpi' only on lines 1 and 2.

I have uploaded the file for the video below if it helps. https://gofile.io/?c=3XNUOV

cecoates commented 4 years ago

@lxs602 I've only ever been able to use LSTM via Mono (or the "based on what's available" setting, which as you said looks like it's LSTM). Reviewing this thread it sounded like earlier you were actually using Tesseract though.

Is that the case, or were you using only LSTM the entire time?

If you did get it working, would you be able to summarize the steps you took? I've tried installing the dependencies/tesseract/etc. and no joy.

(Also, re: the OCR going backwards/jumping around, I wonder if some of the improvements in the neural network translations are because it tries to use the context of other lines to improve its recognition.)

lxs602 commented 4 years ago

@cecoates, Hi, I was using 'Default' for all the comments above (unless specified otherwise), which appears to have been using only LSTM.

I have just tried Tesseract legacy, using traineddata from the main tessdata repository (which includes the legacy models it apparently needs), and it seems to work well.

I used the steps below, with help from the comment above, for the English language:

cd /usr/share/tesseract-ocr/4.00/tessdata/
sudo mkdir fastdata_backup
sudo mv *.traineddata fastdata_backup
sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata
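
One caveat with the wget lines above: a github.com `/blob/` URL normally serves the HTML page that displays the file, not the file itself. A hedged sketch of fetching the raw file instead follows. `to_raw_url` and `fetch_traineddata` are made-up helper names, and it assumes the `master` branch path used in this thread still resolves.

```shell
# Sketch: turn a GitHub "blob" page URL into its "raw" download form.
to_raw_url() {
    printf '%s\n' "$1" | sed 's#/blob/#/raw/#'
}

# Hypothetical helper: $1 = language code, $2 = target directory.
fetch_traineddata() {
    fetch_url="$(to_raw_url "https://github.com/tesseract-ocr/tessdata/blob/master/$1.traineddata")"
    # -O writes to an explicit path so a partial or wrong file is easy to spot
    wget -O "$2/$1.traineddata" "$fetch_url"
}

# Example (prefix with sudo when writing under /usr/share):
# fetch_traineddata eng /usr/share/tesseract-ocr/4.00/tessdata
# fetch_traineddata osd /usr/share/tesseract-ocr/4.00/tessdata
```

The downloaded eng.traineddata should be a binary file of several megabytes; a file of a few dozen kilobytes is a sign that something else was saved.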

cecoates commented 4 years ago

I must be doing something wrong/missing an important step. Now even just neural net isn't working for me.

I installed these packages: https://www.nikse.dk/SubtitleEdit/Help#linux

Then downloaded the portable and beta Subtitle Edit: https://github.com/SubtitleEdit/subtitleedit/releases

Followed the steps you mentioned:

cd /usr/share/tesseract-ocr/4.00/tessdata/

sudo mkdir fastdata_backup

sudo mv *.traineddata fastdata_backup

sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

sudo wget https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata

Extracted the Subtitle Edit zips and used mono to run them: mono SubtitleEdit.exe

Imported an MKV/PGS/.sup file.

Changed the OCR method to Tesseract 4.

Language: English.

Original Tesseract Only.

Started OCR.

Then I get hit with this error popup:

Tesseract returned with code 1

It pops up twice.

Then I see nothing but blank orange lines.

Originally the Neural Nets method worked for me, but now it causes "Tesseract returned with code 1" errors too:

2020-01-24_18-09

Attaching log from using:

MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all" mono SubtitleEdit.exe > SubtitleEdit.log

Have I missed an obvious step somewhere?

SubtitleEdit.log

(Also, love the program overall and it works perfectly for me in Windows, I just prefer Linux.)

xylographe commented 4 years ago

@cecoates From your Mono log file,

Mono: process_wait (0x563ca53d26f8, 7990): PID: 4200
Mono: process_wait (0x563ca53d26f8, 7990): waiting on semaphore for 7990 ms...
Mono: process_wait (0x563ca53d26f8, 7990): Waited successfully
Mono: process_wait (0x563ca53d26f8, 7990): Setting pid 4200 signalled, exit status 1

Apparently, tesseract has terminated abnormally (signalled). In fact, this happens every time tesseract is run by SE — each "Waited successfully" is followed by "signalled, exit status 1".
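
A quick way to check this across a whole log is to count the exit statuses Mono recorded. This is just a sketch: `summarize_exit_statuses` is a made-up name, and it assumes the log lines look like the ones quoted above.

```shell
# Sketch: tally the "exit status N" entries in a Mono debug log.
summarize_exit_statuses() {
    # -o prints only the matching part of each line, so equal statuses group together
    grep -o 'exit status [0-9]*' "$1" | sort | uniq -c
}

# Example:
# summarize_exit_statuses SubtitleEdit.log
```

If every entry is "exit status 1", the problem is with tesseract itself rather than with particular images.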

Try running tesseract from the shell (bash/zsh) like this,

/usr/bin/tesseract some-image.png tess-result -l eng --oem 0 hocr

Where some-image.png is the input file you need to provide (if necessary, you can use Export > BDN xml/png in the OCR/Import dialogue), and the result (when tesseract terminates normally) will be written to tess-result.hocr. Engine mode 0 is the legacy engine (original tesseract in SE).
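
If you export many images, the same command can be looped over the whole directory, printing tesseract's exit code per file. A minimal sketch, assuming the PNGs sit in one directory; `ocr_exported_images` is a made-up name.

```shell
# Sketch: run the legacy engine on every exported PNG and report exit codes.
ocr_exported_images() {
    ocr_dir="$1"
    ocr_lang="${2:-eng}"
    for ocr_img in "$ocr_dir"/*.png; do
        [ -e "$ocr_img" ] || continue        # directory contains no PNG files
        ocr_status=0
        # Output goes to <image name without .png>.hocr next to the image
        tesseract "$ocr_img" "${ocr_img%.png}" -l "$ocr_lang" --oem 0 hocr || ocr_status=$?
        echo "$(basename "$ocr_img"): exit $ocr_status"
    done
}

# Example (the directory name is a placeholder):
# ocr_exported_images ./bdn-export eng
```

Any error text tesseract prints to the terminal for a failing file is usually more informative than the popup in SE.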

NickZ commented 4 years ago

@xylographe Off topic, but have you been able to run a debugger for SE on Linux? I've been trying to get a debugger to work for the snap packaging I've been working on, and none of the Visual Studio extensions that support remote Mono debugging (like VSMonoDebugger) seem to work.

NickZ commented 4 years ago

Never mind, I figured it out.

For reference:

I used Visual Studio, VSMonoDebugger, and a VM running Ubuntu.

1) Configure the VM's IP address in VSMonoDebugger, then use its deploy option to put the build onto the Ubuntu VM; this also generates the required mdb files.
2) On the Ubuntu VM, run: mono --debugger-agent=address=0.0.0.0:11000,transport=dt_socket,server=y --debug=mdb-optimizations SubtitleEdit.exe
3) Use the attach option on VSMonoDebugger.

NickZ commented 4 years ago

@lxs602 I believe that using this snap package that I created will resolve the issues that you're having with Tesseract: #3952 Can you test it out and report back?

cecoates commented 4 years ago

Argh. I did two things wrong.

1) I didn't set the TESSDATA_PREFIX environment variable:

export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata

2) For some reason, when I used wget the download got cut off at ~60 kB. I manually downloaded:

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata

Then:

cd ~/Downloads && sudo mv *.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

Now it's working. Thanks @xylographe and @lxs602
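
A file that stops at ~60 kB may not be model data at all; one possibility is that a web page was saved under the .traineddata name. A rough sanity check could look like the sketch below. `looks_like_traineddata` is a made-up helper and only catches the obvious cases (missing, empty, or HTML).

```shell
# Sketch: reject files that are missing, empty, or start like an HTML page.
looks_like_traineddata() {
    check_file="$1"
    [ -s "$check_file" ] || { echo "missing or empty: $check_file"; return 1; }
    # Only inspect the first bytes; real traineddata is binary
    if head -c 512 "$check_file" | LC_ALL=C grep -qiE '<!DOCTYPE|<html'; then
        echo "HTML page, not model data: $check_file"
        return 1
    fi
    echo "plausible: $check_file ($(wc -c < "$check_file") bytes)"
}

# Example:
# looks_like_traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
```

Passing this check does not prove the file is complete, so comparing the byte count with the size shown on the download page is still worthwhile.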

Because I'm trying to wrap my head around it, this is... Tesseract 4 using the Tesseract 3 data?

xylographe commented 4 years ago

AFAIK, /usr/share/tesseract-ocr/4.00/tessdata is the standard location on Ubuntu. SE will also (when running on Linux) use that directory, if it exists. Hence, setting TESSDATA_PREFIX should not be necessary. Note that Tesseract will use the value of TESSDATA_PREFIX, but SE does not!
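
To check whether the two programs would look in the same place, something like the sketch below may help. The function names are made up, the fixed path is simply the one quoted in this thread, and the fallback used when TESSDATA_PREFIX is unset is an assumption for Ubuntu, not tesseract's documented default.

```shell
# Directory SE is said (in this thread) to use on Linux when it exists.
SE_TESSDATA_DIR=/usr/share/tesseract-ocr/4.00/tessdata

# Directory a tesseract run from this shell would be pointed at.
# Assumption: falls back to the Ubuntu path above when TESSDATA_PREFIX is unset.
tessdata_dir_for_tesseract() {
    echo "${TESSDATA_PREFIX:-$SE_TESSDATA_DIR}"
}

# Report whether both would read the same directory (trailing slash ignored).
report_tessdata_dirs() {
    report_dir="$(tessdata_dir_for_tesseract)"
    if [ "${report_dir%/}" = "${SE_TESSDATA_DIR%/}" ]; then
        echo "same directory: $report_dir"
    else
        echo "mismatch: tesseract reads $report_dir, SE reads $SE_TESSDATA_DIR"
    fi
}

# Example:
# report_tessdata_dirs
# tesseract --list-langs    # shows the languages tesseract itself can find
```

A mismatch would explain a case where tesseract works from the shell but not from SE, or the other way round.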

Corrupted *.traineddata files, OTOH, are probably a good reason for a tesseract crash. :)