SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

Tesseract OCR fails on Ubuntu 18.04 #3851

Closed lxs602 closed 2 years ago

lxs602 commented 4 years ago

Hi,

I have tried Subtitle Edit 3.5.11 and 3.5.11 Beta on Ubuntu 18.04 amd64.

When using Tesseract 4 on a Matroska file, no text is detected and only blank orange lines are produced.

I can upload debug files, or patch, or supply the video files I used if that helps.

Thanks.

lxs602 commented 4 years ago

Hi, has this been included in the current version, or will it be in a future release? I gave 3.5.13 a try, but it didn't seem to be.

dausruddin commented 4 years ago

My SubtitleEdit hangs when I start the OCR process. After reading this issue, I decided to upgrade Tesseract from 4.0.0, as shipped in the Ubuntu repo, to 4.1.1 from Tesseract's PPA.

After the Tesseract upgrade, OCR worked fine on a subtitle... until about halfway through. Now all lines are coloured orange with an empty result.

I decided to just go with Wine, much less headache.

lxs602 commented 3 years ago

I have tried nOCR on the latest releases, but I think I prefer Tesseract.

Is there any chance someone can integrate the patch into the main code?

It looks like @xylographe is away, hope he is ok.

L

niksedk commented 2 years ago

It has been a while since we have seen @xylographe :( I hope he is well too.

I've just tested Tesseract (version 5) with latest Ubuntu... and that worked okay for me. Does Tesseract 5 work for you?

lxs602 commented 2 years ago

Hi, I can confirm it is now working on Ubuntu 22.04, with Tesseract 5 installed from the PPA.

However, dictionary downloads by Subtitle Edit do not seem to fetch the whole file on my system.

@cecoates - I then get the same error message you did, "Tesseract returned with code 1", and lots of blank orange lines.

If you look in your dictionaries folder (click on the link in Subtitle Edit, or open /usr/share/tesseract/5/tessdata or similar), what size are the traineddata files? I found mine were only about 130 kB instead of roughly 14-20 MB.

When I downloaded them manually from here and placed them in the dictionary folder, tesseract and Subtitleedit then worked normally.
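
For anyone who wants to check this quickly, here is a minimal Python sketch that lists suspiciously small traineddata files in a folder. The 1 MB threshold and the function name `find_truncated_traineddata` are assumptions for illustration only, based on the sizes mentioned above (about 130 kB for a broken download versus roughly 14-20 MB for a complete one). Some legitimate traineddata files may be smaller than that, so treat a hit as a hint rather than proof.

```python
from pathlib import Path

# Assumed threshold (not from Subtitle Edit or Tesseract): 1 MB sits between
# the ~130 kB truncated files and the ~14-20 MB complete ones reported above.
MIN_EXPECTED_BYTES = 1_000_000


def find_truncated_traineddata(tessdata_dir, min_bytes=MIN_EXPECTED_BYTES):
    """Return sorted (file name, size in bytes) pairs for small .traineddata files."""
    folder = Path(tessdata_dir)
    return sorted(
        (path.name, path.stat().st_size)
        for path in folder.glob("*.traineddata")
        if path.is_file() and path.stat().st_size < min_bytes
    )


if __name__ == "__main__":
    import sys

    # Default path is the one mentioned in the comment; pass your own
    # dictionaries folder as the first argument if it differs.
    target = sys.argv[1] if len(sys.argv) > 1 else "/usr/share/tesseract/5/tessdata"
    for name, size in find_truncated_traineddata(target):
        print(f"{name}: {size / 1024:.0f} KiB, probably incomplete")
```

Any file it reports is a candidate for re-downloading manually and replacing in the dictionaries folder.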