CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

CCextractor says "OCR subsystem not present" although compiled with OCR support #442

Closed ghost closed 3 years ago

ghost commented 8 years ago

On a Linux system I have compiled version 0.82 of CCextractor with these lines:

    cmake -DWITH_OCR=ON ../src/
    make
    sudo make install

but when I run this command

    ccextractor -pn 7176 -codec dvbsub Pointless.ts

(The program number was taken from 'mediainfo'.) Among other things, CCextractor outputs this:

    Opening file: Pointless.ts
    File seems to be a transport stream, enabling TS mode
    Analyzing data in general mode
    DVB subtitles detected, OCR subsystem not present. Use -out=spupng for graphic output
    DVB subtitles detected, OCR subsystem not present. Use -out=spupng for graphic output
    DVB subtitles detected, OCR subsystem not present. Use -out=spupng for graphic output
    Creating Pointless.srt

The generated Pointless.srt is 3 bytes long and contains this hex string "bbef 00bf".

I have installed leptonica-devel and tesseract-ocr-devel.

Have I missed something during compilation, or am I using the wrong parameters in my call to CCextractor (or both)?
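The 3-byte output file described above matches the UTF-8 byte order mark (bytes EF BB BF), which suggests the .srt contains no subtitle text at all. A minimal Python sketch of that reading, assuming the hex string was produced by a tool that prints 16-bit little-endian words (the post does not name the tool); the helper names are illustrative only:

```python
import codecs
import struct


def as_le_words(data: bytes) -> str:
    """Format bytes the way a 16-bit little-endian hex dump would."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    words = struct.unpack("<%dH" % (len(data) // 2), data)
    return " ".join("%04x" % w for w in words)


def is_only_bom(data: bytes) -> bool:
    """True if the data is a UTF-8 byte order mark and nothing else."""
    return data == codecs.BOM_UTF8


bom = codecs.BOM_UTF8        # b"\xef\xbb\xbf"
print(as_le_words(bom))      # bbef 00bf
print(is_only_bom(bom))      # True
```

Running is_only_bom on the contents of a generated .srt (read in binary mode) would be one quick way to tell an empty result from a real one.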

cfsmp3 commented 8 years ago

You need to build with OCR support, not just have the libraries installed.

ghost commented 8 years ago

I thought I had, but apparently not. Anyway, I got it working by running these commands in /usr/local/src/ccextractor (a soft link to ccextractor.0.82):

    cd build
    cmake -DWITH_OCR=ON ../src/
    cd ../linux/
    make clean
    make ENABLE_OCR=yes
    make install

('make clean' only to start from a clean slate).

Allow me an additional question: When I now run CCextractor I do get a .srt file but CCextractor complains a little:

    Opening file: Pointless.ts
    File seems to be a transport stream, enabling TS mode
    Analyzing data in general mode
    dan.traineddata not found! Switching to English
    swe.traineddata not found! Switching to English
    fin.traineddata not found! Switching to English
    Creating Pointless.srt

Using English trained data on Scandinavian texts makes for funny results!

The tesseract-ocr trained data is installed in /usr/share/tessdata/. So my additional question is actually two:

  1. How do I get CCextractor to read the trained data?
  2. How do I specify to CCextractor which language I want?

ghost commented 8 years ago

I may have part of an answer to my question 1 above. When I ran 'strace' on CCextractor, I found that it looks in the current directory for the trained data:

    openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    write(1, "dan.traineddata not found! Switc"..., 48dan.traineddata not found! Switching to English

but globally to find the English data:

open("/usr/share/tessdata/eng.traineddata", O_RDONLY) = 4

When I added a link from the current directory to /usr/share/tessdata, CCextractor stopped complaining about missing data.

It is inconvenient to have to add links to every directory where I have videos stored, so is this a fault or a feature?
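The workaround described above (a link named tessdata in the working directory) can be scripted so that it does not have to be repeated by hand. A minimal Python sketch, assuming the trained data lives in /usr/share/tessdata as stated earlier in the thread; the helper name link_tessdata is hypothetical and not part of CCExtractor:

```python
import os
from pathlib import Path


def link_tessdata(video_dir, tessdata_dir="/usr/share/tessdata"):
    """Create video_dir/tessdata as a symlink to tessdata_dir.

    Returns True if a new link was created, False if something
    named 'tessdata' already exists in video_dir.
    """
    link = Path(video_dir) / "tessdata"
    if link.exists() or link.is_symlink():
        return False  # leave existing files, directories, or links alone
    os.symlink(tessdata_dir, link, target_is_directory=True)
    return True
```

Calling link_tessdata(".") from a video directory before running CCextractor reproduces the manual link; it does not change where CCextractor itself searches.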

ykarim commented 8 years ago

@BentB could you please close this issue, as the original "OCR subsystem not present" problem is now resolved? You can open another issue for your new problem if it still exists.

ghost commented 7 years ago

I have moved the above questions on tesseract data to a new issue, #448.

anshul1912 commented 7 years ago

Please help us with pull request

It's neither a bug nor a feature; it's an incomplete implementation.

-Anshul

ghost commented 7 years ago

@anshul1912 I'm not quite familiar with life here at GitHub, so please expand a bit on what you mean by "Please help us with pull request". I know 'pull' from Git, but not in this context. Sorry about that.

wojtekw commented 4 years ago

I get the same "OCR subsystem not present" error on macOS, although Leptonica and Tesseract are installed on the system. CCX -v shows:

    Version: 0.88
    Git commit: Unknown
    Compilation date: 2020-02-04
    File SHA256: fa4b6f64af9f923a0fca842ae017a189740de63916188b8afa43e6c00acb07b5
    Libraries used by CCExtractor
    libGPAC Version: 0.7.2-DEV
    zlib: 1.2.11
    utf8proc Version: 2.4.0
    protobuf-c Version: 1.3.1
    libpng Version: 1.6.35
    FreeType
    libhash
    nuklear
    libzvbi

Do you know where the problem might be?
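One thing the version reports in this thread can be checked for is whether Tesseract and Leptonica appear under "Libraries used by CCExtractor": the 0.94 report later in the thread lists both, while the 0.88 reports list neither. A minimal Python sketch that extracts those entries from a pasted report; treating a missing entry as a sign that OCR was not compiled in is an assumption, not something the thread confirms, and ocr_libraries is a hypothetical helper name:

```python
import re


def ocr_libraries(version_text: str) -> dict:
    """Return the Tesseract and Leptonica versions named in a version report."""
    found = {}
    for name in ("Tesseract", "Leptonica"):
        match = re.search(name + r" Version:\s*(\S+)", version_text)
        if match:
            found[name] = match.group(1)
    return found


# Excerpts copied from reports in this thread.
report_088 = "Libraries used by CCExtractor libGPAC Version: 0.7.2-DEV zlib: 1.2.11"
report_094 = ("Libraries used by CCExtractor Tesseract Version: 4.1.1 "
              "Leptonica Version: leptonica-1.76.0 libGPAC Version: 1.0.1")

print(ocr_libraries(report_088))
print(ocr_libraries(report_094))
```

An empty result is only a hint: a report that does list both libraries came from a build that still showed the problem, so the check can rule a build out but cannot confirm that OCR works.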

rialg commented 4 years ago

Hello, I get the same issue on Ubuntu 16.04 using Tesseract 4.1.1, even after following @ghost's compilation guide.

Linux desktop 4.15.0-76-generic #86~16.04.1-Ubuntu SMP Mon Jan 20 11:02:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

    CCExtractor detailed version info
    Version: 0.88
    Git commit: 6697ed34967343830178f8452e276ab0d94f08e0
    Compilation date: 2020-02-04
    File SHA256: Could not open file
    Libraries used by CCExtractor
    libGPAC Version: 0.7.2-DEV
    zlib: 1.2.8
    utf8proc Version: 2.4.0
    protobuf-c Version: 1.3.1
    libpng Version: 1.2.54
    FreeType
    libhash
    nuklear
    libzvbi

Reading from UDP socket 226.51.0.0:1234
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
DVB subtitles detected, OCR subsystem not present. Use -out=spupng for graphic output
TS continuity counter not incremented prev/curr 4/6

Found large gap(1072860) in PTS! Trying to recover ...

Found large gap(1072861) in PTS! Trying to recover ...

Found large gap(1072864) in PTS! Trying to recover ...

Found large gap(1072865) in PTS! Trying to recover ...

Found large gap(1072862) in PTS! Trying to recover ...

Found large gap(1072863) in PTS! Trying to recover ...

wojtekw commented 4 years ago

@cfsmp3 Can you reopen the issue?

cfsmp3 commented 3 years ago

Closing, as we've made a lot of changes to the build lately, so I don't know whether this is still an issue.

@wojtekw @rialg let me know if it's still a problem in master

kousthub97 commented 1 year ago

Hello, I am facing the same issue with Tesseract 4.1.1 and leptonica-1.76.0. I tried compiling with the steps below and with @ghost's steps; neither worked for me. Please let me know if anything needs to change when compiling.

    mkdir build
    cd build
    cmake -DWITH_OCR=ON -DWITHOUT_RUST=ON ../src/
    make

I am using CentOS 8 for compiling. Below is the ccextractor --version output:

CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke. Teletext portions taken from Petr Kutalek's telxcc

    CCExtractor detailed version info
    Version: 0.94
    Git commit: 35e73c1c90ce3ca69394d3523836bb1cdec28f11
    Compilation date: 2023-08-04
    CEA-708 decoder: C
    File SHA256: 08b9e909cc730e591a4331eef6dd45584a20e4a92c8dbf3fc37bf570f48ce79e
    Libraries used by CCExtractor
    Tesseract Version: 4.1.1
    Leptonica Version: leptonica-1.76.0
    libGPAC Version: 1.0.1
    zlib: 1.2.11
    utf8proc Version: 2.4.0
    protobuf-c Version: 1.3.1
    libpng Version: 1.6.37
    FreeType
    libhash
    nuklear
    libzvbi

ldd output for ccextractor

    linux-vdso.so.1 (0x00007ffcf98db000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fe365526000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe365306000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fe365102000)
    libtesseract.so.4 => /lib64/libtesseract.so.4 (0x00007fe364b9b000)
    liblept.so.5 => /lib64/liblept.so.5 (0x00007fe36471a000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fe364358000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe3658a8000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fe363fc3000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fe363dab000)
    libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fe363b73000)
    libpng16.so.16 => /lib64/libpng16.so.16 (0x00007fe36393e000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fe363727000)
    libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007fe3634be000)
    libgif.so.7 => /lib64/libgif.so.7 (0x00007fe3632b4000)
    libtiff.so.5 => /lib64/libtiff.so.5 (0x00007fe36303b000)
    libwebp.so.7 => /lib64/libwebp.so.7 (0x00007fe362dcd000)
    libjbig.so.2.1 => /lib64/libjbig.so.2.1 (0x00007fe362bc1000)

If I use the tesseract command directly, image-to-text conversion works.
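The ldd listing above can also be checked mechanically for the two OCR libraries. A minimal Python sketch that parses pasted ldd output and reports whether libtesseract and liblept resolve to a path; the helper names are hypothetical, and a positive result only shows that the binary is dynamically linked against the libraries, not that the OCR code path is enabled:

```python
def linked_libraries(ldd_text: str) -> dict:
    """Map each library name in ldd output to its resolved path (None if not found)."""
    libs = {}
    tokens = ldd_text.split()
    for i, tok in enumerate(tokens):
        # ldd prints entries as "name => path (address)" or "name => not found"
        if tok == "=>" and 0 < i < len(tokens) - 1:
            path = tokens[i + 1]
            libs[tokens[i - 1]] = None if path == "not" else path
    return libs


def has_ocr_libs(ldd_text: str) -> bool:
    """True if both libtesseract and liblept resolve to a path."""
    libs = linked_libraries(ldd_text)

    def resolved(prefix: str) -> bool:
        return any(name.startswith(prefix) and path for name, path in libs.items())

    return resolved("libtesseract") and resolved("liblept")


# Excerpt copied from the ldd output in this comment.
sample = (
    "linux-vdso.so.1 (0x00007ffcf98db000) "
    "libtesseract.so.4 => /lib64/libtesseract.so.4 (0x00007fe364b9b000) "
    "liblept.so.5 => /lib64/liblept.so.5 (0x00007fe36471a000)"
)
print(has_ocr_libs(sample))  # True
```

Because the parser splits on whitespace, it accepts the listing either one entry per line or flattened onto a single line.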

Neo2SHYAlien commented 1 year ago

@kousthub97 try compiling the previous commit 0264e7da2be67182deb031228eb07e6ed4943c81 or the v0.94 tag :) Both should work.

kousthub97 commented 1 year ago

@Neo2SHYAlien Thanks for the help, it worked with the v0.94 tag.