CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

[BUG] dvblang option doesn't work #1161

Closed hamelg closed 4 years ago

hamelg commented 4 years ago

CCExtractor detailed version info
Version: 0.88
Git commit: bc3d729e30a751feb9b854a54c085f0e81a99134
Compilation date: 2019-12-25
File SHA256: Could not open file
Libraries used by CCExtractor
Tesseract Version: 4.1.1
Leptonica Version: leptonica-1.78.0
libGPAC Version: 0.7.2-DEV
zlib: 1.2.11
utf8proc Version: 2.2.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.35
FreeType
libhash
nuklear
libzvbi

Issue description

Some French DVB channels don't use ISO 639-2 to specify the language of the subtitle streams. Here is an example:

Input #0, mpegts, from '1008_20191216001500.ts':
  Duration: 00:04:59.97, start: 1.400000, bitrate: 4007 kb/s
  Program 1 
    Metadata:
      service_name    : Service01
      service_provider: FFmpeg
    Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709, top first), 1920x1080 [SAR 1:1 DAR 16:9], 25 fps, 25 tbr, 90k tbn, 50 tbc
    Stream #0:1[0x101](fre): Audio: eac3 ([135][0][0][0] / 0x0087), 48000 Hz, stereo, fltp, 128 kb/s
    Stream #0:2[0x102](qaa): Audio: eac3 ([135][0][0][0] / 0x0087), 48000 Hz, stereo, fltp, 128 kb/s
    Stream #0:3[0x103](fra): Audio: eac3 ([135][0][0][0] / 0x0087), 48000 Hz, stereo, fltp, 128 kb/s (visual impaired) (descriptions)
    Stream #0:4[0x104](fre): Subtitle: dvb_subtitle ([6][0][0][0] / 0x0006) (hearing impaired)
    Stream #0:5[0x105](fre): Subtitle: dvb_subtitle ([6][0][0][0] / 0x0006)
    Stream #0:6[0x106]: Data: bin_data ([6][0][0][0] / 0x0006)

On the subtitle streams, the language code should be "fra", and not "fre".

The following command fails to find the subtitle stream:

$ ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra mythtv/1008_20191216001500/1008_20191216001500.ts
...
Analyzing data in general mode
Ignoring stream language 'und' not equal to dvblang 'fre'
Ignoring stream language 'und' not equal to dvblang 'fre'
...
No captions were found in input.

It fails because the code "fre" doesn't exist in the language array (see lib_ccx/ccx_common_constants.c).
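
For context, ISO 639-2 defines two codes for some languages: a bibliographic one ("fre" for French) and a terminological one ("fra"). A minimal sketch of treating the two as equivalent when comparing a stream's language against -dvblang follows. The alias table and the function names normalize_iso639_2 and same_language are hypothetical and are not taken from ccextractor's source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table (not ccextractor's language array):
   ISO 639-2 bibliographic (B) code -> terminological (T) code. */
struct lang_alias {
    const char *bibliographic;
    const char *terminological;
};

static const struct lang_alias lang_aliases[] = {
    {"fre", "fra"}, /* French */
    {"ger", "deu"}, /* German */
    {"dut", "nld"}, /* Dutch */
    {"cze", "ces"}, /* Czech */
    {"gre", "ell"}, /* Modern Greek */
};

/* Return the T code for a known B code; otherwise return the input unchanged. */
const char *normalize_iso639_2(const char *code)
{
    size_t n = sizeof(lang_aliases) / sizeof(lang_aliases[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(code, lang_aliases[i].bibliographic) == 0)
            return lang_aliases[i].terminological;
    }
    return code;
}

/* Two codes name the same language if they normalize to the same T code. */
int same_language(const char *a, const char *b)
{
    return strcmp(normalize_iso639_2(a), normalize_iso639_2(b)) == 0;
}
```

With a comparison like this, same_language("fre", "fra") returns 1, so either spelling given to -dvblang would match a stream tagged with the other.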

gauravahlawat81 commented 4 years ago

Can you please provide some video samples for this issue?

hamelg commented 4 years ago

The link is valid for 30 days. http://dl.free.fr/k2j8OpZJF

NilsIrl commented 4 years ago

Only -dvblang is relevant to the problem

hamelg commented 4 years ago

The fix doesn't work. Now, the -ocrlang option has no effect ...

$ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
eng.traineddata not found! No Switching Possible
...

It doesn't select the fra.traineddata file despite "-ocrlang fra".

NilsIrl commented 4 years ago

Tested and it worked.

It seems ccextractor is unable to find the OCR data. You can set the TESSDATA_PREFIX environment variable to point it at another location.

For example, here is the command I run:

$ TESSDATA_PREFIX=/nix/store/9yawzjj82bib4dr9x7y340w10c3k319y-tesseract-3.05.00/share/ ./ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra files/1008_20191216001500.ts

hamelg commented 4 years ago

I tested again, but it definitely doesn't work. The file is in the right place, and setting TESSDATA_PREFIX makes no difference.

$ ls -l /usr/share/tessdata/fra.traineddata 
-rw-r--r-- 1 root root 14213351 Nov 11  2018 /usr/share/tessdata/fra.traineddata
$ TESSDATA_PREFIX=/usr/share/tessdata/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
eng.traineddata not found! No Switching Possible
...
No captions were found in input.

NilsIrl commented 4 years ago

Something does indeed seem to be broken, as there is no reason ccextractor shouldn't be able to find the file by itself (in /usr/share/tessdata/).

In any case, that invocation isn't supposed to work: TESSDATA_PREFIX should be set to the directory above tessdata.

try like this:

$ TESSDATA_PREFIX=/usr/share/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
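
To illustrate the point about TESSDATA_PREFIX, here is a small C sketch that assumes the convention described above: the language file is looked up at <prefix>/tessdata/<lang>.traineddata. The helper build_traineddata_path is hypothetical, and the convention may differ between Tesseract versions, so check the documentation for the version in use.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Builds "<prefix>/tessdata/<lang>.traineddata" into out.
   Returns 0 on success, -1 if the result does not fit in out. */
int build_traineddata_path(char *out, size_t out_size,
                           const char *prefix, const char *lang)
{
    size_t len = strlen(prefix);
    /* Add a separator only when the prefix does not already end in '/'. */
    const char *sep = (len > 0 && prefix[len - 1] == '/') ? "" : "/";
    int written = snprintf(out, out_size, "%s%stessdata/%s.traineddata",
                           prefix, sep, lang);
    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

Under this assumption, a prefix of /usr/share/tessdata/ yields /usr/share/tessdata/tessdata/fra.traineddata, which does not exist on the system shown above, while a prefix of /usr/share/ yields /usr/share/tessdata/fra.traineddata.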

cfsmp3 commented 4 years ago

Why is it trying to read eng.traineddata if you specified fra? That's definitely broken...

cfsmp3 commented 4 years ago

@NilsIrl try deleting your file eng.traineddata (or rename it to fra.traineddata) and see if it still works for you.

NilsIrl commented 4 years ago

try deleting your file eng.traineddata (or rename it to fra.traineddata) and see if it still works for you.

I've tested that ccextractor is using fra.traineddata, but let me check again.

hamelg commented 4 years ago

try like this:

$ TESSDATA_PREFIX=/usr/share/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts

ditto, same result

NilsIrl commented 4 years ago

It seems to have been broken before.

cfsmp3 commented 4 years ago

Well, since both of you, @NilsIrl and @hamelg, are around right now, it seems like this can be solved once and for all really quickly.

By the way @hamelg, maybe running ccextractor with strace and looking for open() calls will tell us exactly where tesseract is actually looking for the file (as opposed to what we think it's doing).

NilsIrl commented 4 years ago

280b4308f7f7ff769fd9c3fe2b03a7259644bfdb is broken for me as well. (last PR before CGI and v0.88)

hamelg commented 4 years ago

By the way @hamelg, maybe running ccextractor with strace and looking for open() calls will tell us exactly where tesseract is actually looking for the file (as opposed to what we think it's doing).

$ strace -e file ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
Opening file: 1008_20191216001500.ts
openat(AT_FDCWD, "1008_20191216001500.ts", O_RDONLY) = 3
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
openat(AT_FDCWD, "/usr/local/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tesseract-ocr/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tesseract-ocr/4.00/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
eng.traineddata not found! No Switching Possible
openat(AT_FDCWD, "sbt", O_RDWR|O_CREAT|O_TRUNC, 0600) = 4

NilsIrl commented 4 years ago

I will not have enough time to fix it today. Could you try on 0.88 to confirm that -ocrlang doesn't work there as well?

cfsmp3 commented 4 years ago

What I see (just visually inspecting the source code) is that we attempt to switch to English if we can't find the selected language:

https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ocr.c (line 169)

That function, char* probe_tessdata_location(int lang_index), expects an integer that is used as an index into an array. That's probably one of the problems to begin with.

NilsIrl commented 4 years ago

I think changing probe_tessdata_location to take a const char * and removing probe_tessdata_location_string would be a good thing.
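
A rough sketch of what a string-based probe could look like is shown below. It is not the code from the PR: the function name probe_tessdata_dir_sketch and its signature are assumptions made for illustration, and the directory list is simply copied from the strace output earlier in this thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Directories seen in the strace output above; NULL-terminated. */
const char *const default_tessdata_dirs[] = {
    "./tessdata/",
    "/usr/share/tessdata/",
    "/usr/local/share/tessdata/",
    "/usr/share/tesseract-ocr/tessdata/",
    "/usr/share/tesseract-ocr/4.00/tessdata/",
    NULL,
};

/* Hypothetical sketch: look the language up by its string name (e.g. "fra")
   instead of by an index into a language array.
   Returns the first entry of dirs that contains "<lang>.traineddata",
   or NULL if none does. Each entry is expected to end in '/'. */
const char *probe_tessdata_dir_sketch(const char *const *dirs, const char *lang)
{
    char path[1024];
    for (size_t i = 0; dirs[i] != NULL; i++) {
        int n = snprintf(path, sizeof(path), "%s%s.traineddata", dirs[i], lang);
        if (n < 0 || (size_t)n >= sizeof(path))
            continue; /* path does not fit in the buffer, skip this directory */
        FILE *f = fopen(path, "rb");
        if (f != NULL) {
            fclose(f);
            return dirs[i];
        }
    }
    return NULL;
}
```

A caller could try probe_tessdata_dir_sketch(default_tessdata_dirs, "fra") first and fall back to "eng" only when that returns NULL, which keeps the requested language and the fallback on the same code path.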

cfsmp3 commented 4 years ago

Well, get it working for everybody and then I'll be OK with your solution, whatever it is :-) As soon as you, @anshul1912 and @hamelg all agree that it's working, I'll merge (well, after testing on Windows myself).

NilsIrl commented 4 years ago

@hamelg does it work with the latest PR?

hamelg commented 4 years ago

Yes, it works fine now. I just get the wrong message at exit, "No captions were found in input.", even though it found all the subtitles. Thanks!

NilsIrl commented 4 years ago

No captions were found in input.

Okay, I will look into that.