Closed hamelg closed 4 years ago
Can you please give me some video samples regarding this issue?
The link is valid for 30 days. http://dl.free.fr/k2j8OpZJF
Only -dvblang is relevant to the problem.
The fix doesn't work. Now, the -ocrlang option has no effect ...
$ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
eng.traineddata not found! No Switching Possible
...
It doesn't select the fra.traineddata file despite "-ocrlang fra".
Tested and it worked.
It seems ccextractor is unable to find the OCR data. You can set the TESSDATA_PREFIX environment variable to point it at another location.
For example, here is the command I run:
$ TESSDATA_PREFIX=/nix/store/9yawzjj82bib4dr9x7y340w10c3k319y-tesseract-3.05.00/share/ ./ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra files/1008_20191216001500.ts
I tested again, but it definitely doesn't work. The file is in the right place and TESSDATA_PREFIX makes no difference.
$ ls -l /usr/share/tessdata/fra.traineddata
-rw-r--r-- 1 root root 14213351 Nov 11 2018 /usr/share/tessdata/fra.traineddata
$ TESSDATA_PREFIX=/usr/share/tessdata/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
eng.traineddata not found! No Switching Possible
...
No captions were found in input.
It does seem that something is broken, as there is no reason ccextractor shouldn't be able to find the file by itself (in /usr/share/tessdata/).
But anyway, this isn't supposed to work. TESSDATA_PREFIX should be set to the directory above tessdata.
Try it like this:
$ TESSDATA_PREFIX=/usr/share/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
Why is it trying to read eng.traineddata if you specified fra? That's definitely broken...
@NilsIrl try deleting your file eng.traineddata (or rename it to fra.traineddata) and see if it still works for you.
I've tested that ccextractor is using fra.traineddata, but let me check again.
try like this:
$ TESSDATA_PREFIX=/usr/share/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
Ditto, same result.
It seems to have been broken before.
Well, since both of you, @NilsIrl and @hamelg, are around right now, it seems like this can be solved once and for all really quickly.
By the way, @hamelg, maybe running ccextractor with strace and looking for open() calls will tell us exactly where tesseract is actually looking for the file (as opposed to what we think it's doing).
280b4308f7f7ff769fd9c3fe2b03a7259644bfdb is broken for me as well. (last PR before CGI and v0.88)
By the way, @hamelg, maybe running ccextractor with strace and looking for open() calls will tell us exactly where tesseract is actually looking for the file (as opposed to what we think it's doing).
$ strace -e file ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
Opening file: 1008_20191216001500.ts
openat(AT_FDCWD, "1008_20191216001500.ts", O_RDONLY) = 3
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
openat(AT_FDCWD, "/usr/local/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tesseract-ocr/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tesseract-ocr/4.00/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
eng.traineddata not found! No Switching Possible
openat(AT_FDCWD, "sbt", O_RDWR|O_CREAT|O_TRUNC, 0600) = 4
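The trace shows the candidate tessdata directories being opened one after another, and /usr/share/tessdata/ opens successfully, yet the message still names eng.traineddata. A minimal sketch in C of this kind of ordered probe is given here for illustration only. It is not ccextractor's actual code: find_traineddata_dir and default_tessdata_dirs are hypothetical names, the directory list is simply copied from the trace, and the sketch checks for the file with fopen() rather than listing each directory as the trace does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Candidate directories, in the order they appear in the strace output.
 * Hypothetical list for illustration; a real build may use other paths. */
const char *const default_tessdata_dirs[] = {
    "./tessdata/",
    "/usr/share/tessdata/",
    "/usr/local/share/tessdata/",
    "/usr/share/tesseract-ocr/tessdata/",
    "/usr/share/tesseract-ocr/4.00/tessdata/",
};
const size_t default_tessdata_dir_count =
    sizeof default_tessdata_dirs / sizeof default_tessdata_dirs[0];

/* Hypothetical helper: return the first directory in dirs that contains
 * <lang>.traineddata, or NULL if none does. The key point is that the
 * file name is built from the requested language, not a fixed "eng". */
const char *find_traineddata_dir(const char *lang,
                                 const char *const *dirs, size_t n)
{
    char path[4096];

    assert(lang != NULL && dirs != NULL);
    for (size_t i = 0; i < n; i++) {
        int len = snprintf(path, sizeof path, "%s%s.traineddata",
                           dirs[i], lang);
        if (len < 0 || (size_t)len >= sizeof path)
            continue; /* path would be truncated, skip this candidate */
        FILE *f = fopen(path, "rb");
        if (f != NULL) {
            fclose(f);
            return dirs[i];
        }
    }
    return NULL;
}
```

With the setup from this thread, find_traineddata_dir("fra", default_tessdata_dirs, default_tessdata_dir_count) would be expected to return "/usr/share/tessdata/", since fra.traineddata is present there.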
I will not have enough time to fix it today. Could you try on 0.88 to confirm that -ocrlang doesn't work there as well?
What I see (just visually inspecting the source code) is that we attempt to switch to English if we can't find the selected language:
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ocr.c (line 169)
That function, char* probe_tessdata_location(int lang_index), expects an integer that is used as an index into an array... that's probably one of the problems to begin with.
Changing probe_tessdata_location to take a const char * and removing probe_tessdata_location_string is a good thing, I think.
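To make the intended behaviour concrete, here is a minimal sketch in C of language selection keyed on the language string, with the English fallback described above. It is an illustration under assumptions, not the code from the PR: choose_ocr_lang, has_traineddata_fn and only_fra_installed are hypothetical names, and the availability check is passed in as a callback so the selection logic can be read on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate type: returns nonzero if <lang>.traineddata
 * can be found (for example by probing the tessdata directories). */
typedef int (*has_traineddata_fn)(const char *lang);

/* Hypothetical selection logic: prefer the requested language, fall
 * back to "eng" only if the requested one is unavailable, and return
 * NULL if neither can be used. */
const char *choose_ocr_lang(const char *requested,
                            has_traineddata_fn has_data)
{
    assert(has_data != NULL);
    if (requested != NULL && has_data(requested))
        return requested;
    if (has_data("eng"))
        return "eng";
    return NULL;
}

/* Example predicate modelling the reporter's machine, where only
 * fra.traineddata is installed. */
int only_fra_installed(const char *lang)
{
    return strcmp(lang, "fra") == 0;
}
```

With only_fra_installed as the predicate, a request for "fra" returns "fra" and never consults eng.traineddata, which is the behaviour -ocrlang fra is expected to have.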
Well, get it working for everybody and then I'll be OK with your solution, whatever it is :-) As soon as you, @anshul1912 and @hamelg all agree that it's working, I'll merge (well, after testing on Windows myself).
@hamelg with the latest PR does it work?
Yes, it works fine now. I just get a wrong message at exit, "No captions were found in input.", even though it found all the subtitles. Thanks!
No captions were found in input.
Okay I will look into that
CCExtractor detailed version info
Version: 0.88
Git commit: bc3d729e30a751feb9b854a54c085f0e81a99134
Compilation date: 2019-12-25
File SHA256: Could not open file
Libraries used by CCExtractor
Tesseract Version: 4.1.1
Leptonica Version: leptonica-1.78.0
libGPAC Version: 0.7.2-DEV
zlib: 1.2.11
utf8proc Version: 2.2.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.35
FreeType
libhash
nuklear
libzvbi
Issue description
Some French DVB channels don't use ISO 639-2 to specify the language for the subtitle stream. Here is an example:
On the subtitle streams, the language code should be "fra", and not "fre".
The following command fails to find the subtitle stream:
It fails because the code "fre" doesn't exist in the language array (see lib_ccx/ccx_common_constants.c).
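One possible way to handle this is to normalise the code before looking it up in the language array. For reference, ISO 639-2 lists both forms: fre is the bibliographic (B) code and fra the terminologic (T) code for French, so a table keyed on only one of them will miss streams tagged with the other. The sketch below is hypothetical and is not taken from lib_ccx: normalize_lang_code and the alias table are illustrative names, and the table covers only a handful of B/T pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A few ISO 639-2 languages that have distinct bibliographic (B) and
 * terminologic (T) codes. Illustrative subset, not a complete list. */
struct lang_alias {
    const char *bibliographic;
    const char *terminologic;
};

const struct lang_alias iso639_2_aliases[] = {
    {"fre", "fra"}, /* French */
    {"ger", "deu"}, /* German */
    {"dut", "nld"}, /* Dutch  */
    {"cze", "ces"}, /* Czech  */
    {"gre", "ell"}, /* Greek  */
};

/* Hypothetical helper: map a B code to its T equivalent so that
 * "-dvblang fre" and "-dvblang fra" resolve to the same entry.
 * Codes without an alias are returned unchanged. */
const char *normalize_lang_code(const char *code)
{
    size_t n = sizeof iso639_2_aliases / sizeof iso639_2_aliases[0];

    assert(code != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(code, iso639_2_aliases[i].bibliographic) == 0)
            return iso639_2_aliases[i].terminologic;
    }
    return code;
}
```

A lookup would then call normalize_lang_code() on the user-supplied code, and on the code read from the stream, before comparing against the language array.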