Open claunia opened 8 years ago
@claunia we're going to spend a bit of time on this. What's the current status? (with the last CCExtractor I mean)
Hello, I can't seem to find the test files repository for this one!
GSOC qualification: This issue gives 2 points.
The zip files contain a total of 4 video files:
Star Wars Rebels_Disney Channel_2014-12-12_22-24.ts: The video contains teletext subtitles.
The output is generally good, except that 2 lines are missing.
This is caused by fuzzy_memcmp in telxcc.c:809, which seems to discard the previous line if the current line has similar content.
EDIT: with -nolevdist, the missing lines are now output.
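To illustrate how a fuzzy comparison can swallow a legitimate line, here is a minimal sketch in Python. It is not the code in telxcc.c: the function names, the 10% threshold, and the "newer line replaces the older one" policy are assumptions made only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_near_duplicate(previous: str, current: str, max_percent: int = 10) -> bool:
    """Hypothetical rule: two lines count as 'the same' when their edit
    distance is within max_percent of the shorter line's length."""
    shorter = min(len(previous), len(current))
    allowed = shorter * max_percent // 100
    return levenshtein(previous, current) <= allowed


def filter_lines(lines, use_levdist=True):
    """Replace the previous line when the current one is (nearly) identical.

    With use_levdist=False only exact repeats are merged, which mirrors the
    idea behind disabling the fuzzy comparison.
    """
    kept = []
    for line in lines:
        same = bool(kept) and (
            kept[-1] == line
            or (use_levdist and is_near_duplicate(kept[-1], line))
        )
        if same:
            kept[-1] = line  # assumed policy: the newer line wins
        else:
            kept.append(line)
    return kept
```

With a rule like this, two genuinely different subtitles that differ by one or two characters collapse into one, which would match the symptom of lines going missing until the fuzzy comparison is switched off.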
In addition, I find that -out=spupng doesn't work with teletext. I don't know if that is expected. It crashes because of a bug in ccx_encoders_spupng.c:14. After fixing it (Patch #864), it generates .png files with a size of 0 bytes (i.e. empty).
Star Wars Rebels_Disney Channel_2014-12-12_22-24_cortado.ts:
It has a teletext subtitle stream but neither VLC nor Potplayer can display any subtitle. CCExtractor can't extract anything from it.
I think this may be because the video itself doesn't actually contain any subtitles.
Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts
It contains DVB subtitles, but CCExtractor isn't able to extract anything from it.
-out=spupng doesn't work either.
The cause is that the stream doesn't send DVBSUB_DISPLAY_SEGMENT. Although this case is considered in the code, it is poorly handled. Patch: #866
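As a rough illustration of the missing-segment case, the sketch below walks the segments of a DVB subtitle payload and emits whatever was collected when no end-of-display-set segment arrives. It is not what patch #866 does. The 6-byte segment header layout and the 0x80 type value follow my reading of the DVB subtitling spec; the function names and the "flush at end of payload" fallback are assumptions for the example.

```python
import struct

SYNC_BYTE = 0x0F
END_OF_DISPLAY_SET = 0x80  # the segment type the thread calls DVBSUB_DISPLAY_SEGMENT


def make_segment(seg_type: int, page_id: int, data: bytes) -> bytes:
    """Build one segment: sync byte, type, page id, length, then the data."""
    return struct.pack(">BBHH", SYNC_BYTE, seg_type, page_id, len(data)) + data


def iter_segments(payload: bytes):
    """Yield (segment_type, page_id, data) for each complete segment."""
    pos = 0
    while pos + 6 <= len(payload) and payload[pos] == SYNC_BYTE:
        seg_type, page_id, length = struct.unpack_from(">BHH", payload, pos + 1)
        start = pos + 6
        end = start + length
        if end > len(payload):
            break  # truncated segment, stop parsing this payload
        yield seg_type, page_id, payload[start:end]
        pos = end


def collect_display_sets(payloads):
    """Group segments into display sets, one list per set.

    Assumed fallback: if a payload ends without an end-of-display-set
    segment, the pending segments are still emitted instead of dropped.
    """
    display_sets = []
    for payload in payloads:
        pending = []
        for seg_type, _page_id, data in iter_segments(payload):
            if seg_type == END_OF_DISPLAY_SET:
                display_sets.append(pending)
                pending = []
            else:
                pending.append((seg_type, data))
        if pending:  # no end marker was sent for this payload
            display_sets.append(pending)
    return display_sets
```

A decoder that only renders on the end marker would produce nothing at all for a stream like the Cine Clan one, which is consistent with the empty output described above.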
Cine Clan TVE Perez, el ratoncito de tus sueños 2_cortado.ts
Same problem as "Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts"
While debugging, I also discovered a heap corruption problem caused by add_ocrtext2str (Patch: #865).
@harrynull use -nolevdist if you want fuzzy_memcmp to behave like memcmp
First one (teletext) works fine. However the 2nd one shows a bunch of messages:
In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.
Takes forever, too.
This happens because some of the images are completely empty and invalid for some reason. But it should not affect the output file.
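One way to avoid sending such images to OCR at all would be a cheap pre-check. The sketch below is only an illustration of that idea: the bitmap representation (a flat list of palette indices plus a per-entry alpha table) and the function name are assumptions, not CCExtractor's actual data structures.

```python
def is_blank_bitmap(pixels, width, height, alpha):
    """Return True when a paletted bitmap has nothing visible to OCR.

    pixels: palette indices in row-major order (assumed layout)
    alpha:  alpha value per palette entry, 0 meaning fully transparent
    """
    if width <= 0 or height <= 0 or len(pixels) < width * height:
        return True  # invalid dimensions or truncated data
    visible = (
        p < len(alpha) and alpha[p] > 0  # out-of-range indices count as invisible
        for p in pixels[: width * height]
    )
    return not any(visible)
```

A caller could skip the OCR step quietly when this returns True, instead of logging a failure for every empty image.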
@harrynull It does, check this out:
670
01:00:15,877 --> 01:00:20,676
Enos oi onimnro nno dnonio otnpnnio pnno oroannthonio.
671
01:00:20,677 --> 01:00:23,116
TI‘QMG. Monono. oi no“n
672
01:00:23,117 --> 01:00:27,756
sono wondido o onion nnos olrozoo non on.
That's total gibberish :-) There's definitely a correlation between those errors and the incorrect lines. It's definitely better than before, and there's lots of good output - but still not perfect.
@cfsmp3 It works well here:
670
01:00:15,877 --> 01:00:20,676
<font color="#00c8c6">Erao ol prlmuro ono onorla</font>
<font color="#00c8c6">atoporlo pora prognnflarlo.</font>
671
01:00:20,677 --> 01:00:23,116
<font color="#00c8c6">Tardo.</font>
<font color="#00c8c6">Mahana. ol ratOn</font>
672
01:00:23,117 --> 01:00:27,756
<font color="#00c8c6">oora vondldo a onlon</font>
<font color="#00c8c6">mao ofruzoa por or.</font>
Did you forget to put spa.traineddata in the right place?
But I did find that the <font> tag sometimes isn't closed:
24
00:03:54,997 --> 00:03:57,836
<font color="#c7c800">¢Como fue Ia fiesta?</font>
<font color="#c7c800"></font><font color="#d6d6d6">-Estuvimos esperandole.
In addition, some subtitles are skipped and missing. I am not sure if it is a limitation of Tesseract, but I will check them later.
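For the unclosed tag in subtitle 24, a quick check like the following can flag or repair such lines when inspecting the .srt output. This is a hypothetical post-processing helper for illustration only, not part of CCExtractor, and it only handles font tags.

```python
import re

# Matches both opening (<font ...>) and closing (</font>) tags.
FONT_TAG = re.compile(r"<(/?)font\b[^>]*>", re.IGNORECASE)


def close_open_font_tags(line: str) -> str:
    """Append a </font> for every <font ...> left open on this line."""
    depth = 0
    for match in FONT_TAG.finditer(line):
        if match.group(1):             # closing tag
            depth = max(depth - 1, 0)  # ignore stray closers
        else:                          # opening tag
            depth += 1
    return line + "</font>" * depth
```

Lines that are already balanced come back unchanged, so the helper can be run over the whole file.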
That stuff in 670, 671 and 672 is not Spanish, believe me :-) (or I suspect, any other language)
Status update: Still broken. Possibly differently. The file that matters is Cine Clan TVE *.ts (ignore the Disney one).
We get lots of these messages:
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
and a bonus:
Direct leak of 216 byte(s) in 3 object(s) allocated from:
#0 0x7f77522bf90f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
#1 0x556c85248761 in dvbsub_init_decoder ../src/lib_ccx/dvb_subtitle_decoder.c:424
#2 0x556c8529ee4d in parse_PMT ../src/lib_ccx/ts_tables.c:346
#3 0x556c85272f9e in ts_readstream ../src/lib_ccx/ts_functions.c:752
#4 0x556c85275167 in ts_get_more_data ../src/lib_ccx/ts_functions.c:980
#5 0x556c852a9a9f in general_loop ../src/lib_ccx/general_loop.c:1051
#6 0x556c851a7986 in api_start ../src/ccextractor.c:205
#7 0x556c851a9cdb in main ../src/ccextractor.c:463
#8 0x7f775162350f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
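The boxClipToRectangle and pixClipRectangle messages above suggest that a crop box lying partly or fully outside the image is being passed to the image library. A minimal sketch of validating the box first is shown below; the (x, y, width, height) tuple layout and the function name are assumptions for the example, not the decoder's real types.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), assumed layout


def clip_box_to_image(box: Box, img_w: int, img_h: int) -> Optional[Box]:
    """Intersect a crop box with the image rectangle.

    Returns the clipped box, or None when the box does not overlap the
    image at all (in which case the caller should skip cropping and OCR).
    """
    x, y, w, h = box
    x0 = max(x, 0)
    y0 = max(y, 0)
    x1 = min(x + w, img_w)
    y1 = min(y + h, img_h)
    if x1 <= x0 or y1 <= y0:
        return None  # empty intersection
    return (x0, y0, x1 - x0, y1 - y0)
```

Skipping the crop when this returns None would also avoid the follow-on "pixs not defined" error, since no undefined image would be handed to the grayscale conversion.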
On some files, subtitles appear empty even though the program was subtitled, or corrupt, containing garbage characters.
I tried recordings from Imagenio and from DVB-T in Spain; it happens in all tested broadcasts.
Test files have been put on /repository/Natalia
Regards