CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0
693 stars 422 forks source link

[BUG] Segmentation Fault when converting from McPoodle raw to webVTT #1294

Closed bbgdzxng1 closed 3 years ago

bbgdzxng1 commented 3 years ago

CCExtractor version: 0.88

In raising this issue, I confirm the following:

Necessary information

Download cc sample from https://archive.org/download/cc_sample

$ curl --silent --location --request GET "https://archive.org/download/cc_sample/cc_sample.scc" -o cc_sample.scc
$ dos2unix --newfile cc_sample.scc cc_sample_dos2unix.scc

Convert scc file to McPoodle's raw format using McPoodle's SCCTOOLS' scc2raw.pl, available from http://www.theneitherworld.com/mcpoodle/SCC_TOOLS/scc2raw.pl. Carlos uses McPoodle's SCCTOOLS as part of this project.

$ perl scc2raw.pl cc_sample_dos2unix.scc cc_sample_dos2unix.bin

At this point, we have a McPoodle raw format file, which has been generated by McPoodle's own tool.

Now attempting to convert McPoodle's raw format to VTT, using ccextractor 0.88 (see Additional Information section for error message)

$ ccextractor -debug -in=raw -out=webvtt-full -utf8 cc_sample_dos2unix.bin -o cc_sample_dos2unix.vtt

Additional information

Converting from McPoodle Raw format to SRT is successful. Converting from McPoodle Raw format to VTT fails with Segmentation Fault.

Check input using report

$ ccextractor -debug -in=raw -out=report -utf8 cc_sample_dos2unix.bin
File: cc_sample_dos2unix.bin
Stream Mode: McPoodle's raw
EIA-608: No
CEA-708: No
MPEG-4 Timed Text: No

Converting to SRT is successful

$ ccextractor -debug -in=raw -out=srt -utf8 cc_sample_dos2unix.bin -o cc_sample_dos2unix.srt
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: Files (1): cc_sample_dos2unix.bin
[Extract: 1] [Stream mode: McPoodle's raw]
[Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: Yes] [Buffer input: No]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .srt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: No] [Clock frequency: 90000]
[Teletext page: Autodetect]
[Start credits text: None]
[Quantisation-mode: CCExtractor's internal function]

-----------------------------------------------------------------
Opening file: cc_sample_dos2unix.bin
Analyzing data in McPoodle raw mode

Total frames time:    00:00:00:000  (0 frames at 29.97fps)

Min PTS:                00:00:00:001
Max PTS:                00:01:09:070
Length:              00:01:09:069
Done, processing time = 0 seconds
Issues? Open a ticket here
https://github.com/CCExtractor/ccextractor/issues
$ head -n20 cc_sample_dos2unix.srt 
1
00:00:09,243 --> 00:00:12,378
                  Is that what  
                  the Americans 
                  call doodling?

2
00:00:12,413 --> 00:00:13,278
It is more serious              

3
00:00:13,314 --> 00:00:16,081
than you could                  
possibly realize                
Charlotte                       

4
00:00:20,221 --> 00:00:21,020
Good

Converting to webVTT causes Segmentation Fault 11

$ ccextractor -debug -in=raw -out=webvtt-full -utf8 cc_sample_dos2unix.bin -o cc_sample_dos2unix.vtt
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: Files (1): cc_sample_dos2unix.bin
[Extract: 1] [Stream mode: McPoodle's raw]
[Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: Yes] [Buffer input: No]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .vtt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: No] [Clock frequency: 90000]
[Teletext page: Autodetect]
[Start credits text: None]
[Quantisation-mode: CCExtractor's internal function]

-----------------------------------------------------------------
Opening file: cc_sample_dos2unix.bin
Analyzing data in McPoodle raw mode
Segmentation fault: 11

Full commands for replication and sample file are included in this ticket.

NilsIrl commented 3 years ago

The problem is a null pointer de-reference. timing is a null pointer so trying to get sync_pts2fts_set crashes.

https://github.com/CCExtractor/ccextractor/blob/b1c22e503412f455d3740d0ce5fe9cabc66fe701/src/lib_ccx/ccx_encoders_webvtt.c#L210

NilsIrl commented 3 years ago

If that timing is meant to be null and it isn't a problem with the parser, should we just set the values to 0? That seems to be what is recommended according to https://sdks.support.brightcove.com/features/synchronizing-webvtt-captions.html

siv2r commented 3 years ago

If that timing is meant to be null and it isn't a problem with the parser, should we just set the values to 0? That seems to be what is recommended according to https://sdks.support.brightcove.com/features/synchronizing-webvtt-captions.html

Here, by "values," do you mean the variables context->timing->sync_pts2fts_pts, h1, m1, s1, ms1 (in the code below) should be set to 0? This makes sense according to the recommended settings.

https://github.com/CCExtractor/ccextractor/blob/cf84757e02889ffa0231624fdf941eafded59906/src/lib_ccx/ccx_encoders_webvtt.c#L220-L222

If you don't mind, can I try to look into whether context->timing is supposed to be null or not?

NilsIrl commented 3 years ago

If that timing is meant to be null and it isn't a problem with the parser, should we just set the values to 0? That seems to be what is recommended according to https://sdks.support.brightcove.com/features/synchronizing-webvtt-captions.html

Here, by "values," do you mean the variables context->timing->sync_pts2fts_pts, h1, m1, s1, ms1 (in the code below) should be set to 0? This makes sense according to the recommended settings.

No, I meant we should write the header with as values 0 (instead of what would have been in sync_pts2fts_fts).

siv2r commented 3 years ago

No, I meant we should write the header with as values 0 (instead of what would have been in sync_pts2fts_fts).

I wanted to say the same. I should not have used said "set" variables. Shouldn't we also need to write 0's (in the header) instead of h1, m1, s1, ms1? Since they are dependent on sync_pts2fts_fts (due to the function below) https://github.com/CCExtractor/ccextractor/blob/cf84757e02889ffa0231624fdf941eafded59906/src/lib_ccx/ccx_encoders_webvtt.c#L215

siv2r commented 3 years ago

I compared the webvtt encoding for cc_sample_dos2unix.bin (given in this thread) and UK_ITVBE.ts (provided on ccextractor's tv samples page) and found something really weird.

https://github.com/CCExtractor/ccextractor/blob/cf84757e02889ffa0231624fdf941eafded59906/src/lib_ccx/ccx_encoders_common.c#L935-L939 The problem occurs in the initialization of the encoder context. The malloc function acts differently for UK_ITVBE.ts (here, ctx->timing is not null) and cc_sample_dos2unix.bin (here, ctx->timing is set to null)

I added a fix by allocating memory for ctx->timing but would love to discuss more why this is happening :)

NilsIrl commented 3 years ago

ctx->timing is not null because timing information is included in the file. I assume this is not included in McPoodle raw (which just contains closed captions), or not in this specific mcpoodle raw file or ccextractor is not able to parse it correctly.

siv2r commented 3 years ago

The malloc function acts differently for UK_ITVBE.ts (here, ctx->timing is not null) and cc_sample_dos2unix.bin (here, ctx->timing is set to null)

By this, I meant that the value of ctx->timing just after the initialization of ctx (using malloc) was different in both cases. Shouldn't the ctx->timing value be the same in both cases since this is a simple initialization? (later on, it may change depending on the input)

I might have made a mistake somewhere (while debugging in vs code). I will check this again.

I assume this is not included in McPoodle raw (which just contains closed captions), or not in this specific mcpoodle raw file or ccextractor is not able to parse it correctly.

Will look into this.

bbgdzxng1 commented 3 years ago

I really appreciate you folks taking the time to look at this. I still plan on using McPoodle > ccextractor as a means of converting SCC > McP Raw > webVTT and I am very grateful that you have even looked, whatever the outcome. Thanks!

siv2r commented 3 years ago

I really appreciate you folks taking the time to look at this. I still plan on using McPoodle > ccextractor as a means of converting SCC > McP Raw > webVTT and I am very grateful that you have even looked, whatever the outcome. Thanks!

Happy to help. Hope to fix this soon!

bbgdzxng1 commented 3 years ago

Many thanks for trying to fix. I see that PR https://github.com/CCExtractor/ccextractor/pull/1298 has been closed. Just wanted to say thanks for your efforts so far and although the PR itself was not suitable for prod, I do still hope to use ccextractor to convert SCC > McP Raw > WebVTT. I'm currently going via SCC > McP Raw > SRT > WebVTT, but a conversion to WebVTT is hopefully a legitimate use case (for all the benefits of preserving markup, data integrity, character sets etc).

Are there any roadmap plans to support SCC as an input format? I know that ccextractor is a caption extractor rather than a caption convertor, but given the similarities between 608 and SCC, maybe that is a better way to solve for this use case, rather than receiving tickets about a proprietary and legacy McPoodle input format, which I appreciate is an edge case. If this is indeed a possibility, let me know and I would happily raise a feature request for SCC input, rather than a bug report for McPoodle.

There are very few good SCC convertors out there, and ccextractor is the best I have found for 608.

SuvigyaJain1 commented 3 years ago

Is this issue still open? And is someone working on it?

canihavesomecoffee commented 3 years ago

Yes, still open, and no one's working on it AFAIK.

SuvigyaJain1 commented 3 years ago

So I had a look through the file and I think the source of the problem is that the scc file doesn't have the information required to build the X-TIMESTAMP-MAP=MPEGTS header?. Thus adding a check for context->timing != NULL before accessing context->timing->sync_pts2fts in (lib_ccx/ccx_encoders_webvtt.c; line 210) can avoid the segmentation fault in such cases. As far as I know its not a problem for a .vtt file to not contain the synchronisation offset in the header. The default is assumed to be 0 if none is present.

I tried running the provided files and am able to get the .vtt file without problem I have attached a screenshot of the same Screenshot_20210427_112654

SuvigyaJain1 commented 3 years ago

I will admit I'm baffled how my PR could break so many tests. How adding 1 line in webvtt encoder file breaks everything else as well, is confusing me and after going through the tests and downloading the files locally and running them they seem to be working. Am I missing something here? Note: This is my first contribution to this repo

canihavesomecoffee commented 3 years ago

I will admit I'm baffled how my PR could break so many tests. How adding 1 line in webvtt encoder file breaks everything else as well, is confusing me and after going through the tests and downloading the files locally and running them they seem to be working. Am I missing something here? Note: This is my first contribution to this repo

The sample platform isn't bug-free either ;)

I've looked at the actual results page, and there there's fewer tests broken :)

Since it's a Windows report, the failing test for the dictionary (using a unix path) is expected. The failing test in Broken category can be ignored. The 3 failing ones in DVB worry me a bit more, but might be present in the main repo already, and the same goes for the Teletext one.

That's about it... So I think it's safe to say you can reopen that PR.

SuvigyaJain1 commented 3 years ago

I see! Thanks. makes more sense. I have reopened the PR as suggested. Ill look at the DVB tests in a bit more detail just to be on the safer side.

bbgdzxng1 commented 3 years ago

Confirming that this is resolved in 0.89. Thank-you...

$ curl --silent --location --request GET "https://archive.org/download/cc_sample/cc_sample.scc" -o cc_sample.scc
$ dos2unix --newfile cc_sample.scc cc_sample_dos2unix.scc
$ perl /opt/mcpoodle/scc2raw.pl cc_sample_dos2unix.scc cc_sample_dos2unix.bin
$ ccextractor -debug -in=raw -out=webvtt-full -utf8 cc_sample_dos2unix.bin -o cc_sample_dos2unix.vtt
$ head -n10 "cc_sample_dos2unix.vtt"
WEBVTT

STYLE

/* default values */
::cue {
  line-height: 5.33vh;
  font-size: 4.1vh;
  font-family: monospace;
$ tail -n10 "cc_sample_dos2unix.vtt"

00:01:03.531 --> 00:01:04.430 line:84.66%
Yes                             

00:01:04.465 --> 00:01:06.298 line:79.33%
              I pray I never    

00:01:04.465 --> 00:01:06.298 line:84.66%
              have to deliver it

Thank-you to all who contributed to the investigation, development, testing and release.

ccextractor can now successfully convert scc-to-srt, scc-to-vtt (and probably other outputs) with the interim step of going via McPoodle raw format.