CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

@ in teletext-subs saved as asterisk #249

Closed hurda closed 8 years ago

hurda commented 8 years ago

CCExtractor 0.77 and git-677fee4 File: http://www.mediafire.com/download/q6ebvmzwe1prvi3/cce-at-sign.7z (25MB)

Teletext: at sign

SRT:

2
00:00:12,840 --> 00:00:16,520
ORF 2015
untertitel*orf.at

hurda commented 8 years ago

Addendum: Also affects other output-formats, like SAMI and TTXT.

anshul1912 commented 8 years ago

When I try to download the above file, it shows as deleted.

cfsmp3 commented 8 years ago

Please reopen when a working link is available.

hurda commented 8 years ago

http://www.mediafire.com/download/q6ebvmzwe1prvi3/cce-at-sign.7z

cfsmp3 commented 8 years ago

I'm looking into this. That * is written to the buffer here:

ctx->page_buffer.text[y][i] = telx_to_ucs2(packet->data[i]);

packet->data[i] contains 42 (0x2a) which is indeed an asterisk http://www.columbia.edu/kermit/ucs2.html

Dhrumil2910 commented 8 years ago

I'm not able to download the file. Can you please re-upload it?

cfsmp3 commented 8 years ago

It seems to work fine for me... is there an error message when trying to download the file?


Dhrumil2910 commented 8 years ago

That's because the college proxy server is denying the download.

cfsmp3 commented 8 years ago

OK, I've uploaded it to slack. Maybe it can be downloaded from there?


Dhrumil2910 commented 8 years ago

OK, thanks a lot for your support.

isacdaavid commented 8 years ago

I'm trying to track that 0x2a byte back through the pipeline to find where things went wrong, if anywhere, but I need to know what the correct value would be.

For this test video, telx_to_ucs2() converts 0x40 to '§' rather than '@' because of the national-language (German) substitutions in the basic character set specified in ETS 300 706. Is that OK?
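To make the substitution concrete, here is a minimal sketch of a national option lookup. It is not the actual telx_to_ucs2() code: the function name `g0_to_ucs2` and the `national_subset` enum are hypothetical, and only position 0x40 is modelled (every other code is passed through as if it were ASCII, which is a simplification).

```c
#include <stdint.h>

/* Hypothetical language tags; these are not identifiers from ccextractor. */
enum national_subset { SUBSET_ENGLISH, SUBSET_GERMAN };

/* Map one teletext character code to UCS-2.
 * Only the national option position 0x40 is modelled; all other codes
 * are returned unchanged, which is a simplification for illustration. */
static uint16_t g0_to_ucs2(uint8_t code, enum national_subset subset)
{
    code &= 0x7f; /* drop the parity bit */
    if (code == 0x40) {
        /* Position 4/0 is language dependent: '§' in the German subset,
         * '@' in the English subset. */
        return subset == SUBSET_GERMAN ? 0x00a7 : 0x0040;
    }
    return code;
}
```

With this lookup, 0x40 in a German page yields U+00A7 ('§'), and 0x2A stays an asterisk in either subset, which matches what the thread observes.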

abhishek-vinjamoori commented 8 years ago

@isacdaavid, ideally telx_to_ucs2() should receive an input of 64 to return "@". But with the current decoding there is no possible input for which the output is "@".

isacdaavid commented 8 years ago

I think this bug is invalid after all. The asterisk is really there at offset 0xDBF164 in the file (the value is 0x54, which is 0x2A with its bit order reversed), and OP's software is responsible for outputting the at sign.

I followed that particular 0x2A back to tlt_process_pes_packet(), where the bit order is reversed (the raw byte is 0x54). After failing to find another transformation through several function calls and buffers all the way back to get_cinfo(), I became suspicious that ccextractor had been doing the right thing all along. Sure enough, I searched the binary file for "a7 37 54 f7" after printing the adjacent values according to ccextractor, and after changing that "54" to "02" (bit-reversed 0x40) the '§' appeared in the .srt output, as expected from my previous post.
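The byte values quoted above can be checked with a small bit-reversal helper. This is only a sketch for verifying the arithmetic; `reverse_bits8` is a hypothetical name and not a function from ccextractor.

```c
#include <stdint.h>

/* Reverse the bit order of one byte: bit 0 becomes bit 7, and so on. */
static uint8_t reverse_bits8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1)); /* shift result, append lowest bit of b */
        b >>= 1;
    }
    return r;
}
```

Applied to the searched pattern "a7 37 54 f7" and with the top (parity) bit masked off, the four bytes come out as 'e', 'l', '*', 'o', i.e. the "el*o" in "untertitel*orf.at". Likewise, 0x02 reverses to 0x40.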

I can provide the hex-edited video and my debugging patch/pull request. If you like, I could also implement a change to telx_to_ucs2() that would output an at sign when it finds an asterisk, but I guess you don't want to introduce such behaviour. I'm interested in proving that I know some git and can make useful changes to your codebase, but I fear that if this bug gets closed without needing a patch then I will not have earned the points for my GSoC application :(

cfsmp3 commented 8 years ago

The issue is with supplementary charsets.

A good starting point to research this is to google "supplementary charsets teletext". There are other teletext applications around; surely some of them get this right, and we can learn from them.

This is not about generically replacing one char with another (obviously that might work for this specific file but would break many others) but rather about completing the supplementary charset implementation.

A good thing is that the teletext specifications are public and totally free, so this also serves as an introduction to standards documents :-)

Notes to GSoC applicants


hurda commented 8 years ago

A good starting point to research this is to google "supplementary charsets teletext". There are other teletext applications around; surely some of them get this right, and we can learn from them.

Extracting the subtitles using ProjectX 0.91.0.10 portable also outputs "@". Maybe it helps.

EDIT: http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/subtitle/Teletext.java?view=annotate http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/subtitle/CharSet.java?view=markup

http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/parser/StreamProcessTeletext.java?view=annotate http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/parser/StreamProcessSubpicture.java?view=annotate

EDIT2: VLC (2.2.2) shows @ too. http://git.videolan.org/?p=vlc.git;a=blob;f=modules/codec/telx.c;h=4f8842a95f4a94cb326d3e48234014852f04c235;hb=HEAD

To check this, you'll have to use this file: http://www.mediafire.com/download/7fwbqdw57sxykby/at-sign_teletext_pcr-pts.ts The other has a difference between the PCR-timestamps of A/V and Teletext of almost three hours, which VLC apparently can't handle.

EDIT3:

and after changing that "54" to "02" (reversed 0x40) the '§' appeared in the .srt output as expected from my previous post.

While this changes what is being output by ccextractor, DVBViewer, ProjectX and VLC are still showing @. How's that possible?

isacdaavid commented 8 years ago

Quick update: I couldn't find the @ in any of the supplementary (AKA G2) character sets. I still need to find more information on the second G0 sets (mentioned in section 15.3 in the ETS 300 706) and modified G0 and G2 sets (part of Teletext level 2.5 and 3.5, mentioned in section 15.4 in the standard).

@hurda Thanks. I will definitely see what other projects are doing if I fail to find a satisfactory explanation in those extra character sets.

EDIT: I found it. This weird behaviour seems to have been introduced in ETS 300 706 version 1.2.1 from 2003 as a marginal note in section 15.6.1 about the basic G0 Latin character set. Quoting from it:

NOTE 3: The @ symbol replaces the * symbol at position 2/A when the table is accessed via a packet X/26 Column Address triplet with Mode Description = 10 000 and Data = 0101010. See clause 12.2.4.

Time to implement it!
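As a rough sketch of what NOTE 3 asks for, the check below splits a decoded X/26 triplet into its fields and reports when it requests an '@'. Several things here are assumptions rather than facts from the thread: the packing of the 18-bit triplet (address in bits 0-5, mode in bits 6-10, data in bits 11-17), the names `x26_triplet`, `split_triplet` and `triplet_requests_at_sign`, and the data value, which is taken literally from the note (0101010). Which data value real broadcasts carry is not confirmed here, so treat the constant as a placeholder.

```c
#include <stdint.h>

#define X26_MODE_G0_NO_DIACRITIC 0x10 /* Mode Description = 10 000 */
#define X26_DATA_AT_SIGN         0x2a /* Data = 0101010, taken literally from NOTE 3 */

/* Fields of an 18-bit triplet after Hamming 24/18 decoding (assumed layout). */
struct x26_triplet {
    uint8_t address; /* bits 0-5  */
    uint8_t mode;    /* bits 6-10 */
    uint8_t data;    /* bits 11-17 */
};

static struct x26_triplet split_triplet(uint32_t triplet)
{
    struct x26_triplet t;
    t.address = (uint8_t)(triplet & 0x3f);
    t.mode    = (uint8_t)((triplet >> 6) & 0x1f);
    t.data    = (uint8_t)((triplet >> 11) & 0x7f);
    return t;
}

/* Returns 1 and stores the target column when the triplet is a Column
 * Address triplet requesting '@'; returns 0 otherwise. */
static int triplet_requests_at_sign(uint32_t triplet, uint8_t *column)
{
    struct x26_triplet t = split_triplet(triplet);

    if (t.address >= 40)
        return 0; /* addresses 40-63 select a row, not a column */
    if (t.mode != X26_MODE_G0_NO_DIACRITIC || t.data != X26_DATA_AT_SIGN)
        return 0;

    *column = t.address;
    return 1;
}
```

A caller would then write 0x0040 into the page buffer at the current X/26 row and the returned column, instead of blindly replacing every asterisk on the page.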

cfsmp3 commented 8 years ago

It's in table 36 (Latin National Option subset), in the English row. Page 115 of ETS 300 706: May 1997


hurda commented 8 years ago

This weird behaviour seems to have been introduced in ETS 300 706 version 1.2.1 from 2003 as a marginal note in section 15.6.1 about the basic G0 Latin character set.

Good catch! It doesn't help that the first search engine results for "ets 300 706" are for the 1997 version of the spec. telxcc.c references the 1997 spec, but not the 2003 one. Here's the link: http://www.etsi.org/deliver/etsi_en/300700_300799/300706/01.02.01_60/en_300706v010201p.pdf

PS: It's actually clause 12.3.4, at the bottom of table 29.

cfsmp3 commented 8 years ago

This seems like the best possible explanation. It's a one liner fix probably. Points will be awarded to the first GSoC applicant that sends a proper PR :-)


abhishek-vinjamoori commented 8 years ago

According to the standard, the "*" (42) must be replaced with "@" only when the packet is X/26:

if (y == 26) // But currently the * is addressed at y = 22
{
    // And Mode Description = 10000 and Data = 0101010.
    if (data == 64 && mode == 0x10)
    {
        ctx->page_buffer.text[i][j] = 0x0040;
    }
}

Given how decoding currently works:

if (data == 10 && mode == 2 && ctx->page_buffer.text[y][k] == 42 && default_g0_charset == LATIN)
    ctx->page_buffer.text[y][k] = 0x0040; // Special case only for @

Here k is iterated from 0 to 39.

abhishek-vinjamoori commented 8 years ago

I need a file in which "*" is actually used. (With that, it can be verified whether the data/mode are different when * actually appears.)

hurda commented 8 years ago

I need a file where in "*" is actually used.

http://www.mediafire.com/download/apc078mz884gbkr/teletext_subtitles_with_asterisk_pcr-pts.ts

abhishek-vinjamoori commented 8 years ago

Could this be hosted somewhere else? Mediafire is blocked here.

hurda commented 8 years ago

http://www111.zippyshare.com/v/Gc65zLfD/file.html

abhishek-vinjamoori commented 8 years ago

1
00:00:03,560 --> 00:00:06,160
schon wieder Streit, Doris.

2
00:00:10,360 --> 00:00:12,000

Is this the desired output?

hurda commented 8 years ago

Have you omitted some lines for brevity?

Here are all subtitles of that sample-video:

1
00:00:03,560 --> 00:00:06,160
Mach nicht
schon wieder Streit, Doris.

2
00:00:06,240 --> 00:00:07,320
Grüße an Jesus!

3
00:00:07,400 --> 00:00:08,560
* Diana seufzt. *

4
00:00:08,640 --> 00:00:10,280
* Sie lässt den Motor an. *

5
00:00:10,360 --> 00:00:12,000
* Der Motor heult auf. *

6
00:00:14,200 --> 00:00:15,960
* schelmische Musik *

That's with ccextractor 0.79.

abhishek-vinjamoori commented 8 years ago

Yes, that is the output I'm getting. Is there any other file with "@"? It would be really helpful.

hurda commented 8 years ago

I only got files with the same "untertitel*orf.at"-output.

abhishek-vinjamoori commented 8 years ago

But are they different files?

hurda commented 8 years ago

Yes. http://www36.zippyshare.com/v/BwLLxb8i/file.html

cfsmp3 commented 8 years ago

Solved in current github version.