hurda closed this issue 8 years ago
Addendum: Also affects other output-formats, like SAMI and TTXT.
When I try to download the file above, it shows as deleted.
Please reopen when a working link is available.
I'm looking into this. That * is written to the buffer here:
ctx->page_buffer.text[y][i] = telx_to_ucs2(packet->data[i]);
packet->data[i] contains 42 (0x2a) which is indeed an asterisk http://www.columbia.edu/kermit/ucs2.html
I'm not able to download the file. Can you please re-upload it?
It seems to work fine for me... is there an error message when trying to download the file?
It is because the college proxy server is denying the download.
OK, I've uploaded it to slack. Maybe it can be downloaded from there?
OK, thanks a lot for your support.
I'm trying to track that 0x2a byte back through the pipeline to find where things went wrong, if anywhere; but I need to know what the adequate value would be.
For this test video telx_to_ucs2() converts 0x40 to '§' rather than '@' because of the local language (German) substitutions in the basic character set specified in ETS 300-706. Is that OK?
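The substitution described above can be illustrated with a minimal sketch. It is not the telxcc implementation: the function name, the enum and the two-entry table are hypothetical, and only position 0x40 is modelled, using the two values mentioned in this thread ('§' for the German subset, '@' for the English one). The real national option subsets replace several more positions.

```c
#include <stdint.h>

/* Hypothetical, partial model of the Latin national option subsets:
 * only position 0x40 is covered. */
typedef enum { SUBSET_ENGLISH, SUBSET_GERMAN } national_subset;

/* Map a 7-bit G0 Latin code to UCS-2, applying the national option
 * substitution at position 0x40 only. */
uint16_t g0_latin_to_ucs2(uint8_t c, national_subset subset)
{
    if (c == 0x40) {
        return (subset == SUBSET_GERMAN) ? 0x00A7  /* section sign */
                                         : 0x0040; /* at sign */
    }
    /* Sketch only: every other position is passed through unchanged. */
    return c;
}
```

With the German subset selected, no input of this function yields 0x0040, which matches the observation that '@' cannot come out of the basic set alone for this stream.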
@isacdaavid, ideally telx_to_ucs2() should receive an input of 64 (0x40) to return "@". But with the current decoding there is no input for which the output is "@".
I think this bug is invalid after all. The asterisk is really there at offset 0xDBF164 in the file (the value is 0x54, which is 0x2A with its bit order reversed), and OP's software is responsible for outputting the at sign.
I followed that particular 0x2A back to tlt_process_pes_packet(), where the stored 0x54 has its bit order reversed. After failing to find another transformation through several function calls and buffers all the way back to get_cinfo(), I became suspicious that ccextractor had been doing the right thing all along. Sure enough, I searched the binary file for "a7 37 54 f7" after printing the adjacent values according to ccextractor, and after changing that "54" to "02" (bit-reversed 0x40) the '§' appeared in the .srt output, as expected from my previous post.
I can provide the hex-edited video and my debugging patch/pull request. If you like, I could also implement a change to telx_to_ucs2() that would output an at sign when it finds an asterisk, but I guess you don't want to introduce such behaviour. I'm interested in proving that I know some git and can make useful changes to your codebase, but I fear that if this bug gets closed without needing a patch then I will not have earned the points for my GSoC application :(
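The byte values quoted above can be checked with a small sketch of a bit-order reversal. The function names are made up for illustration and this is not the code in tlt_process_pes_packet(); it only shows that 0x54 and 0x02 in the file correspond to 0x2A and 0x40 after reversal.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of one byte (bit 0 <-> bit 7, bit 1 <-> bit 6, ...). */
uint8_t reverse_bits8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1)); /* append the lowest bit of b to r */
        b >>= 1;
    }
    return r;
}

/* Check the two byte pairs mentioned in the thread. */
void check_thread_values(void)
{
    assert(reverse_bits8(0x54) == 0x2A); /* stored byte -> asterisk code */
    assert(reverse_bits8(0x02) == 0x40); /* hex-edited byte -> position 0x40 */
}
```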
The issue is with supplementary charsets.
A good starting point for researching this is to google "supplementary charsets teletext". There are other teletext applications around; surely some of them get this right and we can learn from them.
This is not about replacing one char with another generically (obviously that might work for this specific file, but it would break many others) but rather about completing the supplementary charset implementation.
A good thing is that teletext specifications are public and totally free, so this also serves as an introduction to standards documents :-)
Notes to GSoC applicants
A good starting point for researching this is to google "supplementary charsets teletext". There are other teletext applications around; surely some of them get this right and we can learn from them.
Extracting the subtitles using ProjectX 0.91.0.10 portable also outputs "@". Maybe it helps.
EDIT: http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/subtitle/Teletext.java?view=annotate http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/subtitle/CharSet.java?view=markup
http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/parser/StreamProcessTeletext.java?view=annotate http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/parser/StreamProcessSubpicture.java?view=annotate
EDIT2: VLC (2.2.2) shows @ too. http://git.videolan.org/?p=vlc.git;a=blob;f=modules/codec/telx.c;h=4f8842a95f4a94cb326d3e48234014852f04c235;hb=HEAD
To check this, you'll have to use this file: http://www.mediafire.com/download/7fwbqdw57sxykby/at-sign_teletext_pcr-pts.ts The other has a difference between the PCR-timestamps of A/V and Teletext of almost three hours, which VLC apparently can't handle.
EDIT3:
and after changing that "54" to "02" (reversed 0x40) the '§' appeared in the .srt output as expected from my previous post.
While this changes what is being output by ccextractor, DVBViewer, ProjectX and VLC are still showing @. How's that possible?
Quick update: I couldn't find the @ in any of the supplementary (AKA G2) character sets. I still need to find more information on the second G0 sets (mentioned in section 15.3 in the ETS 300 706) and modified G0 and G2 sets (part of Teletext level 2.5 and 3.5, mentioned in section 15.4 in the standard).
@hurda Thanks. I will definitely see what other projects are doing if I fail to find a satisfactory explanation in those extra character sets.
EDIT: I found it. This weird behaviour seems to have been introduced in ETS 300 706 version 1.2.1 from 2003 as a marginal note in section 15.6.1 about the basic G0 Latin character set. Quoting from it:
NOTE 3: The @ symbol replaces the * symbol at position 2/A when the table is accessed via a packet X/26 Column Address triplet with Mode Description = 10 000 and Data = 0101010. See clause 12.2.4.
Time to implement it!
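A minimal sketch of the rule in NOTE 3, taking its wording literally (Mode Description 10 000 is 0x10, Data 0101010 is 0x2A). The macro and function names are hypothetical and this is not the change that went into ccextractor; the values actually seen in a stream should be checked against a sample before relying on it.

```c
#include <stdint.h>

#define X26_MODE_NOTE3 0x10 /* Mode Description = 10 000 (binary) */
#define X26_DATA_NOTE3 0x2A /* Data = 0101010 (binary), position 2/A */

/* Hypothetical helper: return the UCS-2 code for a G0 character addressed
 * through a packet X/26 column address triplet. */
uint16_t x26_g0_char_to_ucs2(uint8_t mode, uint8_t data)
{
    if (mode == X26_MODE_NOTE3 && data == X26_DATA_NOTE3) {
        return 0x0040; /* '@' replaces '*' for this access path only */
    }
    /* Sketch only: no national option or diacritical mark handling. */
    return data;
}
```

Keeping the special case inside the X/26 path means an asterisk written through a normal display row is left alone, which is the distinction the note draws.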
It's in table 36 (Latin National Option subset), in the English row. Page 115 of ETS 300 706: May 1997
This weird behaviour seems to have been introduced in ETS 300 706 version 1.2.1 from 2003 as a marginal note in section 15.6.1 about the basic G0 Latin character set.
Good catch! It doesn't help that the first search engine results for "ets 300 706" are for the 1997 version of the spec. telxcc.c references the 1997 spec, but not the 2003 one. Here's the link: http://www.etsi.org/deliver/etsi_en/300700_300799/300706/01.02.01_60/en_300706v010201p.pdf
PS: It's actually clause 12.3.4, at the bottom of table 29.
This seems like the best possible explanation. It's a one liner fix probably. Points will be awarded to the first GSoC applicant that sends a proper PR :-)
According to the standard mentioned, only when the packet is X/26 must the "*" (42) be replaced with "@":
if (y == 26) // But currently the * is addressed at y = 22
{
    // And Mode Description = 10000 and Data = 0101010.
    if (data == 64 && mode == 0x10)
    {
        ctx->page_buffer.text[i][j] = 0x0040;
    }
}
According to the current state of the decoding:
if (data == 10 && mode == 2 && ctx->page_buffer.text[y][k] == 42 && default_g0_charset == LATIN)
    ctx->page_buffer.text[y][k] = 0x0040; // Special case only for @
k is iterated from 0 to 39.
I need a file in which "*" is actually used. (With that it can be verified whether the data/mode are different when * actually appears.)
I need a file where in "*" is actually used.
http://www.mediafire.com/download/apc078mz884gbkr/teletext_subtitles_with_asterisk_pcr-pts.ts
Could this be hosted somewhere else? Mediafire is blocked here.
1
00:00:03,560 --> 00:00:06,160
schon wieder Streit, Doris.
2
00:00:10,360 --> 00:00:12,000
Is this the desired output?
Have you omitted some lines for brevity?
Here are all subtitles of that sample-video:
1
00:00:03,560 --> 00:00:06,160
Mach nicht
schon wieder Streit, Doris.
2
00:00:06,240 --> 00:00:07,320
Grüße an Jesus!
3
00:00:07,400 --> 00:00:08,560
* Diana seufzt. *
4
00:00:08,640 --> 00:00:10,280
* Sie lässt den Motor an. *
5
00:00:10,360 --> 00:00:12,000
* Der Motor heult auf. *
6
00:00:14,200 --> 00:00:15,960
* schelmische Musik *
That's with ccextractor 0.79.
Yes. That is the output I'm getting. Is there any other file with "@"? It would be really helpful.
I only got files with the same "untertitel*orf.at"-output.
But are they different files?
Solved in current github version.
CCExtractor 0.77 and git-677fee4 File: http://www.mediafire.com/download/q6ebvmzwe1prvi3/cce-at-sign.7z (25MB)
Teletext:
SRT: