CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

Extraction from bin file does not honor -unixts and -UCLA #667

Closed Liontooth closed 7 years ago

Liontooth commented 7 years ago

Create a bin file from a DVB transport stream:

ccextractor -ts -pn $PN -out=bin -o $FIL.bin $DIR/$FIL.$EXT

Extracting the text from this bin file:

ccextractor -in=bin -pn 53007 -tpage 891 -datets -ttxt -UCLA -noru -utf8 -parsepat -parsepmt -unixts 1485198721 -o 2017-01-23_1912_FR_TV5_Géopolitis.ccx.out 2017-01-23_1912_FR_TV5_Géopolitis.bin

results in wrong timestamps, a messed up third field, and an extra |:

19700101000109.360|19700101000112.520|CC?||Bonjour, bienvenue dans cette edition de Geopolitis.

while extraction from the transport stream produces the correct output:

20170123191310.360|20170123191313.520|891|Bonjour, bienvenue dans cette edition de Geopolitis.

Let me know if you need samples; this likely holds for any file.
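For reference, the two outputs above differ by exactly the -unixts value: the bin extraction prints the caption offsets relative to 1970-01-01, while the transport stream extraction adds them to the reference time. A minimal sketch of that arithmetic follows, assuming the -unixts value is a UTC epoch in seconds and the offsets are in milliseconds. The function name `absolute_stamp` is hypothetical and is not taken from the CCExtractor source.

```python
from datetime import datetime, timedelta, timezone


def absolute_stamp(unixts: int, offset_ms: int) -> str:
    """Format (reference epoch + caption offset) as YYYYMMDDhhmmss.mmm in UTC.

    Hypothetical helper for illustration only; not CCExtractor's code.
    """
    moment = datetime.fromtimestamp(unixts, tz=timezone.utc)
    moment += timedelta(milliseconds=offset_ms)
    # strftime has no millisecond directive, so derive it from microseconds.
    return moment.strftime("%Y%m%d%H%M%S.") + f"{moment.microsecond // 1000:03d}"


# Caption offsets taken from the report: 69.360 s and 72.520 s into the file.
start_ms, end_ms = 69_360, 72_520

# With the reference time ignored (treated as 0), as in the bin output.
print(absolute_stamp(0, start_ms), absolute_stamp(0, end_ms))

# With -unixts 1485198721 applied, as in the transport stream output.
print(absolute_stamp(1485198721, start_ms), absolute_stamp(1485198721, end_ms))
```

The first pair reproduces the 1970 timestamps from the bin extraction and the second pair reproduces the expected 2017 timestamps, which suggests the reference time is simply not applied on the bin input path.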

cfsmp3 commented 7 years ago

Should be easy to fix. GSoC qualification: Solving this issue gives 2 points.

barun511 commented 7 years ago

Could I have samples please? I'll give this a shot.

cfsmp3 commented 7 years ago

You can probably use any of the teletext ones from here:

http://ccextractor.org/doku.php?id=public:general:tvsamples

barun511 commented 7 years ago

I can't seem to reproduce this. Is there a particular sample that you noticed this on?

Liontooth commented 7 years ago

http://vrnewsscape.ucla.edu/dropbox/2017-01-23_1912_FR_TV5_G%c3%a9opolitis.bin

cfsmp3 commented 7 years ago

Confirmed. I'll let GSoC applicants give it a go though since it's not too hard.

saurabhshri commented 7 years ago

Also, in this case (teletext), extracting from the bin file reports "No captions were found in input." and returns exit code 10, even though the captions are extracted properly.

cfsmp3 commented 7 years ago

Please send fix for that :-)

alexandrumc commented 7 years ago

@Liontooth, can you post here the DVB transport stream?

saurabhshri commented 7 years ago

@cfsmp3 @Liontooth While fixing this, I am running into timing issues. Here is what I mean:

From TS :

20170123191246.080|20170123191249.060|801|<font color="#00ffff">Disappearing? Are you sure, Mofy</font>
20170123191249.160|20170123191253.020|801|I've just seen it! I couldn't believe my eyes.
20170123191253.120|20170123191255.260|801|Mogu, your bag!

From .bin

20170123191245.980|20170123191248.960|801|<font color="#00ffff">Disappearing? Are you sure, Mofy?</font>
20170123191249.060|20170123191252.920|801|I've just seen it! I couldn't believe my eyes.
20170123191253.020|20170123191255.160|801|Mogu, your bag!

But then I found out that a few lines are also missing when using the .bin file (see https://github.com/CCExtractor/ccextractor/issues/699 ).

Since the timings are correct when extracting without -unixts, the mistake must be on my side. I am trying to fix it. :)
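To quantify the mismatch, the start and end fields of the two outputs can be compared pairwise. The sketch below assumes the pipe-separated layout shown above (start|end|page|text) and uses hypothetical helper names (`parse_stamp`, `offsets_ms`). The caption text is omitted from the sample lines because only the first two fields are used.

```python
from datetime import datetime, timedelta


def parse_stamp(stamp: str) -> datetime:
    """Parse a YYYYMMDDhhmmss.mmm timestamp as printed in the output above."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")


def offsets_ms(ts_lines, bin_lines):
    """Return (start, end) differences in milliseconds, TS minus bin, per line."""
    one_ms = timedelta(milliseconds=1)
    result = []
    for ts_line, bin_line in zip(ts_lines, bin_lines):
        ts_start, ts_end = ts_line.split("|")[:2]
        bin_start, bin_end = bin_line.split("|")[:2]
        result.append((
            (parse_stamp(ts_start) - parse_stamp(bin_start)) // one_ms,
            (parse_stamp(ts_end) - parse_stamp(bin_end)) // one_ms,
        ))
    return result


# Timestamps and page number copied from the two outputs above.
from_ts = [
    "20170123191246.080|20170123191249.060|801|",
    "20170123191249.160|20170123191253.020|801|",
    "20170123191253.120|20170123191255.260|801|",
]
from_bin = [
    "20170123191245.980|20170123191248.960|801|",
    "20170123191249.060|20170123191252.920|801|",
    "20170123191253.020|20170123191255.160|801|",
]

for pair in offsets_ms(from_ts, from_bin):
    print(pair)
```

For these three lines every start and end time from the .bin file is 100 ms earlier than the one from the transport stream, so the discrepancy is a constant shift rather than drift.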

saurabhshri commented 7 years ago

I was unnecessarily calculating deltas, and there was a mistake somewhere in that calculation. The solution was staring me right in the face :P The timing is correct now (in PR #700 ).