CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

False + True entries for each subtitle from one TV channel #562

Closed PFDuke closed 7 years ago

PFDuke commented 7 years ago

When I record TV from Australia's SBS channel and extract the subtitles to an SRT file, each subtitle has two entries. The first entry is false and has zero or near zero length while the second entry is correct. The text is identical. Since the false entries are so short, in practice they don't cause any problems, except to double the length of the SRT file and to make editing a bit more complicated. The main problem is that I have a tidy mind... :(

This problem does not occur with the other TV channels that I have tried.

I have fiddled with a few settings in CCExtractor GUI, but they didn't help.

Two examples are stored in my Dropbox account: https://www.dropbox.com/sh/6mpifvmm6ofw2so/AABcHbTWNlGv0t4Sr28OrkKVa?dl=0

saurabhshri commented 7 years ago

@PFDuke Would you please share the exact command you used to extract the subtitles? If you are using the GUI version, the command appears in the box below.

PFDuke commented 7 years ago

I am using the GUI.

I was hoping that you could tell me if there is a problem with the sample video files from this TV channel, and if you could tell me how to avoid the duplicated subtitle entries.

Have you tried to extract .srt subtitles from my files?

Peter Duke


saurabhshri commented 7 years ago

@PFDuke Yes I did, and indeed the two entries exist. I'll spend the day reading about SBS channel's CC specifications to see if it's intentional or there's some bug. Please do not remove the samples in the meantime. :)

PFDuke commented 7 years ago

Thanks

I am hoping that there is some straightforward way to prevent the double entries, but if not, there is always the text editor. :)

Peter


saurabhshri commented 7 years ago

@PFDuke We can always add an option (enabled through a parameter) that drops any subtitle whose length is zero. :) But I'll keep this as a last resort.

saurabhshri commented 7 years ago

@PFDuke I thoroughly searched the web for SBS's subtitling specification. While they boast of their state-of-the-art subtitles, I couldn't find those specifications anywhere. Also, they haven't replied to my email yet.

I was able to solve the issue though.

@cfsmp3 What do you think the ideal behaviour should be? Should I add an extra parameter that does this job? If so, what should I name it? (-nozerolength ?)

Or should I make this change permanent and always ignore any subtitle whose starting and ending timestamps are the same (zero length)?

I think we should simply add a new parameter for the people who need it. That way the zero-length subtitle entries are preserved by default, in case some people need that information.
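For anyone who would rather clean up an existing SRT file than wait for a new parameter, the filtering idea above can be sketched as a small post-processing script. This is only an illustrative sketch in Python, not the CCExtractor patch itself: the function name drop_short_entries and the min_ms threshold are assumptions made up for this example, and it assumes a plain SRT layout (index line, timing line, text lines, blank line between entries).

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(h, m, s, ms):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def drop_short_entries(srt_text, min_ms=1):
    """Return srt_text without entries shorter than min_ms, renumbered from 1.

    Hypothetical helper: with the default min_ms=1 only entries whose start
    and end timestamps are identical (or reversed) are dropped; a larger
    value also drops "near zero" entries.
    """
    normalized = srt_text.replace("\r\n", "\n").strip()
    blocks = re.split(r"\n\s*\n", normalized)
    kept = []
    for block in blocks:
        lines = block.split("\n")
        # Find the timing line; the index line normally precedes it.
        idx = next((i for i, line in enumerate(lines) if TIME_RE.search(line)), None)
        if idx is None:
            continue  # not a subtitle entry, skip it
        fields = TIME_RE.search(lines[idx]).groups()
        duration = _to_ms(*fields[4:]) - _to_ms(*fields[:4])
        if duration < min_ms:
            continue  # zero-length (or too short) entry
        kept.append(lines[idx:])  # timing line plus text, without the old index
    out = ["\n".join([str(n)] + lines) for n, lines in enumerate(kept, start=1)]
    return "\n\n".join(out) + "\n" if out else ""
```

Reading the SRT file into a string, passing it to drop_short_entries, and writing the result back out would remove the zero-length duplicates and renumber the remaining entries; raising min_ms (for example to 50) would also catch the "near zero" ones.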

PFDuke commented 7 years ago

I wonder whether this behaviour has something to do with the use of their live captioning equipment. I have included the start of the Vienna New Year's Day Concert, which was live captioned. You will see that the captions are delayed and build up progressively. The two captioning examples I first gave you did not have to be generated in real time, so there is no great delay, but the zero length entry may be a hangover from using the same equipment. Just a guess, however.

I feel awkward about asking for special treatment if no one else is worried about my problem. In any case I think I should first look at some more examples to see how predictable it is.

Thanks and Happy New Year yourself.

Peter Duke.


saurabhshri commented 7 years ago

@PFDuke It's completely fine, and thanks for filing the bug. The issue appears to be deeper than I thought. If you could provide a few more samples, that would be great.

In the meantime, I have made a quick patch which should remove those "zero length" subtitles for you. If you are comfortable compiling your own version, here's the patch: https://github.com/saurabhshri/ccextractor/tree/BugFix

Simply clone it and build. Use the parameter -nonzerolength to remove the zero-length subtitles:

./ccextractor elephant.ts -nonzerolength