
CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team

https://www.ccextractor.org

GNU General Public License v2.0


Word-by-word tagging #120

Closed by brannondorsey 10 years ago

brannondorsey commented 10 years ago

Hi there,

I am interested in identifying the precise in and out timestamps of specific words embedded in the closed caption data of an MPEG-2 stream. It seems that with CCExtractor, only lines of text are indexed in this way. Is this a limitation of CCExtractor specifically, or of the CC standards used in digital broadcast? If this functionality is not directly built into CCExtractor, would you have any suggestions as to how to extract and use this very specific data?

anshul1912 commented 10 years ago

It is possible only if the video has word-by-word timing inside. Do you have a video where the words of a line are shown at different times?

For example, take the caption "This is caption": first "This" is shown at second x, then "is" at second x+1, then "caption" at second x+2.

I have never seen a video like that, though people do make captions where they show "This", then "This is", then "This is caption". The data is redundant there, but people use it for effect.
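For captions built up cumulatively like that, per-word onset times could in principle be recovered by diffing successive caption states. The following is only a sketch: the function name `word_onsets` and the `(time, text)` input format are assumptions for illustration, not something CCExtractor outputs in this shape.

```python
def word_onsets(snapshots):
    """Derive per-word onset times from cumulative caption states.

    snapshots: list of (time_seconds, text) pairs in display order
               (hypothetical input format, assumed for this sketch).
    Returns a list of (word, time_first_seen) pairs.
    """
    onsets = []
    prev_words = []
    for t, text in snapshots:
        words = text.split()
        # If this state does not extend the previous one, the line was
        # replaced rather than grown, so every word counts as new.
        if words[:len(prev_words)] != prev_words:
            prev_words = []
        for w in words[len(prev_words):]:
            onsets.append((w, t))
        prev_words = words
    return onsets


print(word_onsets([(10.0, "This"), (11.0, "This is"), (12.0, "This is caption")]))
```

This only gives the time a word first appeared on screen, and it assumes whole words are added at each step.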

In closed captions, timing is generally taken from the PES packet that contains the caption data, so if each word is in a different PES packet, you can get that timing. I don't think any sane closed caption encoder displays one word at a time with each frame; that would decrease the readability of the text and would not be useful either.
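To make the PES timing concrete, here is a minimal sketch of reading the presentation timestamp (PTS) from the start of an MPEG-2 PES packet. The PTS is a 33-bit value on a 90 kHz clock, spread over five header bytes with marker bits in between. The function name `pes_pts_seconds` is made up for this example, and it assumes the PES packet has already been demultiplexed from the stream; it does not handle PTS wraparound.

```python
def pes_pts_seconds(pes: bytes):
    """Return the PTS of a PES packet in seconds, or None if absent.

    Assumes `pes` begins at the 00 00 01 packet start code and uses the
    standard MPEG-2 PES header layout (flags in byte 7, PTS in bytes 9-13).
    """
    if len(pes) < 14 or pes[0:3] != b"\x00\x00\x01":
        return None
    # Top bit of byte 7 is set when a PTS field is present.
    if not (pes[7] & 0x80):
        return None
    b = pes[9:14]
    pts = (
        (((b[0] >> 1) & 0x07) << 30)  # PTS bits 32..30
        | (b[1] << 22)                # PTS bits 29..22
        | ((b[2] >> 1) << 15)         # PTS bits 21..15
        | (b[3] << 7)                 # PTS bits 14..7
        | (b[4] >> 1)                 # PTS bits 6..0
    )
    return pts / 90000.0  # 90 kHz clock ticks to seconds


# Hand-built header carrying PTS = 900000 ticks.
header = bytes([0x00, 0x00, 0x01, 0xE0, 0x00, 0x00, 0x80, 0x80, 0x05,
                0x21, 0x00, 0x37, 0x77, 0x41])
print(pes_pts_seconds(header))
```

Every caption byte carried in the same packet shares this one timestamp, which is why words only get distinct times if they arrive in distinct packets.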

Can you elaborate on why you are interested in identifying the precise in and out timestamps of specific words?

cfsmp3 commented 10 years ago

For captions transmitted in roll-up we could have word-by-word timing (since characters are displayed as received); however in roll-up, which is used mostly for newscasts and other content transcribed in real time, there's no lipsync (captions are at least a couple seconds behind audio) so there's no value in doing that either.
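As a rough illustration of the roll-up case, the sketch below groups a timed character stream into words with the arrival time of their first and last characters. The input format (a list of `(time, text)` chunks with control codes already stripped) and the name `words_from_char_stream` are assumptions for the example; this is not an existing CCExtractor output.

```python
def words_from_char_stream(chunks):
    """Group timed caption characters into words.

    chunks: iterable of (time_seconds, text) in arrival order
            (hypothetical format; control codes assumed already removed).
    Returns a list of (word, first_char_time, last_char_time) tuples.
    """
    words = []
    current = ""
    start = end = None
    for t, text in chunks:
        for ch in text:
            if ch.isspace():
                # Whitespace ends the word in progress, if any.
                if current:
                    words.append((current, start, end))
                    current = ""
            else:
                if not current:
                    start = t  # first character of a new word
                current += ch
                end = t
    if current:
        words.append((current, start, end))
    return words


print(words_from_char_stream([(0.0, "Th"), (0.033, "is"), (0.066, " i"), (0.1, "s ")]))
```

These are display times only. Since roll-up captions trail the audio, they say when a word reached the screen, not when it was spoken.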

In reply to Brannon Dorsey's message of Sun, Nov 9, 2014, 8:35 AM.

brannondorsey commented 10 years ago

Hi all, thank you both for your timely responses! It seems that no closed captioning will give me the level of control I am looking for in tagging words in television programs. I am looking to create a database of precise in and out points of words in network TV and movies. That said, CCExtractor is a really fine piece of software.