CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

[GSoC] [PROPOSAL] Add YouTube-TV Captions Support. #931

Closed saurabhshri closed 4 years ago

saurabhshri commented 6 years ago

CCExtractor version (using the --version parameter preferably) : 0.87

Necessary information

One of the GSoC ideas we're planning is adding support for live TV over the internet, and YouTube TV is one of the services. Read more about the project idea here: https://www.ccextractor.org/public:gsoc:livetvooverinternet.

We can use this issue to discuss progress and findings, since there is no official specification yet.

Here's the summary of research so far :

  1. For recorded shows (DVR), the video player fetches captions by sending a request to /api/timedtext, and the response is an XML file containing the timed transcription. These transcriptions can be roll-up or normal, which I guess depends on the parameters sent with the request.

Here's a sample request (I've replaced sensitive information with the @ symbol):

/api/timedtext?xorp=True&key=yttt1&sparams=asr_langs%2Ccaps%2Cv%2Cxorp%2Cexpire&v=@@@@@@&caps=asr&signature=@@@@@@@@@@@@@@@@@@@@@@.@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@&asr_langs=pt%2Cit%2Cfr%2Cnl%2Ces%2Cru%2Cko%2Cen%2Cde%2Cja&hl=en_GB&expire=1518827803&lang=es&name=CC1&fmt=srv3

and here's a sample response snippet :

<?xml version="1.0" encoding="utf-8" ?><timedtext format="3">
<head>
<pen id="1" i="1" fs="3"/>
<ws id="1" ju="0"/>
<wp id="3" ap="0" ah="0" av="0" cc="31"/>
<wp id="5" ap="0" ah="0" av="0" rc="1" cc="31"/>
<wp id="4" ap="0" ah="0" av="0" rc="2" cc="31"/>
<wp id="6" ap="0" ah="0" av="0" rc="3" cc="31"/>
<wp id="9" ap="0" ah="0" av="55" rc="3" cc="31"/>
<wp id="8" ap="0" ah="0" av="60" rc="2" cc="31"/>
<wp id="7" ap="0" ah="0" av="65" cc="31"/>
<wp id="2" ap="0" ah="0" av="65" rc="1" cc="31"/>
<wp id="1" ap="0" ah="0" av="70" cc="31"/>
<wp id="10" ap="0" ah="0" av="70" rc="2" cc="31"/>
</head>
<body>
<w t="16182" d="1869" id="1" wp="1" ws="1"/>
<p t="16182" d="1869" w="1" p="1">&quot;The Walking Dead&quot;...</p>
<w t="18051" d="1034" id="1" wp="2" ws="1"/>
<p t="18051" d="1034" w="1">Th-They were warning shots
above your head.</p>
<w t="19085" d="1735" id="1" wp="2" ws="1"/>
<p t="19085" d="1735" w="1">He wasn&#39;t shooting
at you.</p>
<w t="20820" d="4071" id="1" wp="1" ws="1"/>
<p t="20820" d="4071" w="1"><s>This place</s><s p="1">is</s><s>gonna fall.</s></p>
<w t="24891" d="2102" id="1" wp="2" ws="1"/>
<p t="24891" d="2102" w="1">I am not dying until
I am damn good and ready.</p>
<w t="26993" d="1335" id="1" wp="1" ws="1"/>
<p t="26993" d="1335" w="1">[Tires squeal]</p>
<w t="28328" d="3670" id="1" wp="1" ws="1"/>
<p t="28328" d="3670" w="1">[Walkers snarling]</p>

Here's the complete response attached : timedtext.txt.
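To make the XML above easier to reason about, here is a minimal Python sketch that turns a format-3 response into (start, end, text) cues. It assumes that `t` is the start time in milliseconds and `d` the duration in milliseconds (in the sample, `t + d` of one `<p>` equals `t` of the next), and that `<s>` segments inside a `<p>` can be joined with single spaces. Neither is confirmed by any specification. `parse_timedtext` and `SAMPLE` are hypothetical names, and `SAMPLE` is a cut-down copy of the snippet with closing tags added so that it parses.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the snippet above, closed off so it is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?><timedtext format="3">
<body>
<w t="16182" d="1869" id="1" wp="1" ws="1"/>
<p t="16182" d="1869" w="1" p="1">&quot;The Walking Dead&quot;...</p>
<p t="20820" d="4071" w="1"><s>This place</s><s p="1">is</s><s>gonna fall.</s></p>
</body>
</timedtext>"""


def parse_timedtext(xml_bytes):
    """Return a list of (start_ms, end_ms, text) tuples, one per <p> element."""
    root = ET.fromstring(xml_bytes)
    cues = []
    for p in root.iter("p"):
        # Assumption: t = start time in ms, d = duration in ms.
        start = int(p.get("t", "0"))
        duration = int(p.get("d", "0"))
        # itertext() also yields text nested in <s> segments; collapsing
        # whitespace joins segments and line breaks with single spaces.
        text = " ".join(" ".join(p.itertext()).split())
        if text:
            cues.append((start, start + duration, text))
    return cues


if __name__ == "__main__":
    for start, end, text in parse_timedtext(SAMPLE):
        print(start, end, text)
```

The `<w>`, `<wp>`, `<ws>` and `<pen>` elements (window, position and style definitions, going by their names) are ignored here; a real converter would need them to reproduce roll-up behaviour and positioning.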

I sniffed the traffic manually to obtain this, but of course we don't want to depend on a browser and inspect traffic by hand, and automating everything may be hard. [On a related note: we had a GSoC project (created by @abhishek-vinjamoori) two years ago that scraped subtitles from online services such as Netflix.]
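As a small first step toward automating this, the query string of a captured request can be split into its parameters with the Python standard library. This is only a sketch: `REQUEST` is a shortened copy of the /api/timedtext sample with the redacted fields left out, `request_params` is a hypothetical helper, and what each parameter means to the server (for example, whether `sparams` lists the parameters covered by the signature) is a guess from the names, not something established here.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the sample request; redacted fields are omitted.
REQUEST = (
    "/api/timedtext?xorp=True&key=yttt1"
    "&sparams=asr_langs%2Ccaps%2Cv%2Cxorp%2Cexpire"
    "&caps=asr&hl=en_GB&expire=1518827803&lang=es&name=CC1&fmt=srv3"
)


def request_params(url):
    """Return (path, params), where params maps each key to its first value."""
    parts = urlsplit(url)
    # parse_qs percent-decodes values and returns a list per key.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


path, params = request_params(REQUEST)
# sparams decodes to a comma-separated list of parameter names.
listed = params["sparams"].split(",")

if __name__ == "__main__":
    print(path, params["lang"], params["name"], params["fmt"], listed)
```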

  2. Captions in live shows are fetched as the broadcast progresses. I contacted YouTube TV support (who were fantastic and super supportive, by the way; they had not encountered such a request before, but were happy to look into it and reply). Their response was basically that the broadcast is a single stream, unaltered as received by them. Since it was not the developers I talked to, we should probably not take this at face value and should do our own research.

The live captions are fetched by sending a request to /api/manifest/rawcc, and the response is some binary data prefixed by RCC.

Here's a sample request (I've replaced sensitive information with the @ symbol):

/api/manifest/rawcc?id=@@@@@@@@@&itag=133&source=yt_tv_broadcast&cmbypass=yes&ctier=UL&ei=@@@@@@@@@@@@@&gcr=us&hightc=yes&playlist_type=DVR&ratebypass=yes&live=1&cpn=@@@@@@@@@@&mpd_version=5&ip=199.116.72.167&ipbits=0&expire=1518825649&sparams=ip,ipbits,expire,id,itag,source,cmbypass,ctier,ei,gcr,hightc,playlist_type,ratebypass,live&signature=@@@@@@@@@@@@@@@@@@@@@@@@@.@@@@@@@@@@@@@@@@@@@@@&key=dg_yt0&alr=yes&c=WEB_UNPLUGGED&cver=0.1&sq=1329958

and here's a sample response snippet :

RCC{¼šEv{¿‰ÁÖ# H{Âwåò{ÅfE C#AV{ÈUyb{ËDÁ ƒ#E {Î3ïd{Ñ!ÄEÃ#A {Ôy {ÖþÁL#DE{ÙíLï{ÜÜ¿€C#AL{ßˏË{⺔/‚"?Ã$Šð‰{å¨{è—€€{놡€{îu€€{ñdË{ôR€€{÷A€€{ú/€€{ý{
€€{üЀ{뀀{ُ˜{È€€{·…{¦€€{•hì{ƒ€€{r—{`€€{ O…{#>€€{&-TÎ{)€€{,
T {.ù€€{1èWE{4×€€{7ÆÓT{:´€€{=£  {@‘€€{C€R{Fo€€{I^ƒ{LM€€{O;Óu{R*€€{Upå{X€€{Z÷òn{]倀{`Ôaô{c€€{f±uò{i €€{laì{o~€€{rl [{u[€€{xJÓÁ{{9€€{~(Ð]{€€{„, {†ó€€{‰âEv{ŒÑ€€{Àåò{’¯€€{•yb{˜Œ€€{›{ïd{žj€€{¡Yy {¤G€€{§6Lï{ª$€€{­Ë{°€€{²ñ{µà€€{¸Î¡€{»½€€{¾¬Ë{Á›€€{Ċ{Çx€€{ÊgЀ{ÍU€€{ÐD˜{Ó3€€{Ö"…{Ù€€{Ûÿhì{Þ{áݏ—{äÌ€€{ç»…{ê©€€{í˜TÎ{ð†€€{óuT {öd€€{ùSWE{üB€€{ÿ0ÓT{€€BŒð{,{ý€€{
ì†{
Ú€€{É  {·€€{¦R{•€€{„ƒ{s€€{"aÓu{%P€€{(?på{+.€€{.òn{1” {3úaô{6è”®B"Œð{9×uò{<Æ
”R…(œA)„Ã#’'{?µa쐑*{B¤ÙO{E’ [{HÕ C#YO{KpÓÁ{N_ÓTƒ#U {QNÐ]{T<ILÃ#ST{W+, {ZL #IL{]Ev{_÷ÎEC#L {bæåò{eÕEă#NE{hÃyb{k²
”pÃ#ED)œA)
{n¡  ïdC#’…'‘*{q—¡Ã#’{ty {wmTO{z\Lï{}J Ó#TO{€9Ë{ƒ(WEC# S{†€€{‰ETƒ#WE{‹ô{ŽãEÎÃ#ET{‘Ò¡€{”Á T#EN{—°Ë{šžÈEC# T{{ { Ѓ#HE{£jЀ{¦YOTÃ# P{©H˜{¬7®€#OT{¯%…{²”/B".ƒ$Š‰ð{µhì{·ò€€{ºá—{½Ï€€{À¾€€{쀀{ƛ…{Ɋ€€{ÌyTÎ{Ïh€€{ÒVT {ÕE€€{Ø4WE{Û#€€{ÞÓT{က{ãï  {æÝ€€{é̏R{컀€{廙{ò™€€{õ‡Óu{øv€€{ûepå{þT€€{Còn{1€€{ aô{
€€{ýuò{쀀ÂBŒð{Û,{Ê€€{¸ƒ{§€€{–aì{!…€€{$t [{'b€€{*QÓÁ{-?€€{0.Ð]{3€€{6, {8û€€{;éEv{>Ø€€{AÇåò{D¶€€{G¥yb{J“€€{M‚ïd{Pp€€{S_y {VN€€{Y=Lï{\,€€{_Ë{b  €€{dø{g瀀{jÖ¡€{mÄ€€{p³Ë{s¡€€{v{y€€{|nЀ{]€€{‚K˜{…:€€{ˆ)…{‹€€{Žhì{õ€€{“䏗{–Ò€€{™Á…{œ°€€{ŸŸTÎ{¢Ž€€{¥|T {¨k€€{«ZWE{®I€€{±8ÓT{´&€€{·  {º€€{¼òR{¿á€€{ÂЃ{Å¿€€{È­Óu{˜€€{΋på{Ñz€€{Ôiòn{×W€€{ÚFaô{Ý4€€{à#uò{ {æaì{èð€€{ëÞ [{îÍ€€{ñ¼ÓÁ{ô«€€{÷šÐ]{úˆ€€{ýw, {e€€{TEv{C€€{   2åò{!€€{yb{þ€€{íïd{Ü€€{Ëy {¹€€{ ¨Lï{#–€€{&…Ë{)t€€

The response is also attached as a binary file here : rawcc.txt .

Several such responses are received over time. The binary data probably contains the captions, but I haven't been able to make much sense of it.

After discussing with Carlos, I tried stripping the parity bit from each byte (applying &0x7f to each of them) and displaying the result as ASCII, but the output did not make much sense to me; maybe I did something wrong. I'd be happy to provide more such responses.
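For anyone who wants to repeat the parity experiment, here is a minimal Python sketch. It relies on one known fact, that CEA-608 bytes carry 7 data bits plus an odd-parity bit in the most significant position, and on one assumption, that the rawcc payload contains such bytes somewhere; the latter is exactly what is still unknown. The function names and the `sample` bytes are hypothetical and are not taken from the dump.

```python
def strip_parity(data: bytes) -> bytes:
    """Clear the most significant bit of every byte (same as b & 0x7F)."""
    return bytes(b & 0x7F for b in data)


def has_odd_parity(byte: int) -> bool:
    """True if the byte has an odd number of set bits, as CEA-608 bytes should."""
    return bin(byte).count("1") % 2 == 1


def printable(data: bytes) -> str:
    """Strip parity, then replace non-printable values with '.'."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else "." for b in strip_parity(data)
    )


# Hypothetical input: 'H' and two NULs with the parity bit set, plus 'E', 'T', 'O'.
sample = bytes([0xC8, 0x45, 0x80, 0x80, 0x54, 0x4F])

if __name__ == "__main__":
    print(printable(sample))
    print(all(has_odd_parity(b) for b in sample))
```

Counting how many bytes of a response pass `has_odd_parity` may be a quick way to tell caption bytes from framing or timestamp bytes, since the latter would be expected to pass only about half the time.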

I also tried looking at how they handle this response, and they are probably (not sure) doing it using: https://s.ytimg.com/yts/jsbin/player_ias-vfliDr87C/en_GB/captions.js

I tried going through it and set a few breakpoints, but the JS is not documented at all. All the variable names are single letters, and I don't really know (nor have adequate time to figure out) what's going on there.

The XML for DVR captions can also be understood from the JS linked above.

Please feel free to add your findings and correct me if I assumed something wrong.

meetDeveloper commented 6 years ago

I would like to work on this issue, but I currently don't have a YouTube TV subscription. I think I will be able to scrape the subtitles once I can look into how it works.

cfsmp3 commented 6 years ago

From what we're seeing - this is definitely not like scraping a web site.

meetDeveloper commented 6 years ago

Yeah, maybe. I would like to try it out and see the internal workings. I went to the website to get the free trial, but they ask for a card and I don't have one.

cfsmp3 commented 6 years ago

@meetDeveloper you can collaborate with @saurabhshri for now... work with him for the initial research.

While we will provide a youtube subscription to accepted GSoC students, we cannot do it for anyone else I'm afraid.

meetDeveloper commented 6 years ago

Have students for GSoC already been accepted?

meetDeveloper commented 6 years ago

@saurabhshri I want to study the source. I think I might be able to decode how live shows are captioned. How should I go about it?

canihavesomecoffee commented 6 years ago

If you'd check out the GSoC timeline, you'd notice that we're about a month away from starting to receive the proposals for GSoC, let alone accept students.

saurabhshri commented 6 years ago

@meetDeveloper Sure! DM me on Slack. Same username.

saurabhshri commented 6 years ago

If someone wants to look at the raw data that is decoded for the captions, I am attaching it all below.

I played the live shows and captured the raw caption data received by the player, along with the captions it displayed during that time period. Decoding the responses should yield at least some of the text in the transcriptions.

The first two folders, named FIRST and SECOND, contain multiple files, each a separate request/response object with everything that needs to be known about that transaction. The TRANSCRIPTION files contain the captions that were displayed, typed exactly as seen.

The THIRD folder contains two files. RESPONSES contains just the binary rawcc data that was sent; each response is marked by RCC at the beginning. The captions displayed during that time are in the TRANSCRIPTION file.

The captions in the first two directories are broadcast with a delay (probably captioned live), while in the third they are perfectly in sync.

https://drive.google.com/open?id=1H_npv96SpiJJhAbFtaJ9HMV7V1SfAg5r
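Since each response in the RESPONSES file starts with RCC, the concatenated dump can be cut back into individual responses with a few lines of Python. This is a sketch under one assumption: that the three bytes `RCC` never occur by chance inside a payload. With binary data that is not guaranteed, so a split at an odd place should be treated with suspicion. `split_responses` is a hypothetical helper name.

```python
from typing import List

MARKER = b"RCC"


def split_responses(blob: bytes) -> List[bytes]:
    """Split a concatenated dump into chunks that each start with MARKER.

    Any bytes before the first marker are dropped.
    """
    chunks = []
    start = blob.find(MARKER)
    while start != -1:
        # Look for the next marker after the current one.
        nxt = blob.find(MARKER, start + len(MARKER))
        chunks.append(blob[start:nxt] if nxt != -1 else blob[start:])
        start = nxt
    return chunks


if __name__ == "__main__":
    # Synthetic input, not taken from the captured data.
    demo = b"junkRCC\x01abcRCC\x02de"
    for chunk in split_responses(demo):
        print(len(chunk), chunk[:8])
```

To run it on the real capture, read the file in binary mode, for example `split_responses(open("RESPONSES", "rb").read())`, adjusting the path to wherever the THIRD folder was extracted.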

meetDeveloper commented 6 years ago

@cfsmp3 I did some research. They are most probably CEA-608 or CEA-708 subtitles, as the networks that YouTube TV supports send captions for live TV using these two standards. Google has also created the ExoPlayer library for media players, which has functionality to extract subtitles from a rawcc file; look at this: https://google.github.io/ExoPlayer/doc/reference/index.html?com/google/android/exoplayer2/extractor/rawcc/package-summary.html. So most probably YouTube is getting the live captions provided by the network, which are in either CEA-608 or CEA-708, then decoding and displaying them.

cfsmp3 commented 6 years ago

I'd expect them to be CEA-608 and 708 at their source, but I'd say it's unlikely (but I wouldn't bet either way) that that's what the players actually get. I guess we'll need to analyze data dumps to find out for sure.

meetDeveloper commented 6 years ago

Look at this ExoPlayer package that Google is experimenting with: https://google.github.io/ExoPlayer/doc/reference/index.html?com/google/android/exoplayer2/extractor/rawcc/package-summary.html. It can extract RawCC.

saurabhshri commented 6 years ago

@meetDeveloper This looks like exactly the right thing. Glancing at the code, I can see it looks for RCC as the header ID, which matches the captured data. So we know we're definitely going in the right direction.

Did you try to see if you're able to get the matching transcription from the dumps?

meetDeveloper commented 6 years ago

@saurabhshri Yeah, I looked at the code earlier; they are looking for RCC in the header. I am currently analyzing it and running experiments, and hopefully we will get a lead from here. I think we are going in the right direction.

meetDeveloper commented 6 years ago

@saurabhshri See this dump of rawcc that they use. The extension was rawcc, and it is exactly what we got. I am now sure they are the same. Please tell me what you think. sample.txt

meetDeveloper commented 6 years ago

@saurabhshri @cfsmp3 I would like to do this as my summer project in GSoC. Could you tell me the steps?

cfsmp3 commented 6 years ago

@meetDeveloper Come find us on slack :-)

cfsmp3 commented 4 years ago

Closing, since YouTube now supports most of the formats we export.