Closed saurabhshri closed 4 years ago
I would like to work on this issue, but I currently don't have a YouTube TV subscription. I think I will be able to scrape the subtitles once I can look into how they work.
From what we're seeing - this is definitely not like scraping a web site.
Yeah, maybe. I would like to try it out and see the internal workings. I went to the website to get the free trial, but they ask for a card and I don't have one.
@meetDeveloper you can collaborate with @saurabhshri for now... work with him for the initial research.
While we will provide a youtube subscription to accepted GSoC students, we cannot do it for anyone else I'm afraid.
Have students for GSoC already been accepted?
@saurabhshri I wanted to study the source. I think I might be able to decode how live shows are captioned. How should I go about it?
If you'd check out the GSoC timeline, you'd notice that we're about a month away from starting to receive the proposals for GSoC, let alone accept students.
@meetDeveloper Sure! DM me on Slack. Same username.
If someone wants to look at the raw data that is decoded for the captions, I am attaching it all below.
I played the live shows and captured the raw caption data received by the player, along with the captions it displayed during that time period. Decoding the response should yield at least some of the text in transcription.
The first two folders, named FIRST and SECOND, contain multiple files, each holding a separate request/response object with everything there is to know about that transaction. The TRANSCRIPTION files contain the captions that were displayed, typed exactly as seen.
The THIRD folder contains two files. RESPONSES contains just the binary rawcc data that was sent, with each response marked by RCC at the beginning. The captions displayed during that time are in the TRANSCRIPTION file.
The captions in the first two directories are broadcast with a delay (probably captioned live), while in the third they are perfectly in sync.
https://drive.google.com/open?id=1H_npv96SpiJJhAbFtaJ9HMV7V1SfAg5r
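For anyone who wants to poke at the RESPONSES file from the THIRD folder, here is a minimal sketch of splitting it back into individual responses. It assumes only what is described above, namely that each response starts with the bytes RCC. The function name is made up for illustration, and a payload that happens to contain the same three bytes would be split incorrectly, so treat the output as a first approximation.

```python
from typing import List

MARKER = b"RCC"  # marker described above; everything else here is an assumption


def split_responses(blob: bytes, marker: bytes = MARKER) -> List[bytes]:
    """Split a concatenated dump into chunks that each start with the marker."""
    chunks = []
    start = blob.find(marker)
    while start != -1:
        # Look for the next marker after the current one.
        nxt = blob.find(marker, start + len(marker))
        end = nxt if nxt != -1 else len(blob)
        chunks.append(blob[start:end])
        start = nxt
    return chunks
```

Usage would be something like `split_responses(open("RESPONSES", "rb").read())`, then comparing the number of chunks against the number of caption lines in TRANSCRIPTION.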
@cfsmp3 I did some research. They are most probably CEA-608 or CEA-708 subtitles, since the networks YouTube TV supports send captions for live TV using these two standards. Google has also created the ExoPlayer library for media players, which has functionality to extract subtitles from a rawcc file; look at this: https://google.github.io/ExoPlayer/doc/reference/index.html?com/google/android/exoplayer2/extractor/rawcc/package-summary.html. So most probably what YouTube is doing is taking the live captions provided by the network, in either CEA-608 or CEA-708, then decoding and displaying them.
I'd expect them to be CEA-608 and 708 at their source, but I'd say it's unlikely (but I wouldn't bet either way) that that's what the players actually get. I guess we'll need to analyze data dumps to find out for sure.
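One cheap check that could be run on the dumps: CEA-608 bytes carry odd parity in the high bit, so a payload made mostly of 608 byte pairs should show far more odd-parity bytes than random data (which sits around 50%). The sketch below just measures that ratio; the function names are hypothetical, and any container bytes (headers, timestamps, counts) in the response would pull the number down, so this is a hint rather than proof.

```python
def has_odd_parity(byte: int) -> bool:
    """True if the 8-bit value has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") % 2 == 1


def odd_parity_ratio(payload: bytes) -> float:
    """Fraction of bytes in the payload that have odd parity."""
    if not payload:
        return 0.0
    return sum(has_odd_parity(b) for b in payload) / len(payload)
```

A ratio close to 1.0 on the caption portion of a response would support the CEA-608 guess; a ratio near 0.5 would suggest the players receive something else.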
Look at this ExoPlayer library that Google is experimenting with: https://google.github.io/ExoPlayer/doc/reference/index.html?com/google/android/exoplayer2/extractor/rawcc/package-summary.html. It can extract RawCC.
@meetDeveloper This looks like just the right thing. Glancing at the code, I can see it looks for RCC as the header ID, which matches the captured data. So we know we're definitely going in the right direction.
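To compare captured files against that header quickly, something like the following could be used. It only checks the three-byte RCC prefix mentioned above and prints the first few bytes in hex; whatever follows the prefix is not interpreted, since its layout has not been confirmed in this thread. Both helper names are made up for this sketch.

```python
def starts_with_rcc(data: bytes) -> bool:
    """True if the data begins with the RCC header ID seen in the captures."""
    return data[:3] == b"RCC"


def hex_preview(data: bytes, length: int = 16) -> str:
    """Space-separated hex of the first `length` bytes, for eyeballing headers."""
    return " ".join(f"{b:02x}" for b in data[:length])
```

Running `hex_preview` over several responses side by side should make it easier to spot which bytes after the prefix stay constant and which change between responses.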
Did you try to see if you're able to get the matching transcription from the dumps?
@saurabhshri Yeah, I looked at the code earlier; they are looking for RCC in the header. I am currently analyzing it and running experiments, so hopefully we will get a lead from here. I think we are going in the right direction.
@saurabhshri See this dump of rawcc that they use. The extension was rawcc, and this is exactly what we got too. I am now sure they are the same; please tell me what you think. sample.txt
@saurabhshri @cfsmp3 I would like to do this as my summer project in GSoC. Could you tell me the steps?
@meetDeveloper Come find us on slack :-)
Closing, since YouTube now supports most of the formats we export.
CCExtractor version (using the --version parameter preferably) : 0.87
Necessary information
What were the used arguments?
-autoprogram
One of the GSoC ideas we're planning is adding support for live TV over the internet, and YouTube TV is one of the services. Read more about the project idea here : https://www.ccextractor.org/public:gsoc:livetvooverinternet.
We can use this issue to discuss progress and findings, since there is no official specification yet.
Here's the summary of the research so far:
The DVR captions are fetched by the video player sending a request to /api/timedtext, and the response is an XML file containing the timed transcription. These transcriptions can be rolled up or normal, which I guess depends on the parameters sent with the request. Here's the sample request (I've replaced sensitive information with the @ symbol), and here's a sample response snippet:
Here's the complete response attached : timedtext.txt.
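Since the timed-text response is plain XML, turning it into cues should be straightforward. Below is a rough sketch using Python's standard library. Note that the element and attribute names (text, start, dur) and the SAMPLE string are placeholders invented for this example, not taken from the attached response; they would need to be adjusted to whatever timedtext.txt actually contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real schema must be read from the attached response.
SAMPLE = """<transcript>
  <text start="1.5" dur="2.0">Hello there</text>
  <text start="3.5" dur="1.25">General update</text>
</transcript>"""


def parse_timed_text(xml_text, tag="text", start_attr="start", dur_attr="dur"):
    """Return a list of (start_seconds, end_seconds, caption) tuples."""
    root = ET.fromstring(xml_text)
    cues = []
    for node in root.iter(tag):
        start = float(node.get(start_attr, "0"))
        dur = float(node.get(dur_attr, "0"))
        cues.append((start, start + dur, (node.text or "").strip()))
    return cues
```

The tag and attribute names are parameters so the same function can be pointed at the real response once its structure is known.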
I manually sniffed the traffic to obtain this, but of course we don't want to keep a browser open and inspect traffic by hand, and making everything automatic may be hard. [On a related note: we had a GSoC project (created by @abhishek-vinjamoori) two years ago which scraped/obtained subtitles from online services such as Netflix.]
The live captions are fetched by sending a request to /api/manifest/rawcc, and the response is some binary data prefixed by RCC. Here's the sample request (I've replaced sensitive information with the @ symbol), and here's a sample response snippet:
The response is also attached as a binary file here: rawcc.txt.
Several such responses are received over time and the binary data probably contains the captions, but I haven't been able to make much sense of it.
After discussing with Carlos, I tried stripping the parity bit from each byte and showing the result as ASCII (applying & 0x7f to each of them), but the result did not make much sense to me; maybe I did something wrong. I'd be happy to provide more such received responses.
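For reference, the parity-stripping experiment amounts to roughly the following. This is a sketch of the idea rather than the exact script that was used: it masks every byte with 0x7f and replaces anything outside printable ASCII with a dot, so readable runs of text stand out. The function names are just for illustration.

```python
def strip_parity(payload: bytes) -> bytes:
    """Clear the high (parity) bit of every byte."""
    return bytes(b & 0x7F for b in payload)


def printable_view(payload: bytes) -> str:
    """Parity-stripped payload as text, with non-printable bytes shown as '.'."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else "."
        for b in strip_parity(payload)
    )
```

If the payload interleaves caption bytes with other fields, the text would show up as short fragments separated by dots rather than as whole sentences, which may be why the raw output looked like noise.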
I also tried looking at how they handle this response, and they are probably (not sure) doing it using: https://s.ytimg.com/yts/jsbin/player_ias-vfliDr87C/en_GB/captions.js
I tried going through it and set a few breakpoints, but the JS is not documented at all. All the variable names are single letters and I don't really know (and don't have adequate time to figure out) what's going on there.
The XML for DVR captions can also be understood from the JS linked above.
Please feel free to add your findings and correct me if I assumed something wrong.