blackjack4494 / youtube-dlc

Command-line program to download various media from YouTube.com and other sites
https://blackjack4494.github.io/youtube-dlc/
The Unlicense

[YouTube] Rewrite #37

Open blackjack4494 opened 4 years ago

blackjack4494 commented 4 years ago

There are so many issues with the YouTube extractor. One thing that always bothers me is when a component (the YouTube extractor in this case) is constantly patched just to keep it working. At some point it is better to rewrite that component.

I found a few key points that (in my opinion) will improve the extractor:

  1. using pbj=1 in query parameters
  2. I forgot the second point
  3. rate limiting

Why would we want to use (1) pbj? When you are browsing YouTube and click on another video, this query parameter is automatically appended. The request then downloads only a JSON document with the necessary information about the new video instead of a full page. The JSON is about 50-400 kB (avg ~100-150 kB), while a full response can be multiple MB. This also mimics natural user behaviour, so it should be less likely to trigger flags on YouTube's side.
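A minimal sketch of point (1), using only the standard library: it sets pbj=1 on an existing watch URL. The helper name `add_pbj` is made up for illustration and is not part of the extractor.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def add_pbj(url: str) -> str:
    """Return `url` with pbj=1 in its query string (hypothetical helper)."""
    parts = urlparse(url)
    # Drop any existing pbj value so the parameter appears exactly once.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "pbj"]
    query.append(("pbj", "1"))
    return urlunparse(parts._replace(query=urlencode(query)))


print(add_pbj("https://www.youtube.com/watch?v=mhmGwTDpPf0"))
```

This only rewrites the URL; the headers and cookies that the request also needs are a separate concern.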

(3) Rate limiting. Yes, I have read a lot about the infamous 429 error many people are getting. (1) may lead to less strict rate limiting. However, rate limiting on our side is definitely needed, or at least some fallback method (queuing failed videos with a max retry counter?). I saw some approaches that use segmented / chunked downloads with certain parameters to avoid or reduce the likelihood of a 429. This is another way to mimic natural user behaviour (a video buffers specific chunks and won't load fully, for obvious reasons). Google/YouTube is aware of automated scrapers and downloaders and tries to make things harder for them.
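The "queue failed videos with a max retry counter" idea could be sketched roughly as follows. Everything here is an assumption for illustration: `RateLimited`, `download_all`, the `fetch` callback, and the exponential back-off delays are hypothetical and not existing youtube-dlc code.

```python
import time
from collections import deque


class RateLimited(Exception):
    """Hypothetical error that fetch() raises on an HTTP 429 response."""


def download_all(video_ids, fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Try every video; requeue rate-limited ones up to max_retries times."""
    queue = deque((vid, 0) for vid in video_ids)
    done, failed = [], []
    while queue:
        vid, attempts = queue.popleft()
        try:
            fetch(vid)
            done.append(vid)
        except RateLimited:
            attempts += 1
            if attempts > max_retries:
                failed.append(vid)  # give up on this video
            else:
                # Back off before continuing: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempts - 1))
                queue.append((vid, attempts))  # retry after the others
    return done, failed
```

Requeuing at the back of the queue means a rate-limited video does not block the rest, and passing `sleep` as a parameter makes the back-off easy to replace in tests.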

I may be able to do this on my own, but I wanted to share the idea, so help is definitely wanted. It may be easier to start with a simple proof of concept (PoC), such as a single Python file (excluding the necessary utils).

blackjack4494 commented 4 years ago

This is the absolute minimum needed to retrieve all the information.

curl 'https://www.youtube.com/watch?v=mhmGwTDpPf0&pbj=1' \
  -H 'x-youtube-client-name: 1' \
  -H 'x-youtube-client-version: 2.20200903.02.02' \
  -H 'cookie: VISITOR_INFO1_LIVE=REDACTED; YSC=REDACTED' \
  --compressed -v

The VISITOR_INFO1_LIVE and YSC cookies are set when you send a simple request to https://www.youtube.com

curl -v -I 'https://www.youtube.com'

< Set-Cookie: YSC=REDACTED; path=/; domain=.youtube.com; secure; httponly; samesite=None
< Set-Cookie: VISITOR_INFO1_LIVE=REDACTED; path=/; domain=.youtube.com; secure; expires=Tue, 02-Mar-2021 21:40:12 GMT; httponly; samesite=None
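For a Python proof of concept, the two steps above might look roughly like this sketch. It parses Set-Cookie values with the standard library's `http.cookies.SimpleCookie` and builds (but does not send) the pbj request with the same headers as the curl command. The function names are hypothetical, the cookie values in the usage lines are placeholders, and the client version is simply the one from the curl example, which may no longer be accepted.

```python
from http.cookies import SimpleCookie
from urllib.request import Request


def cookie_header(set_cookie_values, wanted=("VISITOR_INFO1_LIVE", "YSC")):
    """Build a Cookie header value from Set-Cookie header values."""
    jar = {}
    for value in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(value)
        for name, morsel in parsed.items():
            if name in wanted:
                jar[name] = morsel.value
    return "; ".join(f"{name}={jar[name]}" for name in wanted if name in jar)


def build_watch_request(video_id, cookies, client_version="2.20200903.02.02"):
    """Prepare the pbj=1 watch request; the caller decides when to send it."""
    return Request(
        f"https://www.youtube.com/watch?v={video_id}&pbj=1",
        headers={
            "x-youtube-client-name": "1",
            "x-youtube-client-version": client_version,
            "cookie": cookies,
        },
    )


# Placeholder values standing in for the REDACTED cookies above.
headers = [
    "YSC=abc; path=/; domain=.youtube.com; secure; httponly; samesite=None",
    "VISITOR_INFO1_LIVE=xyz; path=/; domain=.youtube.com; secure; httponly; samesite=None",
]
request = build_watch_request("mhmGwTDpPf0", cookie_header(headers))
print(request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`; unlike curl's `--compressed`, this sketch does not ask for a compressed response.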

CeruleanSky commented 3 years ago

As long as rewriting [youtube] is under discussion, perhaps more emphasis could be placed on downloading and parsing Google's DASH manifests. This would allow recording from the beginning of livestreams, or at least from the earliest time YouTube has available for the stream.
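As a rough illustration of what parsing a DASH manifest involves, here is a sketch that lists the representations in an MPD using only the standard library. The sample manifest is made up and far simpler than what YouTube actually serves, and `list_representations` is a hypothetical helper; real livestream manifests also carry segment timelines that this does not touch.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the MPEG-DASH MPD schema.
MPD_NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

# Made-up sample manifest, for illustration only.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="dynamic">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="v1080" bandwidth="4000000" width="1920" height="1080"/>
      <Representation id="v720" bandwidth="2000000" width="1280" height="720"/>
    </AdaptationSet>
    <AdaptationSet mimeType="audio/mp4">
      <Representation id="a128" bandwidth="128000"/>
    </AdaptationSet>
  </Period>
</MPD>"""


def list_representations(mpd_xml):
    """Return all representations in the MPD, highest bandwidth first."""
    root = ET.fromstring(mpd_xml)
    reps = []
    for aset in root.iterfind(".//mpd:AdaptationSet", MPD_NS):
        for rep in aset.iterfind("mpd:Representation", MPD_NS):
            reps.append({
                "id": rep.get("id"),
                # mimeType may sit on the representation or its adaptation set.
                "mime_type": rep.get("mimeType") or aset.get("mimeType"),
                "bandwidth": int(rep.get("bandwidth", 0)),
                "height": int(rep.get("height", 0)) or None,
            })
    return sorted(reps, key=lambda r: r["bandwidth"], reverse=True)


for rep in list_representations(SAMPLE_MPD):
    print(rep["id"], rep["mime_type"], rep["bandwidth"], rep["height"])
```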

https://github.com/ytdl-org/youtube-dl/issues/21255#issuecomment-504703485 shows a workaround from last year that used ffmpeg to decode the MPD.

There has also been talk of using ffmpeg more for seeking, to fetch a particular time range of a video without having to download the whole thing.
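A sketch of what such seeking could look like, assuming the media URL has already been resolved: it only builds an ffmpeg argument list using the standard `-ss`, `-t`, and `-c copy` options. The helper name and the example URL are placeholders, and whether ffmpeg can actually skip the unwanted data depends on the container and on the server.

```python
def ffmpeg_clip_command(source_url, start, duration, output):
    """Build an ffmpeg command that copies `duration` seconds from `start`."""
    return [
        "ffmpeg",
        "-ss", str(start),     # placed before -i, so ffmpeg seeks in the input
        "-i", source_url,
        "-t", str(duration),   # limit the output to this many seconds
        "-c", "copy",          # copy the streams without re-encoding
        output,
    ]


print(ffmpeg_clip_command("https://example.invalid/video.mp4", 60, 30, "clip.mp4"))
```

The list can be handed to `subprocess.run(cmd, check=True)`. With stream copy the cut points snap to keyframes, so the clip may not start at exactly the requested time.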

This has been a long-standing issue since 2013 (https://github.com/ytdl-org/youtube-dl/issues/622), which also had some suggested fixes along the way.

I am not sure whether the current ffmpeg supports whatever Google has been up to lately, though.

Streamlink also ran into issues with YouTube's DASH earlier this year: https://github.com/streamlink/streamlink/issues/2936

It seems it will become more and more difficult to record higher-resolution streams, and perhaps other things as well, as Google moves more features from HLS to its custom DASH.