Tyrrrz / YoutubeExplode

Abstraction layer over YouTube's internal API
MIT License

Some formats often can't be downloaded (example itag 136) #157

Closed (Termiiii closed 6 years ago)

Termiiii commented 6 years ago

I started using YoutubeExplode a few days ago (I think I have the newest version). I was using youtube-dl before, but I switched because your code is more readable (mainly because I prefer C# over Python).

From what I can tell, you get the download links of videos the same way youtube-dl does. But youtube-dl won't try to download videos from their URLs if those URLs don't work. Some video formats are (I think always) uploaded as DASH segments (many small parts that have to be combined to get the full video). The two formats I have the most experience with are 136 (DASH mp4 720p, video only) and 137 (DASH mp4 1080p, video only).

These formats sometimes have a working download URL and sometimes do not. If you want to download these formats consistently, you need to download the DASH segments and combine them locally (youtube-dl does that). If I try to download video format 136 of this video: https://www.youtube.com/watch?v=Lhw5xo67tdE with youtube-dl, it works. With YoutubeExplode, it does not.

I hope you can implement such a feature as well.

Here is the code I used to compare YoutubeExplode's behavior for DASH format 136 with its behavior for the normal format 22.

using System;
using YoutubeExplode;
using YoutubeExplode.Models.MediaStreams;

namespace ConsoleApp1{
    class Program{

        static void Main(string[] args){
            Program program = new Program();
            program.testFormats();
            Console.ReadLine();
        }

        public async void testFormats() {
            var videoId = "Lhw5xo67tdE";
            var client = new YoutubeClient();

            //getting information about all available formats (and more)
            var streamInfoSet = await client.GetVideoMediaStreamInfosAsync(videoId);
            var infos = streamInfoSet.GetAll();

            //walk through each individual format available
            foreach (MediaStreamInfo info in infos){
                //print the DownloadURL of format 22 (if available)
                if (info.Itag == 22){
                    Console.WriteLine("Download URL for format 22:");
                    Console.WriteLine(info.Url + "\n");
                }
                //print the DownloadURL of DASH-format 136 (if available)
                if (info.Itag == 136){
                    Console.WriteLine("Download URL for format 136:");
                    /*this URL sometimes works, sometimes does not. copy it into 
                    **your browser and test it. If it does not work in my browser, I cannot 
                    **download it either (I guess this behavior should be similar for you)*/
                    Console.WriteLine(info.Url + "\n"); 
                }
            }
        }
    }
}
Tyrrrz commented 6 years ago

Hi, those segmented DASH streams that need to be combined, YoutubeExplode actually purposely skips them. https://github.com/Tyrrrz/YoutubeExplode/blob/c85bf3bc4b8ef12f8152372bf2a7a5aa2c84dc5a/YoutubeExplode/YoutubeClient.Video.cs#L399

Can you explain why you would want to download those streams?

Termiiii commented 6 years ago

For my project, I need to download (certain) new videos as fast as possible and analyze them with TensorFlow. The quicker, the better.

Currently, I try to download either format 136 (video-only mp4 720p) or format 22 (muxed mp4 720p) from the download URL (obtained from youtube-dl) using ffmpeg. I use ffmpeg to download the video in multiple .mp4 parts, each 1 minute long. That way TensorFlow can start working after around 2 seconds. Some YouTube videos are multiple hours long, so it would be a waste if TensorFlow had to wait a few minutes before it could start.
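A minimal sketch of the one-minute splitting described above, assuming ffmpeg is on the PATH and that its segment muxer is used with stream copy. The names `SegmentedDownload`, `BuildArguments`, and `Run` are hypothetical and only for illustration; this is not necessarily the exact command line used in the project.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of YoutubeExplode or youtube-dl.
public static class SegmentedDownload
{
    // Builds ffmpeg arguments that copy the input (no re-encoding) into
    // consecutive files of roughly segmentSeconds each.
    public static string BuildArguments(string streamUrl, int segmentSeconds, string outputPattern)
    {
        return $"-i \"{streamUrl}\" -c copy -f segment -segment_time {segmentSeconds} " +
               $"-reset_timestamps 1 \"{outputPattern}\"";
    }

    // Starts ffmpeg (assumed to be on the PATH), waits for it, and returns its exit code.
    public static int Run(string streamUrl, int segmentSeconds, string outputPattern)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "ffmpeg",
            Arguments = BuildArguments(streamUrl, segmentSeconds, outputPattern),
            UseShellExecute = false
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

Calling `SegmentedDownload.Run(url, 60, "part%03d.mp4")` would write `part000.mp4`, `part001.mp4`, and so on. Because the streams are copied rather than re-encoded, ffmpeg can only cut at keyframes, so each part is approximately (not exactly) one minute long.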

I could speed my project up by a lot if I could download format 136 into several .mp4 parts. Since format 136 often has no working download URL, I need to wait until format 22 is available, which slows everything down by minutes on average.

Currently I am trying to learn Python in order to understand what youtube-dl does. But some insight from you would be awesome as well. I read your guide on reverse engineering YouTube half a year ago; that's why I am here. I initially thought YoutubeExplode would be able to download DASH formats. As said before, I heavily prefer C# over Python.

https://github.com/rg3/youtube-dl/blob/e06632e3fe25036b804a62469bb18fa4c37e3368/youtube_dl/downloader/dash.py

Termiiii commented 6 years ago

Those segmented DASH streams that need to be combined, YoutubeExplode actually purposely skips them.

Correct me if I am wrong. Only streams that have a working BaseURL, and therefore don't need to be downloaded in parts, should be returned by GetVideoMediaStreamInfosAsync(videoId), right?

var streamInfoSet = await client.GetVideoMediaStreamInfosAsync(videoId);
// Skip partial streams
if (streamXml.Descendants("Initialization").FirstOrDefault()?.Attribute("sourceURL")?.Value
            .Contains("sq/") == true)
    continue;

// Extract values
var itag = (int) streamXml.Attribute("id");
var url = (string) streamXml.Element("BaseURL");
var bitrate = (long) streamXml.Attribute("bandwidth");

I can assure you that streams with non-working BaseURLs (like format 136 of some videos) are returned by GetVideoMediaStreamInfosAsync(videoId). For example, this code would throw an unhandled exception:

public async void testStuff() {
    var videoId = "aPEFSbW0-po"; //2 min video
    var client = new YoutubeClient();

    //getting information about all available formats (and more)
    var streamInfoSet = await client.GetVideoMediaStreamInfosAsync(videoId);
    var infos = streamInfoSet.GetAll();

    //walk through each individual format available
    foreach (MediaStreamInfo info in infos){
        //print and download the BaseURL of format 136 (if available)
        if (info.Itag == 136){
            Console.WriteLine("Download URL for format 136:");
            Console.WriteLine(info.Url + "\n");
            Console.WriteLine("Downloading...");
            await client.DownloadMediaStreamAsync(info, "D:\\test.mp4");
        }
    }
}

Exception: System.Net.Http.HttpRequestException: "Response status code does not indicate success: 404 (Not Found)." at the line: await client.DownloadMediaStreamAsync(info, "D:\\test.mp4");

Tyrrrz commented 6 years ago

Yes, that's the idea. Although it's possible to implement downloading of such streams, I've never seen the point since there were easier alternatives.

If it throws an exception then it is indeed a bug, and it looks like the if condition that checks whether a stream is partial needs to be updated.

Tyrrrz commented 6 years ago

Btw, is there a reason you are specifically interested in itag 136 and 137? Is it because it appears sooner than other formats?

Termiiii commented 6 years ago

Btw, is there a reason you are specifically interested in itag 136 and 137? Is it because it appears sooner than other formats?

That's exactly it.

The only reason I am interested in 137 is that videos are sometimes uploaded in 1080p, and that format is available as soon as the video is made public. I don't think I ever had a broken BaseURL for format 137. So if format 136 does not work, I use format 137 (if available). Otherwise I need to wait until either format 137 or format 22 is uploaded.

The above-mentioned videoId "aPEFSbW0-po" only has a broken BaseURL for formats 133, 135 and 136. I don't know if I ever had a broken BaseURL for format 134.

Tyrrrz commented 6 years ago

Okay, I see. So, first of all, I need to fix the filtering logic so that partial streams are not shown, because currently, YTE doesn't know how to download them. Then, I suppose, a separate issue needs to be created to add support for downloading of partial streams.

Termiiii commented 6 years ago

Wow, awesome that you are considering putting my problem on your TODO list. If I make progress myself, I will tell you ASAP. Don't expect anything soon though; I am struggling with small things more than I should.

Tyrrrz commented 6 years ago

Okay I just tested and I couldn't reproduce this. On the video you mentioned I couldn't find itag 136 (after many retries), but I tried with itag 137 and it downloaded perfectly fine.

Termiiii commented 6 years ago

I tried with itag 137 and it downloaded perfectly fine.

As said above, I didn't have problems with format 137 either (I put it on the same list as 136 because it might cause problems, since they are very similar). The formats (itags) that I regularly have or had problems with are 133, 135 and 136.

On the video you mentioned I couldn't find itag 136 (after many retries)

That's odd, I seem to find format 136 every time when using YoutubeExplode. I found format 136 exactly 100 out of 100 times. Here is the (slow) code:

static void Main(string[] args){
    Program program = new Program();

    for (int i = 0; i < 100; i++) {
        program.printAllFormats();
    }

    Console.ReadLine();
}

//prints all formats(itags) that are available for the video (defined in the method)
public async void printAllFormats(){
    var videoId = "aPEFSbW0-po"; //2 min video  
    var client = new YoutubeClient();
    String s = "";

    //getting information about all available formats (and more)
    var streamInfoSet = await client.GetVideoMediaStreamInfosAsync(videoId);
    var infos = streamInfoSet.GetAll();

    //print which formats are available
    foreach (MediaStreamInfo info in infos){
        s += info.Itag + "  ";
    }
    Console.WriteLine(s);
}

But I am aware of that kind of behaviour (finding different itags when sending multiple requests). The developer of youtube-dl commented the following:

We also try looking in get_video_info since it may contain different dashmpd URL that points to a DASH manifest with possibly different itag set (some itags are missing from DASH manifest pointed by webpage's dashmpd, some - from DASH manifest pointed by get_video_info's dashmpd). The general idea is to take a union of itags of both DASH manifests (for example video with such 'manifest behavior' see https://github.com/rg3/youtube-dl/issues/6093).

Source (lines 1573-1578): https://github.com/rg3/youtube-dl/blob/e06632e3fe25036b804a62469bb18fa4c37e3368/youtube_dl/extractor/youtube.py
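The "union of itags" idea from that comment could be sketched in C# roughly as follows. This is only an illustration under assumptions: `ManifestStream` and `ManifestMerge.UnionByItag` are hypothetical names, not YoutubeExplode or youtube-dl APIs, and the two input lists stand for the streams parsed from the two DASH manifests.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one stream parsed from a DASH manifest.
public sealed class ManifestStream
{
    public int Itag { get; }
    public string Url { get; }

    public ManifestStream(int itag, string url)
    {
        Itag = itag;
        Url = url;
    }
}

public static class ManifestMerge
{
    // Returns one entry per itag found in either manifest, ordered by itag.
    // When both manifests contain the same itag, the entry from the first one wins.
    public static List<ManifestStream> UnionByItag(
        IEnumerable<ManifestStream> first,
        IEnumerable<ManifestStream> second)
    {
        return first.Concat(second)
            .GroupBy(stream => stream.Itag)
            .Select(group => group.First())
            .OrderBy(stream => stream.Itag)
            .ToList();
    }
}
```

Preferring the first manifest on a conflict is an arbitrary choice made for this sketch; the quoted comment only says that the itag sets are combined.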

In fact, if you try to download these DASH segments, you might want to keep in mind another thing that developer said:

YouTube may often return 404 HTTP error for a fragment causing the whole download to fail. However if the same fragment is immediately retried with the same request data this usually succeeds (1-2 attemps is usually enough) thus allowing to download the whole file successfully. To be future-proof we will retry all fragments that fail with any HTTP error.

Source (lines 54-59): https://github.com/rg3/youtube-dl/blob/e06632e3fe25036b804a62469bb18fa4c37e3368/youtube_dl/downloader/dash.py
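In C#, the retry behaviour described in that comment might look something like the sketch below. It is not youtube-dl's code and not part of YoutubeExplode: `FragmentRetry`, `RetryAsync`, and `DownloadFragmentAsync` are hypothetical names, and the default of 3 attempts is an assumption based on the "1-2 attempts" remark.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper for retrying fragment downloads that fail with an HTTP error.
public static class FragmentRetry
{
    // Runs the operation up to maxAttempts times, retrying immediately on
    // HttpRequestException. The exception from the last attempt is rethrown.
    public static async Task<T> RetryAsync<T>(Func<Task<T>> operation, int maxAttempts)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Swallow the error and try the same request again.
            }
        }
    }

    // Downloads one fragment, retrying when the server answers with an error status.
    public static Task<byte[]> DownloadFragmentAsync(HttpClient http, string fragmentUrl, int maxAttempts = 3)
    {
        return RetryAsync(async () =>
        {
            using (var response = await http.GetAsync(fragmentUrl))
            {
                // Throws HttpRequestException for any non-success status (e.g. 404).
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsByteArrayAsync();
            }
        }, maxAttempts);
    }
}
```

The retry is immediate and applies to any HTTP error, matching the quoted comment; adding a delay between attempts would be a separate design decision.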

Tyrrrz commented 6 years ago

Hm. I will have to keep trying then. I know sometimes I was able to get different itag sets one day apart.

Termiiii commented 6 years ago

I just updated my above comment one last time.

I know sometimes I was able to get different itag sets one day apart.

I previously wrote my own Youtube Downloader. But the behavior you mentioned made me switch to youtube-dl. YouTube is trying really hard to make reverse engineering hard. It is really nerve-wracking dealing with all these issues YouTube is throwing at you.

Tyrrrz commented 6 years ago

I finally caught itag 136 and it didn't have a working URL. It was actually not inside the DASH manifest, but rather inside the embedded adaptive streams, so it shouldn't be partial. Also, the content length property of that stream was 0, so that was a good giveaway that the stream was faulty. I'm not sure if it's an error on YouTube's side, but it has happened before that some streams just don't work.
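As a rough illustration of using that giveaway, streams that report a content length of 0 could be dropped before they are offered for download. This is only a sketch under assumptions: `CandidateStream` and `StreamFilter.DropEmpty` are hypothetical names, not the actual YoutubeExplode types or the fix that was applied.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a parsed stream; not a YoutubeExplode type.
public sealed class CandidateStream
{
    public int Itag { get; }
    public string Url { get; }
    public long ContentLength { get; }

    public CandidateStream(int itag, string url, long contentLength)
    {
        Itag = itag;
        Url = url;
        ContentLength = contentLength;
    }
}

public static class StreamFilter
{
    // Keeps only streams that report a positive content length.
    public static List<CandidateStream> DropEmpty(IEnumerable<CandidateStream> streams)
    {
        return streams.Where(stream => stream.ContentLength > 0).ToList();
    }
}
```

Whether a zero content length always means a broken stream is not established here; it was just a reliable indicator in this one case.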