Tyrrrz / YoutubeExplode

Abstraction layer over YouTube's internal API
MIT License

Links are cut off in video description from GetVideoInfoAsync #160

Closed SlowLogicBoy closed 5 years ago

SlowLogicBoy commented 5 years ago

For example, in this video description there are quite a few links, but they are cut off. I would like to somehow get those hyperlinks. Maybe something like a string RawDescription property, without that .TextEx() thingy? Raw HTML is fine with me.

Tyrrrz commented 5 years ago

If I remember correctly, they are cut off by YouTube and the href is rewritten to go through YouTube's redirect.

SlowLogicBoy commented 5 years ago

I see. I'll need to debug this to see whether I can get something out of it; if not, oh well.

SlowLogicBoy commented 5 years ago

// Rewrites YouTube redirect links (/redirect?...&q=<encoded url>) in the
// description so that each anchor points directly at the original URL.
void EscapeYoutubeHyperlinks(IElement descriptionNode)
{
    var hyperLinks = descriptionNode.GetElementsByTagName("a");
    foreach (var hyperlink in hyperLinks.Cast<IHtmlAnchorElement>())
    {
        // Links without a query string are not redirects; leave them alone.
        if (string.IsNullOrWhiteSpace(hyperlink.Search))
            continue;

        // Search starts with '?', so strip it before splitting into key=value parts.
        var queryParts = hyperlink.Search.TrimStart('?')
            .Split('&', StringSplitOptions.RemoveEmptyEntries);

        // The original URL is stored percent-encoded in the "q" parameter.
        var url = queryParts.SingleOrDefault(s => s.StartsWith("q="))?.Substring(2);
        if (string.IsNullOrWhiteSpace(url))
            continue;

        // Decode the URL and point both href and data-url at it.
        url = Uri.UnescapeDataString(url);
        hyperlink.Href = url;
        hyperlink.Dataset["url"] = url;
    }
}

// Usage:
var descriptionNode = watchPage.QuerySelector("p#eow-description");
EscapeYoutubeHyperlinks(descriptionNode);

This changes the link from:

<a href="/redirect?redir_token=y3v1wFxmzoIFfMuVsvc86NSD7UF8MTUzOTMzMDQ3M0AxNTM5MjQ0MDcz&amp;q=http%3A%2F%2Freol.jp&amp;event=video_description&amp;v=EFTV3IIjeNw" class="yt-uix-sessionlink  " data-target-new-window="True" data-sessionlink="itct=CDQQ6TgYACITCKXvvoHz_d0CFdOdwQodFM0L1Sj4HUjc8Y2RyLu1qhA" data-url="/redirect?redir_token=y3v1wFxmzoIFfMuVsvc86NSD7UF8MTUzOTMzMDQ3M0AxNTM5MjQ0MDcz&amp;q=http%3A%2F%2Freol.jp&amp;event=video_description&amp;v=EFTV3IIjeNw" target="_blank" rel="nofollow noopener">http://reol.jp</a>

into:

<a href="http://reol.jp" class="yt-uix-sessionlink  " data-target-new-window="True" data-sessionlink="itct=CDQQ6TgYACITCKXvvoHz_d0CFdOdwQodFM0L1Sj4HUjc8Y2RyLu1qhA" data-url="http://reol.jp" target="_blank" rel="nofollow noopener">http://reol.jp</a>

Tyrrrz commented 5 years ago

How is this link rendered now, without the proposed solution?

SlowLogicBoy commented 5 years ago

The current link is: /redirect?redir_token=y3v1wFxmzoIFfMuVsvc86NSD7UF8MTUzOTMzMDQ3M0AxNTM5MjQ0MDcz&q=http%3A%2F%2Freol.jp&event=video_description&v=EFTV3IIjeNw

which is, as you said, a YouTube redirect link.

So I convert that URL to: http://reol.jp

This works because they store the original URL in q=http%3A%2F%2Freol.jp, and I just URL-decode it, which turns it into q=http://reol.jp.
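
For readers who only have the raw href string and no parsed DOM, the same idea can be sketched with nothing but the base class library. This is a minimal sketch, not YoutubeExplode's implementation: the class RedirectLinks and the method ExtractRedirectTarget are hypothetical names, and it assumes the redirect keeps the original URL percent-encoded in a single q parameter, as in the example above.

```csharp
using System;
using System.Net;

static class RedirectLinks
{
    // Hypothetical helper (not part of YoutubeExplode): returns the decoded
    // value of the "q" query parameter, or null when the href carries none.
    public static string ExtractRedirectTarget(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
            return null;

        // Raw markup encodes '&' as "&amp;", so undo the HTML encoding first.
        var decodedHref = WebUtility.HtmlDecode(href);

        // Everything after the first '?' is treated as the query string.
        var queryStart = decodedHref.IndexOf('?');
        if (queryStart < 0)
            return null;

        var query = decodedHref.Substring(queryStart + 1);
        foreach (var part in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Assumption: the original URL lives in the "q" parameter.
            if (part.StartsWith("q=", StringComparison.Ordinal))
                return Uri.UnescapeDataString(part.Substring(2));
        }

        return null;
    }
}
```

Called with an href such as /redirect?redir_token=abc&amp;q=http%3A%2F%2Freol.jp&amp;event=video_description (the token here is a placeholder), this should return http://reol.jp. Note that Uri.UnescapeDataString does not turn '+' into a space, which is usually what you want for a URL.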

Tyrrrz commented 5 years ago

Can you try and see if it works on all videos? I'm particularly interested in videos with really long links in the description. I remember there were two types of encoding that they used.

SlowLogicBoy commented 5 years ago

For this video: https://www.youtube.com/watch?v=2mmZZEUbM4I, using that code, I got:

descriptionNode.GetElementsByTagName("a").Cast<IHtmlAnchorElement>().Select(a => a.Href).ToList():
[0] [string]:"https://djs3rl.com/shop/Like-This"
[1] [string]:"https://itunes.apple.com/au/album/like-this-feat.-krystal-single/id1196696115"
[2] [string]:"https://play.spotify.com/album/0q4Y9yzlqlTwchSYC6VzVq"
[3] [string]:"https://play.google.com/store/music/album?id=Bauz32emv4rdpfyo3xhcxtqpa6e&tid=song-Tmfsybxlezihqdx4nr5d25orj3e"
[4] [string]:"http://classic.beatport.com/release/like-this-dj-edit/1933795"
[5] [string]:"https://www.trackitdown.net/track/s3rl-feat-krystal/like-this-dj-edit/hardcore/11083573.html"
[6] [string]:"https://soundcloud.com/s3rl/like-this-s3rl-feat-krystal"
[7] [string]:"https://osu.ppy.sh/s/566554"
[8] [string]:"https://www.youtube.com/user/SlenderTheMan22"
[9] [string]:"https://www.youtube.com/channel/UCv2mQRbD_rWDMbIL-scMcgw"
[10] [string]:"https://www.instagram.com/kazumi_mai/"
[11] [string]:"https://djs3rl.com/"

There were some cut-off URLs, but since I decode the redirects, I get the full URLs. Note:

Tyrrrz commented 5 years ago

I'm trying to refactor the architecture a bit to make it easier to implement this. Work in progress: https://github.com/Tyrrrz/YoutubeExplode/tree/refactor-parsers

Tyrrrz commented 5 years ago

Done. The format hasn't changed, but the links are now never cut off in the description.