HermanFassett / youtube-scrape

Scrape YouTube searches (API)
MIT License
193 stars, 96 forks

The scraper fails more often recently #33

Closed MhmdSalah closed 3 years ago

MhmdSalah commented 3 years ago

Hello! Recently I noticed that the scraper fails more often. Here is a video showing what I mean: https://www.dropbox.com/s/bfn858cbaqo7b6m/video%20for%20youtube%20scrapper.mov?dl=0 When I have time, I will try to investigate this issue.

MhmdSalah commented 3 years ago

I had some time on my hands and did some tests.

It seems like the html returned now comes in two (or more) forms. I made two files: one contains html that is parsed successfully using your current method, called Goodhtml, and one that fails, called Badhtml.

Here: https://www.dropbox.com/s/mwv8fbe6ctx9y4x/ytseachhtmlexamples.zip?dl=0

MhmdSalah commented 3 years ago

Funny thing is, the Badhtml (the one that fails) has comments about scrapers in it. Take a look at the file and look for the scraper comment markers (the // scraper_data_end marker that shows up in the snippets below).

HermanFassett commented 3 years ago

Alright, yeah it looks like a different format of html.

window["ytInitialData"] = {
   ...
};
window["ytInitialPlayerResponse"] = null;

vs

var ytInitialData = {
   ...
};

So it looks like there either need to be two methods, or a more generic way of grabbing the data from the html file to parse.
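One way to handle the split (a sketch, not the repo's actual code; detectFormat is an illustrative name) is to detect which variant came back before choosing a parsing strategy, keying off the two assignments quoted above:

```javascript
// Sketch: decide which html variant YouTube returned, based on the two
// ytInitialData assignments shown above in this thread.
function detectFormat(html) {
  if (html.includes('window["ytInitialData"]')) return "original";
  if (html.includes("var ytInitialData")) return "reworked";
  return "unknown";
}
```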

cosminadrianpopescu commented 3 years ago

And it also seems that the page parameter is not working anymore. It works only now and then, but more often than not it fails...

HermanFassett commented 3 years ago

And it also seems that the page parameter is not working anymore. It works only now and then, but more often than not it fails...

That's interesting, from what I see, YouTube search results have infinite loading, but the page query still works on their website.

cosminadrianpopescu commented 3 years ago

Yes, but if you try several times, it will eventually fail. It works for me from time to time as well, but eventually it fails.

SoulHarsh007 commented 3 years ago

There is a temporary fix available.

html
  .split("ytInitialData = ")[1]
  .split("</script>")[0]
  .replace("// scraper_data_end", "")
  .replace(/;\s*$/, ""); // strip only the trailing semicolon, not the first ";" inside the JSON

This returns ytInitialData as a string that can be parsed with JSON.parse(). This only works for responses sent in the reworked html format mentioned above.
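Building on the snippet above, a sketch of wrapping the extraction so that a missing marker (the old html variant) or malformed JSON returns null instead of throwing; extractInitialData is an illustrative name, not a function in the repo:

```javascript
// Sketch based on the split/replace snippet above: extract ytInitialData
// from the reworked html and parse it, guarding against absent markers.
function extractInitialData(html) {
  const parts = html.split("ytInitialData = ");
  if (parts.length < 2) return null; // old format; needs the other pattern
  const raw = parts[1]
    .split("</script>")[0]
    .replace("// scraper_data_end", "")
    .replace(/;\s*$/, ""); // strip only the trailing semicolon
  try {
    return JSON.parse(raw);
  } catch (e) {
    return null; // markup changed again or the JSON was truncated
  }
}
```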

cosminadrianpopescu commented 3 years ago

Yes, but keep in mind that at the moment youtube also returns pages in the old version. So basically, you need to test first which version you have and then apply one pattern or the other.

        const txt = html.replace(/[\n\r]/g, '');
        const p1 = /^.*ytInitialData"[ ]*\][ ]*=[ ]*(.*);[ ]*window[ ]*\["ytInitialPlayerResponse"\].*$/;
        const p2 = /^.*ytInitialData[ ]*=[ ]*(.*);[ ]*\/\/ scraper_data_end.*$/;
        let jsonTxt = txt.replace(p1, '$1');
        if (!txt.match(p1)) {
            jsonTxt = txt.replace(p2, '$1');
        }
        const json = JSON.parse(jsonTxt);

HermanFassett commented 3 years ago

Yes, but keep in mind that at the moment youtube also returns pages in the old version. So basically, you need to test first which version you have and then apply one pattern or the other.

Thanks, I've taken the initial idea here and made a commit to fix the issue. There is a PR #35 to merge into develop once I'm able to test later.

SoulHarsh007 commented 3 years ago

@HermanFassett you can have a look at: https://github.com/SoulHarsh007/youtube-scrape/blob/master/scraper.js, Hope it helps you 😄

MhmdSalah commented 3 years ago

@HermanFassett you can have a look at: https://github.com/SoulHarsh007/youtube-scrape/blob/master/scraper.js, Hope it helps you 😄

I have to say, I love how you optimized this code. Very elegant.

cosminadrianpopescu commented 3 years ago

I see that in the new implementation you only return the first page of the results? Because my problem right now is that the page parameter is not working anymore.

FashionCStar commented 3 years ago

https://github.com/FashionCStar/youtube-scrape I also customized your scraper and now it works well on page 1, but page 2 returns the same results as page 1

FashionCStar commented 3 years ago

@HermanFassett When I run your project on my local, I am getting blank results. {"results":[],"version":"0.1.1","parser":"json_format"}

so I changed the version to 0.1.2 but after a few requests, it returns blank results again. {"results":[],"version":"0.1.2","parser":"json_format"}

HermanFassett commented 3 years ago

@HermanFassett When I run your project on my local, I am getting blank results. {"results":[],"version":"0.1.1","parser":"json_format"}

so I changed the version to 0.1.2 but after a few requests, it returns blank results again. {"results":[],"version":"0.1.2","parser":"json_format"}

Is there a specific example that causes it to return blank? Or just repeated requests? Any exception message you could see? I couldn't get it to return blank after a few minutes of testing.

HermanFassett commented 3 years ago

@cosminadrianpopescu @FashionCStar okay, will check out pages in issue #36

FashionCStar commented 3 years ago

@HermanFassett http://youtube-scrape.herokuapp.com/api/search?q=angular&page=1 I got a blank result after making this request 2 times

and when I run your project on my local, I am getting a blank result too: localhost:3000/api/search?q=angular&page=1

HermanFassett commented 3 years ago

@HermanFassett http://youtube-scrape.herokuapp.com/api/search?q=angular&page=1 I got a blank result after making this request 2 times

and when I run your project on my local, I am getting a blank result too: localhost:3000/api/search?q=angular&page=1

Okay, yeah my heroku deploy was the old version 0.1.1 since I haven't merged to master yet, so I would expect that. I pushed the changes and now I see consistent results. Are you running locally on the latest develop branch change, 0.1.2?

FashionCStar commented 3 years ago

@HermanFassett of course, I am running version 0.1.2 on my local

[screenshot]

FashionCStar commented 3 years ago

but after 2 or 3 api calls, it returns a blank result

HermanFassett commented 3 years ago

@HermanFassett of course, I am running version 0.1.2 on my local

[screenshot]

Okay, I asked because I did not expect you to get "parser":"json_format" in the results you posted previously. I expected either "parser": "json_format.scraper_data" or "parser": "json_format.original".

FashionCStar commented 3 years ago

so should I change the parser to "json_format.scraper_data" ?

FashionCStar commented 3 years ago

json["parser"] = "json_format.scraper_data"; like this @HermanFassett ?

HermanFassett commented 3 years ago

so should I change the parser to "json_format.scraper_data" ?

You shouldn't need to make those changes if you pull the development branch down to your local machine. That branch has all the changes, and I'll be merging them into master soon. If you think you have all the changes and you've checked out the development branch, run git log and verify that the top commit says Fix youtube json parsing (#35) (30297509bd72979c596328303fb802ffea420115).

FashionCStar commented 3 years ago

Cool, I will pull from the development branch. BTW, did you find a solution for the page number issue?

HermanFassett commented 3 years ago

@cosminadrianpopescu @FashionCStar potential update coming to fix the page issue (#36) with initial work in eb6cc42c050bd1c326b18667c872dac96febe6a9 on secondary branch. Will need more work before merging to develop.

If this issue on intermittent failures appears to be fixed for you guys with the changes currently on develop, I can work on merging to master and closing this issue.

cosminadrianpopescu commented 3 years ago

For me the current issue is solved. No problem with the fix from develop.

FashionCStar commented 3 years ago

@cosminadrianpopescu Did you fix the page number issue?

cosminadrianpopescu commented 3 years ago

Did you fix the page number issue?

No, I was just saying that the current issue is solved. I will have a look at the PR that @HermanFassett was talking about and see if this fixes the page issue. But at the moment, the page issue is there.

FashionCStar commented 3 years ago

@HermanFassett How are you doing? Any good news for the page number issue? Regards

HermanFassett commented 3 years ago

@HermanFassett How are you doing? Any good news for the page number issue? Regards

I'm out of town atm so I can't really work on changes. You can try out the branch update-pagination for my change that should give you a pageToken and key to use for next page results, but I need to clean it up more before I merge into develop and master. I don't think it's possible to have the code work with the page=n query string it used to use.
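A hypothetical sketch of how a client might chain requests with the pageToken and key mentioned above; the helper name, field names, and query parameters here are assumptions for illustration, not the update-pagination branch's confirmed API:

```javascript
// Hypothetical client-side helper (NOT the branch's actual API): builds the
// next request URL, assuming the previous response exposes `key` and
// `pageToken` fields and the server accepts them back as query parameters.
function nextPageUrl(base, query, prev) {
  const params = new URLSearchParams({ q: query });
  if (prev && prev.key && prev.pageToken) {
    params.set("key", prev.key);
    params.set("pageToken", prev.pageToken);
  }
  return `${base}/api/search?${params.toString()}`;
}
```

A client would call nextPageUrl(base, q, null) for the first page, then feed each response's token fields back in to fetch the next page, replacing the old page=n query string.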

FashionCStar commented 3 years ago

@HermanFassett I am running your update-pagination branch and it's working well. I have one question: how can I get the channel title and channel URL of each video in the parseVideoRenderer function?

HermanFassett commented 3 years ago

@HermanFassett I am running your update-pagination branch and it's working well. I have one question: how can I get the channel title and channel URL of each video in the parseVideoRenderer function?

The result of that method is an object with a video child object and an uploader child object. You can get data for the channel, like uploader.username and uploader.url, associated with that video result.
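Based on that shape, consuming a result might look like this (channelInfo is an illustrative helper, not part of the repo):

```javascript
// Illustrative helper: pull channel info out of one parseVideoRenderer
// result, which has `video` and `uploader` child objects as described above.
function channelInfo(result) {
  return {
    title: result.uploader.username, // channel title
    url: result.uploader.url,        // channel URL
  };
}
```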

cosminadrianpopescu commented 3 years ago

I've tested the update-pagination branch and it works for me.

MhmdSalah commented 3 years ago

Hello @HermanFassett and thank you for your efforts. Just wanted to give you a heads-up that starting last Wednesday, 28 Oct 2020, youtube returns the new format only, so all responses from the current repo return with parser = .scraper_data.

I suggest that you tweak the code to check for .scraper_data first, before the old .original html format, to optimize the code.

Best regards.
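Using the two regexes posted earlier in this thread, the suggested reordering could look like this sketch (parseInitialData is an illustrative name, not the repo's actual function):

```javascript
// Sketch of the suggested tweak: try the scraper_data pattern first, fall
// back to the original window["ytInitialData"] pattern only if needed.
function parseInitialData(html) {
  const txt = html.replace(/[\n\r]/g, "");
  const pScraper = /^.*ytInitialData[ ]*=[ ]*(.*);[ ]*\/\/ scraper_data_end.*$/;
  const pOriginal = /^.*ytInitialData"[ ]*\][ ]*=[ ]*(.*);[ ]*window[ ]*\["ytInitialPlayerResponse"\].*$/;
  const pattern = pScraper.test(txt) ? pScraper : pOriginal;
  return JSON.parse(txt.replace(pattern, "$1"));
}
```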