MhmdSalah closed this issue 3 years ago
I had some time on my hands and did some tests.
It seems like the HTML returned now comes in two (or more) variants. I made two files: one contains HTML that is parsed successfully using your current method, called Goodhtml, and one that fails, called Badhtml.
Here: https://www.dropbox.com/s/mwv8fbe6ctx9y4x/ytseachhtmlexamples.zip?dl=0
Funny thing is, the Badhtml (the one that fails) has comments about scrapers. Take a look at the file and look for
Alright, yeah, it looks like a different format of HTML.

```js
window["ytInitialData"] = {
    ...
};
window["ytInitialPlayerResponse"] = null;
```

vs

```js
var ytInitialData = {
    ...
};
```
So it looks like there either need to be two methods, or a more generic way of grabbing the data from the HTML file to parse.
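As one possible generic approach, a single pattern could match both assignment shapes. This is a minimal sketch, not code from the repo; it assumes the object literal ends at the first `};`, which held for the pages discussed in this thread but is not guaranteed in general:

```javascript
// Sketch: one regex that matches both shapes of the ytInitialData assignment,
//   window["ytInitialData"] = {...};   and   var ytInitialData = {...};
// Assumes the JSON ends at the first "};" (true for the samples in this thread).
function extractInitialData(html) {
  const match = html.match(/ytInitialData"?\]?\s*=\s*({[\s\S]*?});/);
  return match ? JSON.parse(match[1]) : null;
}

// Both formats parse with the same function:
const oldStyle = 'window["ytInitialData"] = {"a":1};\nwindow["ytInitialPlayerResponse"] = null;';
const newStyle = 'var ytInitialData = {"a":2};// scraper_data_end';
console.log(extractInitialData(oldStyle).a, extractInitialData(newStyle).a); // 1 2
```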
And it also seems that the page parameter is not working anymore. It works only now and then, but more often than not it fails.
That's interesting, from what I see, YouTube search results have infinite loading, but the page query still works on their website.
Yes, but if you try several times, it will eventually fail. It also works for me from time to time, but eventually it fails.
There is a temporary fix available:

```js
html
    .split("ytInitialData = ")[1]
    .split("</script>")[0]
    .replace("// scraper_data_end", "")
    .replace(";", "");
```
This returns ytInitialData as a string; it can then be parsed with JSON.parse().
Note that this only works for responses sent with the reworked HTML.
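For illustration, here is the fix applied end-to-end to a made-up sample of the reworked HTML. The sample string and the anchored trailing-semicolon regex are my own additions, not from the repo:

```javascript
// Made-up stand-in for a YouTube search page in the new format
const html = '<script>var ytInitialData = {"items":[1,2,3]};// scraper_data_end</script>';

const jsonText = html
  .split("ytInitialData = ")[1]
  .split("</script>")[0]
  .replace("// scraper_data_end", "")
  // anchored regex so only the trailing ";" is stripped, never one inside the JSON
  .replace(/;\s*$/, "");

const data = JSON.parse(jsonText);
console.log(data.items.length); // 3
```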
Yes, but keep in mind that at the moment YouTube pages also return the old version. So basically, you need to test first which version you have and then apply one pattern or the other.
```js
const txt = html.replace(/[\n\r]/g, '');
const p1 = /^.*ytInitialData"[ ]*\][ ]*=[ ]*(.*);[ ]*window[ ]*\["ytInitialPlayerResponse"\].*$/;
const p2 = /^.*ytInitialData[ ]*=[ ]*(.*);[ ]*\/\/ scraper_data_end.*$/;
let jsonTxt = txt.replace(p1, '$1');
if (!txt.match(p1)) {
    jsonTxt = txt.replace(p2, '$1');
}
const json = JSON.parse(jsonTxt);
```
Thanks, I've taken the initial idea here and made a commit to fix the issue. There is PR #35 to merge into develop once I'm able to test later.
@HermanFassett you can have a look at: https://github.com/SoulHarsh007/youtube-scrape/blob/master/scraper.js, Hope it helps you 😄
I have to say, I love how you optimized this code. Very elegant.
I see that in the new implementation you only display the first page of the results? Because my problem right now is that the page parameter is not working anymore.
https://github.com/FashionCStar/youtube-scrape I also customized your scraper and now it's working well on page 1, but page 2 returns the same results as page 1.
@HermanFassett When I run your project locally, I am getting blank results: `{"results":[],"version":"0.1.1","parser":"json_format"}`. So I changed the version to 0.1.2, but after a few requests it returns blank results again: `{"results":[],"version":"0.1.2","parser":"json_format"}`.
Is there a specific example that causes it to run blank? Or just continued results? Any exception message you could see? I couldn't get it to return blank after a few minutes of testing.
@cosminadrianpopescu @FashionCStar okay, will check out pages in issue #36
@HermanFassett http://youtube-scrape.herokuapp.com/api/search?q=angular&page=1 I got a blank result after making this request twice, and when I run your project locally I get a blank result too: localhost:3000/api/search?q=angular&page=1
Okay, yeah, my Heroku deploy was the old version 0.1.1 since I haven't merged to master yet, so I would expect that. I pushed the changes and now I see consistent results. You're running locally on the latest develop branch change, 0.1.2?
@HermanFassett Of course, I am running version 0.1.2 locally, but after 2 or 3 API calls it returns a blank result.
Okay, I asked because I did not expect you to get `"parser": "json_format"` in the results you posted previously. I expected either `"parser": "json_format.scraper_data"` or `"parser": "json_format.original"`.
So should I change the parser to `"json_format.scraper_data"`?

```js
json["parser"] = "json_format.scraper_data";
```

Like this, @HermanFassett?
You shouldn't need to make the changes if you pull the development branch down to your local machine. That branch has all the changes; I'll be merging those changes into master soon. If you think you have all the changes and you've checked out the development branch, run `git log` and verify the top commit says `Fix youtube json parsing (#35)` (30297509bd72979c596328303fb802ffea420115).
Cool, I will pull from the development branch.
BTW, did you find a solution for page number?
@cosminadrianpopescu @FashionCStar potential update coming to fix the page issue (#36) with initial work in eb6cc42c050bd1c326b18667c872dac96febe6a9 on secondary branch. Will need more work before merging to develop.
If this issue on intermittent failures appears to be fixed for you guys with the changes currently on develop, I can work on merging to master and closing this issue.
For me the current issue is solved. No problem with the fix from develop.
@cosminadrianpopescu Did you fix the page number issue?
No, I was just saying that the current issue is solved. I will have a look at the PR @HermanFassett mentioned and see if it fixes the page issue. But at the moment, the page issue is still there.
@HermanFassett How are you doing? Any good news for the page number issue? Regards
I'm out of town atm, so I can't really work on changes. You can try out the update-pagination branch for my change, which should give you a pageToken and key to use for next-page results, but I need to clean it up more before I merge into develop and master. I don't think it's possible to have the code work with the `page=n` query string it used to use.
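If it helps, here is a hypothetical client-side sketch of that flow: the first call sends only the query, and follow-up calls pass back the token and key from the previous response. The `pageToken` and `key` query parameter names are assumptions about the branch's API, not confirmed from its code:

```javascript
// Build a search URL for the youtube-scrape API; pass the previous response's
// pageToken and key (parameter names assumed) to fetch the next page.
function buildSearchUrl(baseUrl, query, prev) {
  const params = new URLSearchParams({ q: query });
  if (prev && prev.pageToken) {
    params.set('pageToken', prev.pageToken);
    params.set('key', prev.key);
  }
  return `${baseUrl}/api/search?${params.toString()}`;
}

console.log(buildSearchUrl('http://localhost:3000', 'angular'));
// With a previous response's token and key:
console.log(buildSearchUrl('http://localhost:3000', 'angular',
  { pageToken: 'TOKEN', key: 'KEY' }));
```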
@HermanFassett I am running your update-pagination branch and it's working well. I have one question: how can I get the channel title and channel URL of each video in the `parseVideoRenderer` function?
The result of that method is an object with a `video` child object and an `uploader` child object. You can get data for the channel, like `uploader.username` and `uploader.url`, associated with that video result.
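A minimal sketch of reading those fields from one result; the sample data is invented, and only the `video`/`uploader` shape follows the description above:

```javascript
// Invented sample shaped like a parseVideoRenderer result:
// a video child object and an uploader child object.
const result = {
  video: { id: 'abc123', title: 'Example video' },
  uploader: { username: 'ExampleChannel', url: 'https://www.youtube.com/c/ExampleChannel' }
};

// Channel title and channel URL come from the uploader child object.
const channelTitle = result.uploader.username;
const channelUrl = result.uploader.url;
console.log(channelTitle, channelUrl);
```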
I've tested the update-pagination branch and it works for me.
Hello @HermanFassett, and thank you for your efforts. Just wanted to give you a heads-up that starting last Wednesday, 28 Oct 2020, YouTube returns the new format only, so all responses from the current repo come back with parser = `.scraper_data`.
I suggest you tweak the code to skip the check for the old html_format and check for `.scraper_data` before `.original`, to optimize the code.
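A sketch of that reordering, assuming the new format can be detected by its scraper comment. The two helper bodies below are illustrative stand-ins for the repo's actual parsers, not its real code:

```javascript
// Stand-in parser for the new format (extract between the assignment
// and the scraper_data_end comment, drop the trailing semicolon).
function parseScraperData(html) {
  const text = html.split('ytInitialData = ')[1].split('// scraper_data_end')[0];
  return JSON.parse(text.trim().replace(/;$/, ''));
}

// Stand-in parser for the original window["ytInitialData"] format.
function parseOriginal(html) {
  const match = html.match(/window\["ytInitialData"\]\s*=\s*({[\s\S]*?});/);
  return JSON.parse(match[1]);
}

// Check the now-dominant scraper_data format first, fall back to the original.
function parseSearchHtml(html) {
  if (html.includes('// scraper_data_end')) {
    return { parser: 'json_format.scraper_data', data: parseScraperData(html) };
  }
  return { parser: 'json_format.original', data: parseOriginal(html) };
}

const sample = 'var ytInitialData = {"a":1};// scraper_data_end';
console.log(parseSearchHtml(sample).parser); // json_format.scraper_data
```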
Best regards.
Hello! Recently I noticed that the scraper fails more often. Here is a video showing what I mean: https://www.dropbox.com/s/bfn858cbaqo7b6m/video%20for%20youtube%20scrapper.mov?dl=0 When I have time, I will try to investigate this issue.