Closed by lexlpz 4 years ago
I uploaded the code to my VPS, and from there, using the above code, HomeAdvisor always responds with a 403 status code.
@lexlpz thanks for reporting this, happy to hear that you like cloudscraper :) I had a quick look into that website, and I think I managed to find a solution:
    const cloudscraper = require('cloudscraper')

    // Send a browser-like user agent together with the cookies copied from a successful request
    cloudscraper('https://www.homeadvisor.com/rated.ADorazioDesign.78724440.html', {
      headers: {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        cookie: 'cf_clearance=e99bd8abb53fdcc10389589f0f1cfc4e8727f9d8-1575710033-0-250; __cfduid=d0a1fc01ddeee8830a19d6e57e815181a1575710033; JSESSIONID=FD1800857397AAD77700C619F6CB1740.pwspr014-1; ..........'
      }
    }).then((htmlStr) => {
      console.log(htmlStr)
    }).catch((err) => {
      console.log(err)
    })
You may notice that I have specified a custom user agent and copy-pasted the cookies that HomeAdvisor sets after the first successful request. I guess that might not help with mass scraping, but most likely they are putting something in your cookies that you need to preserve.
@codemanki thank you so much for your response and help! I tried it and it worked, but I had to manually inspect my request headers and paste the new cookie into my code. How would I automatically get the cookie value and then put it in the headers passed to cloudscraper? Maybe with a second, nested cloudscraper request like so?
    cloudscraper(URL)
      .then((htmlStr) => {
        var cookie = htmlStr.request.headers.cookie;
        cloudscraper(URL, {
          headers: {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
            cookie: cookie
          }
        }).then((htmlStr) => {
          var body = htmlStr.body;
          console.log(body);
        })
      }).catch((err) => {
        console.log(err)
      })
I tried it and it worked but I am not sure if I am getting the correct cookies and passing them to cloudscraper. Is this the way you would do it? Thanks again, Alex.
You can fetch cookies using a custom cookie jar: https://github.com/codemanki/cloudscraper/issues/299#issuecomment-562938486
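The linked comment is not reproduced here, so the following is only a rough sketch of the cookie jar idea, not necessarily what that comment shows. It assumes cloudscraper accepts request-style options (`uri`, `jar`, `headers`), that `require('cloudscraper').defaults()` returns a configured instance, and that the jar comes from the `request` package (`request.jar()`, `jar.getCookieString()`). The helper names `createSession`, `fetchPage` and `cookiesFor` are made up for illustration.

```javascript
// Sketch only: createSession, fetchPage and cookiesFor are hypothetical helpers.
// Assumes the `cloudscraper` and `request` packages are installed; they are
// loaded lazily so that defining the helpers has no side effects.
function createSession() {
  const jar = require('request').jar(); // cookie jar provided by `request`
  const client = require('cloudscraper').defaults({ jar });
  return { client, jar };
}

// Fetch a page through the session's client. Cookies set by the site are
// stored in the session's jar and sent again on later calls using that jar.
function fetchPage(session, uri, userAgent) {
  return session.client({
    uri,
    jar: session.jar,
    headers: { 'user-agent': userAgent }
  });
}

// Read back the cookies the jar currently holds for a URL, e.g. for logging.
function cookiesFor(session, uri) {
  return session.jar.getCookieString(uri);
}

// Possible usage (URL and UA are placeholders):
//   const session = createSession();
//   fetchPage(session, URL, UA)
//     .then(() => fetchPage(session, URL, UA))
//     .then((html) => console.log(html, cookiesFor(session, URL)))
//     .catch(console.error);
```

With a shared jar, the second call should reuse whatever cookies the first response set, so nothing has to be copied into the `cookie` header by hand. Whether that is enough to avoid the block page on this particular site is not something the sketch can guarantee.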
Hi @codemanki and team,

First of all, I want to say that I am really enjoying playing with cloudscraper; you guys did amazing work! I just wanted to report a problem that I am having with one specific site and set of URLs. I don't know if this is the correct place to write to you, but reading the description brought me here. My apologies if I should write this somewhere else, or if this can be solved using some feature that is unknown to me.

The issue is that the crawled site blocks me after 8-12 requests, sometimes sooner. The requests are not even immediately after each other: I am testing while I code, and sometimes 2 minutes or more pass between two requests. Attached you will find the body of the response. It looks like a generic block page (Pardon Our Interruption). It happens when I crawl the HomeAdvisor site, but only on the specific profile pages; it doesn't seem to happen with the home page. This is my code:
Any help would be greatly appreciated. Alex. haBody.zip