Receiving blocked html response

lexlpz commented 4 years ago

Hi @codemanki and team, First of all I want to say that I am really enjoying playing with cloudscraper, you guys did amazing work! I just wanted to report a problem that I am having with one specific site (or rather, some of its URLs). I don't know if this is the correct place to write to you, but reading the description brought me here. So, my apologies if I should write this somewhere else or if this can be solved using some feature that is unknown to me. The issue I am experiencing is that the crawled site blocks me after 8-12 requests, sometimes sooner. The requests are not even sent immediately after each other: I am testing while I code, and sometimes 2 minutes or more pass between two requests. Attached you will find the body of the response. It looks like a generic block page (Pardon Our Interruption). It happens when I crawl the HomeAdvisor site, but only on the specific profile pages; it doesn't seem to happen with the home page. This is my code:

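const fs = require('fs');
const cheerio = require('cheerio');
const cloudscraper = require('cloudscraper');

var URL = 'https://www.homeadvisor.com/rated.ADorazioDesign.78724440.html';
cloudscraper.get(URL, function (error, response, body) {
    if (error) {
        // Stop here: body is not usable when the request failed.
        return console.log(error);
    }
    // Save the raw response body for inspection.
    fs.writeFile('./log/haBody.html', body, function (err) {
        if (err) {
            return console.log(err);
        }
        console.log("The response body was saved!");
    });
    // Extract the page title from the returned HTML.
    let $ = cheerio.load(body);
    var title = $('head > title').text().trim();
    // A value returned from this callback would be discarded, so log it instead.
    console.log(title);
});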

Any help would be greatly appreciated. Alex. haBody.zip

lexlpz commented 4 years ago

I also uploaded the code to my VPS, and from there, using the same code as above, HomeAdvisor always responds with a 403 status code.

codemanki commented 4 years ago

@lexlpz thanks for reporting this, happy to hear that you like cloudscraper :) I had a quick look into that website, and I think I managed to find a solution:

cloudscraper('https://www.homeadvisor.com/rated.ADorazioDesign.78724440.html', {headers: {
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
  cookie: 'cf_clearance=e99bd8abb53fdcc10389589f0f1cfc4e8727f9d8-1575710033-0-250; __cfduid=d0a1fc01ddeee8830a19d6e57e815181a1575710033; JSESSIONID=FD1800857397AAD77700C619F6CB1740.pwspr014-1; ..........'
}}).then((htmlStr) => {
  console.log(htmlStr)
}).catch((err) => {
  console.log(err)
})

You may notice that I have specified a custom user agent and also copy-pasted the cookies that HomeAdvisor sets after the first successful request. I guess that might not help with mass scraping, but most likely they are putting something in your cookies that you need to preserve.

lexlpz commented 4 years ago

@codemanki thank you so much for your response and help! I tried it and it worked, but I had to manually inspect my request headers and paste the new cookie into my code. How would I automatically get the cookie value and then put it in the cloudscraper headers? Maybe with a second, nested cloudscraper request like so?

cloudscraper(URL)
    .then((htmlStr) => {
        var cookie = htmlStr.request.headers.cookie;
        cloudscraper(URL, {
            headers: {
                'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
                cookie: cookie
            }
        }).then((htmlStr) => {
            var body = htmlStr.body;
            console.log(body);
        })
    }).catch((err) => {
        console.log(err)
    })

I tried it and it worked but I am not sure if I am getting the correct cookies and passing them to cloudscraper. Is this the way you would do it? Thanks again, Alex.

codemanki commented 4 years ago

You can fetch cookies using a custom cookie jar - https://github.com/codemanki/cloudscraper/issues/299#issuecomment-562938486
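
For readers unsure what a cookie jar does, here is a rough, dependency-free sketch of the idea: remember the cookies a server sets and send them back on later requests. This is not the jar implementation used by cloudscraper or its underlying HTTP library; `SimpleCookieStore` and its methods are hypothetical names for illustration, and the header values are made up. A real jar also tracks domain, path, and expiry, which this sketch ignores.

```javascript
// Illustrative only: a tiny in-memory cookie store (hypothetical helper,
// not part of cloudscraper).
class SimpleCookieStore {
  constructor() {
    this.cookies = new Map();
  }

  // Accepts a `set-cookie` response header value (a string or an array of strings).
  storeFromSetCookie(setCookieHeader) {
    const lines = [].concat(setCookieHeader || []);
    for (const line of lines) {
      // Keep only the leading "name=value" pair; attributes such as
      // Path, Expires, or HttpOnly are ignored in this sketch.
      const pair = line.split(';')[0];
      const eq = pair.indexOf('=');
      if (eq <= 0) continue;
      const name = pair.slice(0, eq).trim();
      const value = pair.slice(eq + 1).trim();
      // A later cookie with the same name overwrites the earlier value.
      this.cookies.set(name, value);
    }
  }

  // Builds the value for the outgoing `cookie` request header.
  toCookieHeader() {
    return Array.from(this.cookies, ([name, value]) => `${name}=${value}`).join('; ');
  }
}

// Usage with made-up header values:
const store = new SimpleCookieStore();
store.storeFromSetCookie([
  'cf_clearance=abc123; Path=/; HttpOnly',
  'JSESSIONID=XYZ; Path=/'
]);
store.storeFromSetCookie('JSESSIONID=NEW; Path=/');
console.log(store.toCookieHeader());
```

The resulting string could be passed as the `cookie` header, in the same way as the hand-copied value in the earlier snippet. In practice, a real cookie jar such as the one described in the linked comment is the more robust choice.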

codemanki / cloudscraper

Receiving blocked html response #296