Closed lillem4n closed 5 years ago
I think it might have to do with these headers:
'expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
server: 'cloudflare',
'cf-ray': '51bbbb7d6fe5caf4-ARN'
We actually ended up using node-libcurl to work around this specific problem. Maybe cloudscraper should try a simple node-libcurl call first and, if that fails, go on with the other methods?
Hi @lillem4n. Thanks for reporting this issue. So the problem is Cloudflare returning the headers you mentioned (and maybe cloudscraper not passing them back)?
The problem seems to be that Cloudflare tells us to set a cookie and then redirect. Then after the redirect it checks if that cookie is set. If not, it gets mad and barfs 403 at us. This is a theory after reading curl debug output.
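To make the theory concrete, here is a toy simulation of the set-cookie-then-redirect check. It is only a sketch: `fakeServer`, `fetchWithRedirects`, the paths, and the cookie `check=1` are all made up for illustration and do not reflect Cloudflare's actual logic.

```javascript
// Toy model of the theory above: NOT Cloudflare's real behaviour.
// The "server" sets a cookie and redirects, then answers 403 if the
// cookie does not come back on the redirected request.
const CHECK_COOKIE = 'check=1'; // hypothetical cookie name and value

function fakeServer(path, cookieHeader) {
  if (path === '/start') {
    return { status: 302, location: '/page', setCookie: CHECK_COOKIE };
  }
  // After the redirect, the cookie must be present.
  const cookies = cookieHeader ? cookieHeader.split('; ') : [];
  return cookies.includes(CHECK_COOKIE)
    ? { status: 200, body: 'ok' }
    : { status: 403, body: 'forbidden' };
}

// Follows redirects. `keepCookies` toggles whether Set-Cookie values are
// stored and replayed, which is the job a cookie jar does in a real client.
function fetchWithRedirects(path, keepCookies) {
  const jar = [];
  for (let hops = 0; hops < 5; hops++) {
    const res = fakeServer(path, jar.join('; '));
    if (keepCookies && res.setCookie) jar.push(res.setCookie);
    if (res.status !== 302) return res.status;
    path = res.location;
  }
  throw new Error('too many redirects');
}

console.log(fetchWithRedirects('/start', true));  // 200
console.log(fetchWithRedirects('/start', false)); // 403
```

If the theory holds, a client that drops the cookie between the redirect and the follow-up request would see the 403, while one that keeps a cookie jar across the redirect would not.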
@lillem4n,
I ran this test:
const cloudscraper = require('cloudscraper');
cloudscraper.debug = true;

const uri = 'https://nelly.com/se/kl%C3%A4der-f%C3%B6r-kvinnor/kl%C3%A4der/toppar/nly-trend-917/one-side-top-92337-0001/';
const har = require('./nelly.com.har');
const expectCookies = har.log.entries[0].response.cookies.map(c => c.name);

(async () => {
  try {
    await cloudscraper.get(uri);
  } catch (error) {
    console.error(error);
  }

  const actualCookies = cloudscraper.defaultParams.jar.getCookies(uri).map(c => c.key);
  const cookieString = cloudscraper.defaultParams.jar.getCookieString(uri);

  console.log({
    jar: {
      matching: expectCookies.filter(name => -1 !== actualCookies.indexOf(name)),
      missing: expectCookies.filter(name => -1 === actualCookies.indexOf(name)),
      extraneous: actualCookies.filter(name => -1 === expectCookies.indexOf(name))
    },
    header: {
      matching: expectCookies.filter(name => -1 !== cookieString.indexOf(name)),
      missing: expectCookies.filter(name => -1 === cookieString.indexOf(name))
    }
  });

  console.log(`Cookie: ${cookieString}`);
})();
Everything checks out. Cloudscraper uses request, which in turn uses the RFC 6265-compliant tough-cookie.
Yes, we scraped this site perfectly 2 days ago, and hit this issue yesterday, so it seems it is not consistent. :(
@lillem4n,
I think it might have to do with these headers:
'expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
server: 'cloudflare',
'cf-ray': '51bbbb7d6fe5caf4-ARN'
The expect-ct header only has to do with certificate validation and I'm not sure that clients would communicate anything extra in response to it.
If you're getting a 403, see https://github.com/codemanki/cloudscraper#recaptcha. Additionally, you might try:
cloudscraper.defaultParams.agentOptions.ciphers += ':!SHA';
I'm trying your code now to see if our IP has anything to do with triggering the issue. But what does this line do: const har = require('./nelly.com.har'); ?
We are unblocked again, and I cannot trigger the error anymore. :( Maybe close this until it comes back so I can try debugging again?
I'm trying your code now to see if our IP has anything to do with triggering the issue. But what does this line do: const har = require('./nelly.com.har'); ?
Browsers such as Firefox and Chromium allow you to save the request logs as HAR. The file is hidden under a spoiler in my previous post. Also Cloudscraper supports the har option.
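For reference, here is a minimal sketch of the part of a HAR file that the test script reads. The cookie names and values are made up; a real browser export carries many more fields per entry.

```javascript
// Minimal, made-up HAR fragment: only the fields the test script touches.
const har = {
  log: {
    entries: [
      {
        response: {
          cookies: [
            { name: 'exampleA', value: '1' }, // hypothetical cookie
            { name: 'exampleB', value: '2' }  // hypothetical cookie
          ]
        }
      }
    ]
  }
};

// Same expression as in the test script: the cookie names set by the
// first recorded response.
const expectCookies = har.log.entries[0].response.cookies.map(c => c.name);
console.log(expectCookies); // [ 'exampleA', 'exampleB' ]
```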
We are unblocked again, and I cannot trigger the error anymore.
If you get blocked again, here's another cipher string for you to try:
':!ECDHE+SHA:!AES128-SHA:!AESCCM:!DHE:!ARIA'
Note: If not appending, the leading colon should be removed.
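The note about the leading colon can be sketched as a small helper. `withCiphers` is a hypothetical function, not part of cloudscraper, and `'HIGH'` is just a placeholder base list for the demonstration.

```javascript
// Hypothetical helper (not part of cloudscraper): joins OpenSSL cipher-list
// fragments so that exactly one colon separates them, and no colon leads
// the result when there is nothing to append to.
function withCiphers(base, extra) {
  const trimmed = extra.replace(/^:+/, ''); // drop any leading colon
  if (!base) return trimmed;                // not appending: use as-is
  return `${base.replace(/:+$/, '')}:${trimmed}`;
}

const exclusions = ':!ECDHE+SHA:!AES128-SHA:!AESCCM:!DHE:!ARIA';

console.log(withCiphers('HIGH', exclusions));
// HIGH:!ECDHE+SHA:!AES128-SHA:!AESCCM:!DHE:!ARIA
console.log(withCiphers('', exclusions));
// !ECDHE+SHA:!AES128-SHA:!AESCCM:!DHE:!ARIA
```

Assuming `cloudscraper.defaultParams.agentOptions.ciphers` holds the current list, as the `+=` line earlier in the thread suggests, the result could be assigned back to it with `withCiphers(cloudscraper.defaultParams.agentOptions.ciphers, exclusions)`.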
When trying to scrape a specific URL on nelly.com a few times, Cloudflare sends some kind of cookie checker to see if this is a weird human or a robot. I'm testing with this URL: https://nelly.com/se/kläder-för-kvinnor/kläder/toppar/nly-trend-917/one-side-top-92337-0001/
The error response I get is: