codemanki / cloudscraper

--DEPRECATED -- 🛑 🛑 Node.js library to bypass cloudflare's anti-ddos page
MIT License

403 Cookie-thingy error on nelly.com #264

Closed lillem4n closed 5 years ago

lillem4n commented 5 years ago

When trying to scrape a specific URL on nelly.com a few times, Cloudflare sends some kind of cookie-checker to see if this is a weird human or a robot. I'm testing with this URL: https://nelly.com/se/kläder-för-kvinnor/kläder/toppar/nly-trend-917/one-side-top-92337-0001/

The error response I get is:

403 - "<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>

<!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]-->
<!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script><!--<![endif]-->

</head>
<body>
  <div id="cf-wrapper">
    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
        <h1 data-translate="block_headline">Sorry, you have been blocked</h1>
        <h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> nelly.com</h2>
      </div><!-- /.header -->

      <div class="cf-section cf-highlight">
        <div class="cf-wrapper">
          <div class="cf-screenshot-container cf-screenshot-full">

              <span class="cf-no-screenshot error"></span>

          </div>
        </div>
      </div><!-- /.captcha-container -->

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="blocked_why_headline">Why have I been blocked?</h2>

            <p data-translate="blocked_why_detail">This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.</p>
          </div>

          <div class="cf-column">
            <h2 data-translate="blocked_resolve_headline">What can I do to resolve this?</h2>

            <p data-translate="blocked_resolve_detail">You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.</p>
          </div>
        </div>
      </div><!-- /.section -->

      <div class="cf-error-footer cf-wrapper">
  <p>
    <span class="cf-footer-item">Cloudflare Ray ID: <strong>51bb8b733830867d</strong></span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Your IP</span>: 212.37.30.210</span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Performance &amp; security by</span> <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span>

    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item">
      <select id="lang-selector">
        <option value="">Select a Language</option>
        <option value="en">English</option>
        <option value="es">Español</option>
      </select>
    </span>

  </p>
</div><!-- /.error-footer -->

    </div><!-- /#cf-error-details -->
  </div><!-- /#cf-wrapper -->

  <script type="text/javascript">
  window._cf_translation = {};
  window._cf_translation.locale = 'en';
  window._cf_translation.blobs = {};
</script>

</body>
</html>"
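
For anyone who wants to tell this block page apart from other 403 responses, here is a minimal sketch. The helper name `looksLikeCloudflareBlock` is hypothetical and not part of cloudscraper. The two markers are copied from the HTML above and are an assumption about what stays stable, since Cloudflare may change the page at any time.

```javascript
// Hypothetical helper: guesses whether a response is the "you have been blocked"
// page shown above. The markers are taken from that page and may change.
function looksLikeCloudflareBlock(statusCode, body) {
  if (statusCode !== 403 || typeof body !== 'string') {
    return false;
  }
  // Require both the wrapper id and the headline text to reduce false positives.
  return body.includes('cf-error-details') &&
    body.includes('Sorry, you have been blocked');
}
```

It could be called from a catch handler with the status code and body of the failed response, assuming the error object exposes both.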
lillem4n commented 5 years ago

I think it might have to do with these headers:

'expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
server: 'cloudflare',
'cf-ray': '51bbbb7d6fe5caf4-ARN'

lillem4n commented 5 years ago

We ended up using node-libcurl to work around this specific problem. Maybe cloudscraper should try a simple node-libcurl call first, and if that fails, go on with the other methods?

codemanki commented 5 years ago

Hi @lillem4n. Thanks for reporting this issue. So the problem is that Cloudflare returns the headers you mentioned (and cloudscraper is maybe not passing them back)?

lillem4n commented 5 years ago

The problem seems to be that Cloudflare tells us to set a cookie and then redirect. Then after the redirect it checks if that cookie is set. If not, it gets mad and barfs 403 at us. This is a theory after reading curl debug output.
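
The theory above can be illustrated with a toy simulation. This is only a sketch of the suspected flow, not Cloudflare's real logic: the `handle` function, the `/start` and `/page` paths, and the `check=1` cookie are all made up for illustration.

```javascript
// Hypothetical stand-in for the server side of the theory: the first request
// sets a cookie and redirects, the second request is rejected without it.
function handle(request) {
  const cookies = request.headers.cookie || '';
  if (request.path === '/start') {
    return {
      status: 302,
      headers: { 'set-cookie': 'check=1; Path=/', location: '/page' }
    };
  }
  // After the redirect, the cookie must come back or the request gets a 403.
  return cookies.includes('check=1')
    ? { status: 200, headers: {} }
    : { status: 403, headers: {} };
}

// Toy client that follows redirects and optionally stores cookies in between.
function fetchFollowingRedirects(path, keepCookies) {
  let cookieHeader = '';
  let response = handle({ path, headers: {} });
  while (response.status === 302) {
    if (keepCookies && response.headers['set-cookie']) {
      // Keep only the name=value pair, dropping attributes such as Path.
      cookieHeader = response.headers['set-cookie'].split(';')[0];
    }
    response = handle({
      path: response.headers.location,
      headers: { cookie: cookieHeader }
    });
  }
  return response.status;
}
```

A client that stores the cookie across the redirect ends with a 200 in this model, while one that drops it ends with a 403, which matches the behaviour described in the curl debug output.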

ghost commented 5 years ago

@lillem4n,

I ran this test:

const cloudscraper = require('cloudscraper');

cloudscraper.debug = true;

const uri = 'https://nelly.com/se/kl%C3%A4der-f%C3%B6r-kvinnor/kl%C3%A4der/toppar/nly-trend-917/one-side-top-92337-0001/';
const har = require('./nelly.com.har');

const expectCookies = har.log.entries[0].response.cookies.map(c => c.name);

(async () => {
  try {
    await cloudscraper.get(uri);
  } catch (error) {
    console.error(error);
  }

  const actualCookies = cloudscraper.defaultParams.jar.getCookies(uri).map(c => c.key);
  const cookieString = cloudscraper.defaultParams.jar.getCookieString(uri);

  console.log({
    jar: {
      matching: expectCookies.filter(name => -1 !== actualCookies.indexOf(name)),
      missing: expectCookies.filter(name => -1 === actualCookies.indexOf(name)),
      extraneous: actualCookies.filter(name => -1 === expectCookies.indexOf(name))
    },
    header: {
      matching: expectCookies.filter(name => -1 !== cookieString.indexOf(name)),
      missing: expectCookies.filter(name => -1 === cookieString.indexOf(name))
    }
  });

  console.log(`Cookie: ${cookieString}`);
})();

Everything checks out. Cloudscraper uses request which in turn uses (RFC6265-compliant) tough-cookie.
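
Two caveats about the comparison in the test: the HAR lists some cookie names more than once (for example channelId), and the header check uses substring matching on the cookie string. A minimal sketch of a set-based comparison follows. `diffCookieNames` is a hypothetical helper, not part of cloudscraper or tough-cookie.

```javascript
// Hypothetical helper: compares two lists of cookie names, ignoring duplicates.
function diffCookieNames(expected, actual) {
  const expectedSet = new Set(expected);
  const actualSet = new Set(actual);
  return {
    // Names present in both lists.
    matching: [...expectedSet].filter(name => actualSet.has(name)),
    // Names the HAR contains but the jar does not.
    missing: [...expectedSet].filter(name => !actualSet.has(name)),
    // Names the jar contains but the HAR does not.
    extraneous: [...actualSet].filter(name => !expectedSet.has(name))
  };
}
```

In the test above it could replace the three filter expressions under `jar`, called as `diffCookieNames(expectCookies, actualCookies)`.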

nelly.com.har.json

```json
{
  "log": {
    "version": "1.2",
    "creator": { "name": "WebInspector", "version": "537.36" },
    "pages": [
      {
        "startedDateTime": "2019-09-25T11:44:21.849Z",
        "id": "page_1",
        "title": "https://nelly.com/se/kl%C3%A4der-f%C3%B6r-kvinnor/kl%C3%A4der/toppar/nly-trend-917/one-side-top-92337-0001/",
        "pageTimings": { "onContentLoad": null, "onLoad": null }
      }
    ],
    "entries": [
      {
        "startedDateTime": "2019-09-25T11:44:21.835Z",
        "time": 664.5030000072438,
        "request": {
          "method": "GET",
          "url": "https://nelly.com/se/kl%C3%A4der-f%C3%B6r-kvinnor/kl%C3%A4der/toppar/nly-trend-917/one-side-top-92337-0001/",
          "httpVersion": "http/2.0",
          "headers": [
            { "name": ":method", "value": "GET" },
            { "name": ":authority", "value": "nelly.com" },
            { "name": ":scheme", "value": "https" },
            { "name": ":path", "value": "/se/kl%C3%A4der-f%C3%B6r-kvinnor/kl%C3%A4der/toppar/nly-trend-917/one-side-top-92337-0001/" },
            { "name": "cache-control", "value": "max-age=0" },
            { "name": "dnt", "value": "1" },
            { "name": "upgrade-insecure-requests", "value": "1" },
            { "name": "user-agent", "value": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36" },
            { "name": "sec-fetch-mode", "value": "navigate" },
            { "name": "sec-fetch-user", "value": "?1" },
            { "name": "accept", "value": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3" },
            { "name": "sec-fetch-site", "value": "none" },
            { "name": "referer", "value": "https://github.com/codemanki/cloudscraper/issues/264" },
            { "name": "accept-encoding", "value": "gzip, deflate, br" },
            { "name": "accept-language", "value": "en-US,en;q=0.9" }
          ],
          "queryString": [],
          "cookies": [],
          "headersSize": -1,
          "bodySize": 0
        },
        "response": {
          "status": 200,
          "statusText": "",
          "httpVersion": "http/2.0",
          "headers": [
            { "name": "status", "value": "200" },
            { "name": "date", "value": "Wed, 25 Sep 2019 11:44:22 GMT" },
            { "name": "content-type", "value": "text/html; charset=utf-8" },
            { "name": "set-cookie", "value": "__cfduid=d3094ac64216f48c2aaf228591f1028001569411861; expires=Thu, 24-Sep-20 11:44:21 GMT; path=/; domain=.nelly.com; HttpOnly; Secure" },
            { "name": "set-cookie", "value": "channelId=1; expires=Fri, 25-Oct-2019 11:44:21 GMT; path=/" },
            { "name": "set-cookie", "value": "languageId=1; expires=Fri, 25-Oct-2019 11:44:21 GMT; path=/" },
            { "name": "set-cookie", "value": "ChoosenCountry1=se; expires=Fri, 25-Oct-2019 11:44:21 GMT; path=/" },
            { "name": "set-cookie", "value": "ASP.NET_SessionId=wpmqw0v0lafnx3srxqij2qs5; path=/; HttpOnly" },
            { "name": "set-cookie", "value": "channelId=1; expires=Fri, 25-Oct-2019 11:44:21 GMT; path=/" },
            { "name": "set-cookie", "value": "languageId=1; expires=Fri, 25-Oct-2019 11:44:21 GMT; path=/" },
            { "name": "set-cookie", "value": "ChoosenCountry1=se; expires=Fri, 25-Oct-2019 11:44:21 GMT; path=/" },
            { "name": "set-cookie", "value": "ASP.NET_SessionId=wpmqw0v0lafnx3srxqij2qs5; path=/; HttpOnly" },
            { "name": "set-cookie", "value": "CartSessionId=5cb06199-f045-4429-b1c5-f1b14f228762; expires=Mon, 30-Sep-2019 11:44:21 GMT; path=/" },
            { "name": "set-cookie", "value": "__RequestVerificationToken=fPhOjuKB13RGlBbGSVrvCsvaAkfulleWN5_4Vwy5axQfrZCkXRpuodMBahLG-vEveD7xdRgN4aCayBbeMRAROH2QD2M1; path=/; HttpOnly" },
            { "name": "set-cookie", "value": "NSC_OFMTDB-MC-WT-ofmmz.dpn-IUUQT=ffffffff092b009d45525d5f4f58455e445a4a423661;expires=Wed, 25-Sep-2019 11:46:22 GMT;path=/;secure;httponly" },
            { "name": "cache-control", "value": "private" },
            { "name": "vary", "value": "Accept-Encoding" },
            { "name": "x-frame-options", "value": "SAMEORIGIN" },
            { "name": "x-frame-options", "value": "SAMEORIGIN" },
            { "name": "x-server", "value": "07" },
            { "name": "expect-ct", "value": "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"" },
            { "name": "server", "value": "cloudflare" },
            { "name": "cf-ray", "value": "51bcbee8dda7e102-IAD" },
            { "name": "content-encoding", "value": "br" }
          ],
          "cookies": [
            { "name": "__cfduid", "value": "d3094ac64216f48c2aaf228591f1028001569411861", "path": "/", "domain": ".nelly.com", "expires": "2020-09-24T11:44:21.000Z", "httpOnly": true, "secure": true },
            { "name": "channelId", "value": "1", "path": "/", "expires": "2019-10-25T11:44:21.000Z", "httpOnly": false, "secure": false },
            { "name": "languageId", "value": "1", "path": "/", "expires": "2019-10-25T11:44:21.000Z", "httpOnly": false, "secure": false },
            { "name": "ChoosenCountry1", "value": "se", "path": "/", "expires": "2019-10-25T11:44:21.000Z", "httpOnly": false, "secure": false },
            { "name": "ASP.NET_SessionId", "value": "wpmqw0v0lafnx3srxqij2qs5", "path": "/", "expires": null, "httpOnly": true, "secure": false },
            { "name": "channelId", "value": "1", "path": "/", "expires": "2019-10-25T11:44:21.000Z", "httpOnly": false, "secure": false },
            { "name": "languageId", "value": "1", "path": "/", "expires": "2019-10-25T11:44:21.000Z", "httpOnly": false, "secure": false },
            { "name": "ChoosenCountry1", "value": "se", "path": "/", "expires": "2019-10-25T11:44:21.000Z", "httpOnly": false, "secure": false },
            { "name": "ASP.NET_SessionId", "value": "wpmqw0v0lafnx3srxqij2qs5", "path": "/", "expires": null, "httpOnly": true, "secure": false },
            { "name": "CartSessionId", "value": "5cb06199-f045-4429-b1c5-f1b14f228762", "path": "/", "expires": "2019-09-30T11:44:21.000Z", "httpOnly": false, "secure": false },
            { "name": "__RequestVerificationToken", "value": "fPhOjuKB13RGlBbGSVrvCsvaAkfulleWN5_4Vwy5axQfrZCkXRpuodMBahLG-vEveD7xdRgN4aCayBbeMRAROH2QD2M1", "path": "/", "expires": null, "httpOnly": true, "secure": false },
            { "name": "NSC_OFMTDB-MC-WT-ofmmz.dpn-IUUQT", "value": "ffffffff092b009d45525d5f4f58455e445a4a423661", "path": "/", "expires": "2019-09-25T11:46:22.000Z", "httpOnly": true, "secure": true }
          ],
          "content": { "size": 251363, "mimeType": "text/html", "text": "" },
          "redirectURL": "",
          "headersSize": -1,
          "bodySize": -1,
          "_transferSize": 39144
        },
        "cache": {},
        "timings": {
          "blocked": 15.132000006753021,
          "dns": -1,
          "ssl": -1,
          "connect": -1,
          "send": 4.158,
          "wait": 605.3489999947268,
          "receive": 39.864000005763955,
          "_blocked_queueing": 14.07600000675302
        },
        "serverIPAddress": "104.16.167.241",
        "_initiator": { "type": "other" },
        "_priority": "VeryHigh",
        "_resourceType": "document",
        "connection": "10739",
        "pageref": "page_1"
      }
    ]
  }
}
```
lillem4n commented 5 years ago

Yes, we scraped this site perfectly 2 days ago, and hit this issue yesterday, so it seems it is not consistent. :(

ghost commented 5 years ago

@lillem4n,

I think it might have to do with these headers:

'expect-ct': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
server: 'cloudflare',
'cf-ray': '51bbbb7d6fe5caf4-ARN'

The expect-ct header only has to do with certificate validation and I'm not sure that clients would communicate anything extra in response to it.

If you're getting a 403, see https://github.com/codemanki/cloudscraper#recaptcha for details. Additionally, you might try:

cloudscraper.defaultParams.agentOptions.ciphers += ':!SHA';

lillem4n commented 5 years ago

I'm trying your code now to see if our IP has anything to do with triggering the issue. But what does this line do: const har = require('./nelly.com.har'); ?

lillem4n commented 5 years ago

We are unblocked again, I can not trigger the error anymore. :( Maybe close this until it comes back so I can try debugging again?

ghost commented 5 years ago

I'm trying your code now to see if our IP has anything to do with triggering the issue. But what does this line do: const har = require('./nelly.com.har'); ?

Browsers such as Firefox and Chromium allow you to save the request logs as HAR. The file is hidden under a spoiler in my previous post. Also Cloudscraper supports the har option.
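
To make the HAR part concrete, here is a small sketch of pulling the response cookie names out of a HAR object. `cookieNamesFromHar` and `sampleHar` are hypothetical names for illustration. The `log.entries[].response.cookies[].name` path is the same one the test script reads.

```javascript
// Hypothetical helper: collects the unique response cookie names from a HAR object.
function cookieNamesFromHar(har) {
  const names = new Set();
  for (const entry of har.log.entries) {
    // Tolerate entries without a cookies array.
    for (const cookie of entry.response.cookies || []) {
      names.add(cookie.name);
    }
  }
  return [...names];
}

// Tiny stand-in for a saved HAR file; a real one would be loaded with require().
const sampleHar = {
  log: {
    entries: [
      {
        response: {
          cookies: [
            { name: '__cfduid', value: 'x' },
            { name: 'channelId', value: '1' },
            { name: 'channelId', value: '1' }
          ]
        }
      }
    ]
  }
};
```

With the file from the previous post saved next to the script, the input would be `require('./nelly.com.har')` instead of `sampleHar`.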

We are unblocked again, I can not trigger the error anymore.

If you get blocked again, here's another cipher string for you to try: ':!ECDHE+SHA:!AES128-SHA:!AESCCM:!DHE:!ARIA'

Note: If not appending, the leading colon should be removed.
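
The colon handling from the note can be wrapped in a small helper. This is only a sketch: `appendCiphers` is a hypothetical name, and it assumes the cipher list is a plain colon-separated string as in the examples above.

```javascript
// Hypothetical helper: appends extra cipher rules to a colon-separated list.
// When there is no existing list, the leading colon is dropped, as the note says.
function appendCiphers(base, extra) {
  const suffix = extra.replace(/^:+/, '');
  if (!base) {
    return suffix;
  }
  // Strip any trailing colon on the base so the join yields exactly one.
  return `${base.replace(/:+$/, '')}:${suffix}`;
}

// Usage sketch, assuming agentOptions exists as in the earlier example:
// const opts = cloudscraper.defaultParams.agentOptions;
// opts.ciphers = appendCiphers(opts.ciphers, ':!ECDHE+SHA:!AES128-SHA:!AESCCM:!DHE:!ARIA');
```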