Post request fails after initial cloudflare bypass.

luizkc commented 5 years ago

Hi!

I'm using Cloudscraper version 4.14 on Node version 12.10.0.

I'm attempting to access this website, which has a cloudflare protection page with a captcha.

I can bypass the cloudflare and access the site's homepage/any page, however, after bypassing, I am unable to successfully send a post request. The weirdest part about this is that my code works using the Python version of the module recreating the exact same requests.

When sending the post request, the console sometimes prints:

request received invalid json when debug is on.

In my second request I need a csrf token that gets returned with the first request's (the bypass request) response. Essentially I am trying to create an account on this website by first retrieving the csrf after the initial bypass (which I can do successfully) and then sending a post request with the account information.

Like I said, I can do this successfully in Python which leads me to believe the issue is related to the module's way of handling post requests, but of course I'm probably wrong. This is my code when sending both requests.

const captchaAPI = require("imagetyperz-api")
const cloudscraper = require("cloudscraper").defaults({ onCaptcha })

cloudscraper.debug = true

let captchaRes
// CALLED IF SCRAPER RUNS INTO CAPTCHA
async function onCaptcha(options, response, body) {
  const captchaData = response.captcha
  captchaRes = await solveCaptcha(
    response.request.uri.href,
    captchaData.siteKey
  )
  captchaData.form["g-recaptcha-response"] = captchaRes
  captchaData.submit()
}

// CALLED WHEN SOLVING A CAPTCHA IS NECESSARY
async function solveCaptcha(uri, sitekey) {
  captchaAPI.set_access_key("KEY")
  const params = {
    page_url: uri,
    sitekey: sitekey
  }
  console.log("Solving captcha...")
  const id = await captchaAPI.submit_recaptcha(params)
  const token = await captchaAPI.retrieve_recaptcha(id)
  return token
}

// GENERATE AN ACCOUNT
async function genAccount(req) {
  const Csrf = req.body.match(/setRequestHeader\('X-AntiCsrfToken', '(.+)'/)[1]
  console.log(`CSRF: ${Csrf}`)
  cloudscraper.defaultParams.headers = {}
  console.log(`CAPTCHA: ${captchaRes}`)
  const headers = {
    authority: "www.nakedcph.com",
    path: "/auth/submit",
    scheme: "https",
    accept: "application/json, text/javascript, */*; q=0.01",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9,pt;q=0.8",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    origin: "https://www.nakedcph.com",
    referer: "https://www.nakedcph.com/auth/view?op=register",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "user-agent":
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
    "X-AntiCsrfToken": Csrf,
    "X-Requested-With": "XMLHttpRequest"
  }
  const gen = await cloudscraper({
    url: "https://nakedcph.com/auth/submit",
    method: "POST",
    resolveWithFullResponse: true,
    // followOriginalHttpMethod: true,
    json: true,
    simple: false,
    headers: headers,
    formData: {
      _AntiCsrfToken: Csrf,
      firstName: "CLOUDSCRAPER TEST",
      email: "cloudscraper+123456@gmail.com",
      password: "MyPassword123",
      "g-recaptcha-response": String(captchaRes),
      action: "register"
    }
  })
  console.log("RESULT:")
  console.log(gen.statusCode)
  console.log(gen.body)
}

async function main() {
  const headers = {
    "User-Agent":
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
    Accept: "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "X-Requested-With": "XMLHttpRequest"
  }
  // FIRST REQUEST GETS SENT HERE
  const req = await cloudscraper({
    url: "https://www.nakedcph.com",
    method: "GET",
    resolveWithFullResponse: true,
    // json: true,
    simple: false,
    headers: headers
  })
  if (req.statusCode === 200) {
    // IF FIRST REQUEST IS GOOD AND CLOUDFLARE IS BYPASSED, WE TRY SENDING THE POST
    await genAccount(req)
  }
  console.log(req.statusCode)
}

main()

I get a 200 on the first request, and a 403 on the 2nd. This is what the server returns on the 403:

{ Response: null, StatusCode: 500, Status: '' }

Hopefully I'm being really stupid and there is a super simple solution to this. And like I said, my Python version is 100% functional doing the exact same request with the same headers and everything.

Thanks and sorry for any confusions and the long code. I've been looking at this for way longer than I should have and haven't been able to find the solution. Any help is much appreciated.

ghost commented 5 years ago

Hi @luizkc,

I'm sure we can figure this out. Would you mind sharing the debug output?

Redirect the stdout and stderr to a file or xclip and pastebin it:

node index.js > out.txt 2>&1
node index.js |& xclip -i -sel clipboard

The weirdest part about this is that my code works using the Python version of the module recreating the exact same requests.

Which Python module? The Python module of the same name involves plagiarism and license issues, has known vulnerabilities, spits in the face of FOSS, and is not to be trusted. It's a rip-off of the original cfscrape. Use cfscrape instead; it's maintained.

The challenge solving code in all of these libraries was written by me (including the plagiarized one) and they all generally work the same way if you're using equivalent options. The only exception being the redirect behavior. The python modules handle redirects in a non-standard way by always reusing the original request method instead of switching over to the GET method.
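The difference in redirect handling can be sketched roughly as below. This is a simplified model, not the request library's actual code: `methodAfterRedirect` is a hypothetical helper, and `keepMethod` only stands in for the effect of `followOriginalHttpMethod: true`. It assumes the common client convention that 301, 302 and 303 switch to GET while 307 and 308 preserve the method.

```javascript
// Hypothetical helper: which HTTP method a client uses after a redirect.
// keepMethod mimics the effect of `followOriginalHttpMethod: true`.
function methodAfterRedirect (statusCode, method, keepMethod = false) {
  // 307 and 308 always preserve the original method.
  if (statusCode === 307 || statusCode === 308) return method
  // A HEAD request stays a HEAD request.
  if (method === 'HEAD') return method
  // For 301, 302 and 303, clients conventionally switch to GET
  // unless told to reuse the original method.
  return keepMethod ? method : 'GET'
}

console.log(methodAfterRedirect(302, 'POST')) // GET
console.log(methodAfterRedirect(302, 'POST', true)) // POST
console.log(methodAfterRedirect(307, 'POST')) // POST
```

Under this model, a POST that hits a 302 turns into a GET by default, which would match the GET requests seen in a debug log after a redirected POST.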

A very similar and recently solved issue: https://github.com/codemanki/cloudscraper/issues/255

When sending the post request, the console sometimes prints:

request received invalid json when debug is on.

When the json option is used, the request library's onRequestResponse handler attempts to parse the response as JSON. If you get e.g. HTML instead, it will intentionally fail silently unless, as you mentioned, debugging is enabled. The user should validate response.body anyway, since valid JSON once parsed could be a number, string, empty string, boolean, null, object, or an array. For example, if you're expecting a non-empty array: const valid = Array.isArray(response.body) && response.body.length > 0
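A minimal sketch of that kind of validation, assuming the body has already been returned by a request made with json: true. Both `classifyBody` and `isNonEmptyArray` are hypothetical helper names, not part of cloudscraper or request.

```javascript
// Hypothetical helper: report what a body returned with `json: true` really is.
// A 'string' result may be a JSON string or an unparsed (e.g. HTML) response.
function classifyBody (body) {
  if (body === null) return 'null'
  if (Array.isArray(body)) return 'array'
  return typeof body // 'object', 'string', 'number', 'boolean', ...
}

// Hypothetical helper: accept only the shape the endpoint is expected to return.
function isNonEmptyArray (body) {
  return Array.isArray(body) && body.length > 0
}

console.log(classifyBody('<html>blocked</html>')) // string
console.log(classifyBody({ StatusCode: 500 })) // object
console.log(isNonEmptyArray([])) // false
```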

If you're trying to post JSON, try the json option instead of formData:

const gen = await cloudscraper.post({
    url: "https://nakedcph.com/auth/submit",
    resolveWithFullResponse: true,
    followOriginalHttpMethod: true,
    simple: false,
    headers: headers,
    json: {
      _AntiCsrfToken: Csrf,
      firstName: "CLOUDSCRAPER TEST",
      email: "cloudscraper+123456@gmail.com",
      password: "MyPassword123",
      "g-recaptcha-response": String(captchaRes),
      action: "register"
    }
  })

Cheers

luizkc commented 5 years ago

Hi @pro-src. Thanks for the quick reply.

I tried sending the request as you said and it still did not work.

I used this module in Python, is this the one that is dangerous to use?

I have read issue 255 and tried everything in there, but it still didn't resolve the issue for me. I believe issue 255 had a similar problem but not the same one, although in my case it also sends a GET at some point during the redirects even though we are specifying POST. In other words, the 2nd request, which is a POST, gets redirected and GET requests get sent to the same endpoint. Is this where the issue lies?

Here is my out.txt file as you requested! I hope I'm doing something really stupid and that the solution is simple. I do apologize in advance if that is the case.

Thanks again for the help!

Edit: it was sending a GET request actually. Just read the out.txt again.

ghost commented 5 years ago

Thanks again for the help!

Yw :smile:

I used this module in Python, is this the one that is dangerous to use?

Yes:exclamation: I used to own that pypi.org project. Unfortunately, it is owned by a very cunning individual now. You've been warned.

I've noticed that you're attempting to send pseudo HTTP/2 headers.

The underlying request library doesn't support HTTP/2 and adds the host header automatically. The same applies to python's requests library. The :authority and host headers are mutually exclusive. The :authority header should not be sent when using HTTP/1. The host header should not be sent when using HTTP/2.

HAR file excerpt

```json
{
  "request": {
    "method": "GET",
    "url": "https://www.nakedcph.com/auth/view?op=register",
    "httpVersion": "http/2.0",
    "headers": [
      { "name": ":method", "value": "GET" },
      { "name": ":authority", "value": "www.nakedcph.com" },
      { "name": ":scheme", "value": "https" },
      { "name": ":path", "value": "/auth/view?op=register" }
    ]
  }
}
```

All of the above headers should be omitted when imitating a browser's HTTP/1 request.
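One way to drop those entries before sending an HTTP/1 request is sketched below. `stripPseudoHeaders` is a hypothetical helper, not part of cloudscraper. It assumes that bare names such as authority, path and scheme (as copied from browser dev tools without the leading colon) should be removed as well.

```javascript
// Names of the HTTP/2 pseudo-headers, without the leading colon.
const PSEUDO_HEADERS = new Set(['authority', 'method', 'path', 'scheme', 'status'])

// Hypothetical helper: return a copy of `headers` without pseudo-headers,
// whether they are written as ':authority' or as 'authority'.
function stripPseudoHeaders (headers) {
  const result = {}
  for (const [name, value] of Object.entries(headers)) {
    const bare = name.toLowerCase().replace(/^:/, '')
    if (!PSEUDO_HEADERS.has(bare)) result[name] = value
  }
  return result
}

console.log(stripPseudoHeaders({
  authority: 'www.nakedcph.com',
  path: '/auth/submit',
  scheme: 'https',
  'X-Requested-With': 'XMLHttpRequest'
}))
```

Only the X-Requested-With entry survives in this example; the host header is left for the HTTP library to add.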

I'm using https://httpbin.org which responds with the request info that you sent to demonstrate the difference between the json and formData options:

Request - formData option

```js
require('cloudscraper').get({
  uri: 'https://httpbin.org/anything',
  formData: { test: 'foobar' }
}).then(console.log)
```

Request as reported by httpbin.org

```json
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "test": "foobar"
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.8",
    "Content-Length": "165",
    "Content-Type": "multipart/form-data; boundary=--------------------------864593522106200008829719",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"
  },
  "json": null,
  "method": "GET",
  "origin": "54.166.62.111, 54.166.62.111",
  "url": "https://httpbin.org/anything"
}
```

Request - json option

```js
require('cloudscraper').get({
  uri: 'https://httpbin.org/anything',
  json: { test: 'foobar' }
}).then(console.log)
```

Request as reported by httpbin.org

```js
{
  args: {},
  data: '{"test":"foobar"}',
  files: {},
  form: {},
  headers: {
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Content-Length': '17',
    'Content-Type': 'application/json',
    Host: 'httpbin.org',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'
  },
  json: { test: 'foobar' },
  method: 'GET',
  origin: '54.166.62.111, 54.166.62.111',
  url: 'https://httpbin.org/anything'
}
```

Perform the same tests with your python code to ensure that everything lines up. If I had the debug output from python maybe I could pinpoint the issue.

Prepend the following to your python code to generate similar debug output:

Python code snippet

```py
import logging

try:
    from http.client import HTTPConnection  # py3
except ImportError:
    from httplib import HTTPConnection  # py2

HTTPConnection.debuglevel = 1

logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

# import cfscrape
# scraper = cfscrape.create_scraper()
# print(scraper.get('https://google.com'))
```

ghost commented 5 years ago

With fresh eyes, the server expects the body to be application/x-www-form-urlencoded: https://github.com/request/request#forms

const cloudscraper = require('cloudscraper')
const { headers: defaultHeaders } = cloudscraper.defaultParams

const uri = new URL('https://www.nakedcph.com/auth/view?op=register')

const response = await cloudscraper.post({
  uri: new URL('/auth/submit', uri.href),
  resolveWithFullResponse: true,
  followOriginalHttpMethod: true,
  json: true,
  simple: false,
  headers: {
    ...defaultHeaders,
    Origin: uri.origin,
    Referer: uri.href,
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'X-AntiCsrfToken': csrf,
    'X-Requested-With': 'XMLHttpRequest'
  },
  form: {
    _AntiCsrfToken: csrf,
    firstName: username,
    email: username + '@gmail.com',
    password: password,
    'g-recaptcha-response': gRes,
    action: 'register'
  }
})
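For reference, this is roughly what the two body encodings look like on the wire. It is only a sketch using Node's built-in URLSearchParams with made-up field values. The request library does its own serialization for the form option, so edge cases such as spaces may be encoded differently.

```javascript
const { URLSearchParams } = require('url')

// Made-up fields for illustration only.
const fields = { test: 'foobar', action: 'register' }

// application/x-www-form-urlencoded: key=value pairs joined by '&'.
const urlencoded = new URLSearchParams(fields).toString()

// application/json: the same fields serialized as a JSON document.
const json = JSON.stringify(fields)

console.log(urlencoded) // test=foobar&action=register
console.log(json) // {"test":"foobar","action":"register"}
```

A server that only parses urlencoded bodies will not see any of the fields if it is sent the JSON or multipart variant.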

luizkc commented 5 years ago

@pro-src I love you. Thanks.

Following the above changes you made to the request, I was able to get a 500 response instead of a 403. The 500 response included a captchaError. So instead of letting Cloudscraper set the sitekey and URL to solve the captcha for, I simply hard-set them in my captcha solving function, sent the post request, and the 500 then became a 200, and now, everything works perfectly.

Takeaways:

The way the headers and/or the URI are set up in the example above definitely helped.
Hard-setting the captcha parameters can help if you are getting captcha errors. For the captcha API I'm using, the URL can be just the bare domain, such as nakedcph.com, rather than a URL with a declared protocol and path.

Hope this helps anyone else having this issue and thank you so much @pro-src for all of the help.

My issue is resolved 😊

ghost commented 5 years ago

There hasn't been a whole lot of feedback concerning the reCaptcha related API.

Per your feedback, I've sent a PR (#260) to soft-deprecate captcha.url in favor of captcha.uri, which is an instance of the built-in URL class. This new property conveniently allows captcha.uri.origin, captcha.uri.host, captcha.uri.hostname, etc. to be used in place of captcha.url. The old property will remain available, with deprecation warnings, for some time and is equivalent to captcha.uri.href, a.k.a. response.request.uri.href.
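To illustrate what those properties contain, here is a small sketch with a plain URL instance standing in for captcha.uri. The address is the register page mentioned earlier in this thread, and nothing here touches cloudscraper itself.

```javascript
const { URL } = require('url')

// Stand-in for captcha.uri; a plain instance of the built-in URL class.
const uri = new URL('https://www.nakedcph.com/auth/view?op=register')

console.log(uri.href) // https://www.nakedcph.com/auth/view?op=register
console.log(uri.origin) // https://www.nakedcph.com
console.log(uri.host) // www.nakedcph.com
console.log(uri.hostname) // www.nakedcph.com
```

A captcha solving service that wants only the domain could be given uri.hostname instead of the full href.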

Secondly, I fixed a few bugs:

The fallback siteKey was being preferred over the primary siteKey that is taken from the data-sitekey attribute of the cf.challenge.js script tag.
The fallback siteKey regular expression was too greedy and could erroneously include URL query parameters, if ever present.
The siteKey related regular expressions could return an empty match, although unlikely.

Finally, the regular expressions have been greatly improved. The siteKey can be found 4 times within Cloudflare's reCaptcha (v2) response, e.g. https://captcha.website, and this update is aware of all of them. Previously, only 2 were matched: the data-sitekey attribute and the fallback.

Could I get you to test this and report back whether pinning the URL and/or siteKey is still necessary?

npm install "git://github.com/pro-src/cloudscraper.git#recaptcha"

OR

git clone --single-branch --branch recaptcha https://github.com/pro-src/cloudscraper
cd cloudscraper

# Feel free to replace npm with yarn in any of these commands
npm install # Optionally add --production, if skipping test
npm test # Optional but recommended

# If you're going to manually update your require calls, you're done
# Otherwise register cloudscraper with NPM globally
npm link

# Proceed to create a symlink to cloudscraper in your project's node_modules/
cd ../my-project
npm link cloudscraper
node index.js

luizkc commented 5 years ago

sorry for missing this! I was about to post another issue when I saw this. Will be performing these tests today and reporting back.

luizkc commented 5 years ago

@pro-src Did some more testing. I'm able to bypass the initial Cloudflare captcha page here. However, when sending the post request we revised above, I keep getting captchaError now or some unexpected responses from the server instead of a successful account creation, like in my Python script.

Any ideas on the fix? This is my code:

var cloudscraper = require("cloudscraper")
const captchaAPI = require("imagetyperz-api")
const { headers: defaultHeaders } = cloudscraper.defaultParams

//cloudscraper.debug = true

let captchaRes

async function onCaptcha(options, response, body) {
  const captcha = response.captcha
  // solveCaptcha (defined below) is passed the href and sitekey and returns the captcha response token
  const token = await solveCaptcha(response.request.uri.href, captcha.siteKey)
  captcha.form["g-recaptcha-response"] = token
  captcha.submit()
}
// python sitekey = '6LeNqBUUAAAAAFbhC-CS22rwzkZjr_g4vMmqD_qo'
async function solveCaptcha(uri, sitekey) {
  captchaAPI.set_access_key("MY CAPTCHA SOLVING API KEY")
  const params = {
    page_url: uri,
    sitekey: sitekey
  }

  console.log("Solving captcha...")
  const id = await captchaAPI.submit_recaptcha(params)
  const token = await captchaAPI.retrieve_recaptcha(id)
  captchaRes = token
  return token
}

async function run() {
  const req = await cloudscraper.get({
    uri: "https://nakedcph.com",
    onCaptcha: onCaptcha,
    resolveWithFullResponse: true,
    simple: false,
    headers: {
      "user-agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
      Accept: "application/json, text/javascript, */*; q=0.01",
      "Accept-Language": "en-US,en;q=0.5",
      "Accept-Encoding": "gzip, deflate, br"
    }
  })
  if (req.statusCode === 200) {
    const AntiCsrfToken = req.body.match(
      /setRequestHeader\('X-AntiCsrfToken', '(.+)'/
    )[1]
    const post = await cloudscraper.post({
      uri: "https://nakedcph.com/auth/submit",
      onCaptcha: onCaptcha,
      resolveWithFullResponse: true,
      followOriginalHttpMethod: true,
      json: true,
      simple: false,
      headers: {
        ...defaultHeaders,
        "user-agent":
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
        Accept: "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "x-AntiCsrfToken": AntiCsrfToken,
        "x-Requested-With": "XMLHttpRequest",
        origin: "https://www.nakedcph.com",
        referer: "https://www.nakedcph.com/auth/view?op=register"
      },
      form: {
        _AntiCsrfToken: AntiCsrfToken,
        firstName: "My Name",
        email: "myemail@gmail.com",
        password: "MyPassword",
        "g-recaptcha-response": captchaRes,
        action: "register"
      }
    })
    console.log("ACC RESULT:")
    console.log(post.statusCode)
    console.log(post.body)
    return
  }
}
run()

Running the above code with debug on generates the attached logs.

Something to note: The sitekey I use to solve captchas in Python is different from the one scraped by Cloudscraper in the example above. Very weird that the sitekey commented out works in the Python version but yields captchaErrors in the node js version.

Thanks for helping out, hopefully we can get this working and the module bugs sorted out ASAP!

out.txt

luizkc commented 5 years ago

@pro-src any updates?

ghost commented 5 years ago

Not as of yet.

luizkc commented 5 years ago

was the issue resolved? Just wondering why this was closed with no reply.

I'm still having issues with the example I sent on the updated version! Any help is appreciated. Thanks.

ghost commented 5 years ago

@luizkc

was the issue resolved? Just wondering why this was closed with no reply.

You said:

My issue is resolved :blush:

The issue was reopened to address the bugs that I discovered and subsequently closed automatically by Github once the (fix)PR was merged.

I understand there's a new issue that's related to the old one, but would you mind opening a new issue for that and just referencing this one? I have been meaning to attend to your issue.

Very weird that the sitekey commented out works in the Python version but yields captchaErrors in the node js version.

That's very weird indeed considering how the python regex is merely:

'data-sitekey="(.+?)"'

Whereas Cloudscraper's primary siteKey regex is more robust (not to mention the fallbacks):

/\sdata-sitekey=["']?([^\s"'<>&]+)/

So if anything, the python code would be failing you. I would just create a simple (working) snippet to show the difference if there was one but I don't see it. Feel free to prove me wrong.
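For anyone who wants to compare the two patterns, a rough sketch follows. The markup and the key are made up for illustration and are not Cloudflare's actual response; `extract` is a hypothetical helper, and the first pattern is simply the Python regex rewritten as a JavaScript literal.

```javascript
// JavaScript rendering of the Python pattern quoted above.
const pythonStyle = /data-sitekey="(.+?)"/
// Cloudscraper's primary pattern, as quoted above.
const primary = /\sdata-sitekey=["']?([^\s"'<>&]+)/

// Hypothetical helper: return the first capture group, or null if no match.
function extract (regex, html) {
  const match = regex.exec(html)
  return match ? match[1] : null
}

// Made-up markup and key for illustration only.
const key = 'EXAMPLE-site_key'
const doubleQuoted = `<script data-sitekey="${key}"></script>`
const singleQuoted = `<script data-sitekey='${key}'></script>`

console.log(extract(pythonStyle, doubleQuoted)) // EXAMPLE-site_key
console.log(extract(primary, doubleQuoted)) // EXAMPLE-site_key
console.log(extract(pythonStyle, singleQuoted)) // null
console.log(extract(primary, singleQuoted)) // EXAMPLE-site_key
```

With a double-quoted attribute both patterns return the same key, so in this sketch a difference only appears when the attribute is quoted another way.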

Somebody will eventually get around to your issue. If you would like me personally to expedite your issue, consider becoming a patron.

Thanks for your understanding.

luizkc commented 5 years ago

@pro-src i do not mind becoming a patron to get assisted ASAP!

Just tell me how to go about doing that and how much I should pledge to be worth your time :)

Would love to work on my issue specifically with you if possible. If becoming a patron is what it takes to get some 1-on-1 assistance from you I will definitely do it.

ghost commented 5 years ago

@luizkc There's a couple tiers at this link and there's an option to make a custom pledge. :)

luizkc commented 5 years ago

@pro-src awesome! Just pledged. See you in Discord! Can't wait :)

JinSBU commented 5 years ago

Sorry to comment on a closed issue, but was there a resolution to this? I am also experiencing the 500 issue @pro-src

codemanki / cloudscraper

Post request fails after initial cloudflare bypass. #259