Anorov / cloudflare-scrape

A Python module to bypass Cloudflare's anti-bot page.
MIT License
3.38k stars 459 forks

ReferenceError: atob is not defined #215

Closed Krylanc3lo closed 5 years ago

Krylanc3lo commented 5 years ago

Hello,

I have been getting the error below for a couple of days:

js2py.internals.simplex.JsException: ReferenceError: atob is not defined

  File "/home/maxx/.local/lib/python3.6/site-packages/js2py/base.py", line 1074, in get
    return self.prototype.get(prop, throw)
  File "/home/maxx/.local/lib/python3.6/site-packages/js2py/base.py", line 1079, in get
    raise MakeError('ReferenceError', '%s is not defined' % prop)
js2py.internals.simplex.JsException: ReferenceError: atob is not defined

Is anyone else experiencing the same?

Thank you!

pawliczka commented 5 years ago

I think that it is related to #212

pawliczka commented 5 years ago

I think that we have to add this at the beginning of the node -e command:

if (typeof atob === 'undefined') {
  global.atob = function (b64Encoded) {
    return new Buffer(b64Encoded, 'base64').toString('binary');
  };
}

`$ node -e "global.Buffer = global.Buffer || require('buffer').Buffer;if (typeof atob === 'undefined') {global.atob = function (str) {return new Buffer(str, 'base64').toString('binary');};}console.log(atob('Hello'));"

ée `

Krylanc3lo commented 5 years ago

Thank you Pawel, I will try it

Do you mean adding this code to the base.py script?

pawliczka commented 5 years ago

We have to edit this line:

js = "console.log(require('vm').runInNewContext('%s', Object.create(null), {timeout: 5000}));" % js

I will create a pull request when I get back home.

lukastribus commented 5 years ago

js2py has not been used for a long time. First of all, update your code please (and install node).

Krylanc3lo commented 5 years ago

Thanks Lukas for the suggestion, using node & updating the code led to the same error but at least I am using the latest version:

ReferenceError: atob is not defined
    at evalmachine.<anonymous>:1:609
    at evalmachine.<anonymous>:1:908
    at ContextifyScript.Script.runInContext (vm.js:59:29)
    at ContextifyScript.Script.runInNewContext (vm.js:65:15)
    at Object.runInNewContext (vm.js:135:38)
    at [eval]:1:27
    at ContextifyScript.Script.runInThisContext (vm.js:50:33)
    at Object.runInThisContext (vm.js:139:38)
    at Object.<anonymous> ([eval]-wrapper:6:22)
    at Module._compile (module.js:652:30)
ERROR:root:Error executing Cloudflare IUAM Javascript. Cloudflare may have changed their technique, or there may be a bug in the script.

I will try to implement Pawel's suggestion

pawliczka commented 5 years ago

It is more complicated than I thought. But we can replace the content between atob("ZG9jdW1l") and atob("aW5uZXJIVE1M") (which evaluates to 'document.getElementById(k).innerHTML') with the text inside the HTML element whose id is given by the variable k.

pawliczka commented 5 years ago

#212 is caused by this function, which returns the ASCII code of the character at t[p]:

`(function(p){return eval((true+"")[0]+"."+([]["fill"]+"")[3]+(+(101))["to"+String["name"]](21)[1]+(false+"")[1]+(true+"")[1]+Function("return escape")()(("")["italics"]())[2]+(true+[]["fill"])[10]+(undefined+"")[2]+(true+"")[3]+(+[]+Array)[10]+(true+"")[0]+"("+p+")")}(+((!+[]+!![]+!![]+[])))) `

And we got 'Cannot read property 'charCodeAt' of undefined' because we are not passing the t variable to the nodejs call.
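
For readers who want to check what the obfuscated function calls, here is a rough Python sketch that rebuilds the string piece by piece. It assumes V8-style text for built-in functions ("function fill() { [native code] }") and approximates JavaScript's escape() with urllib.parse.quote; the helper names to_base and js_escape are made up for this illustration.

```python
from urllib.parse import quote

# Assumption: built-in functions stringify the way V8 prints them.
NATIVE = "function %s() { [native code] }"
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base(n, base):
    """Rough stand-in for Number.prototype.toString(base), positive integers only."""
    out = ""
    while n:
        n, r = divmod(n, base)
        out = DIGITS[r] + out
    return out or "0"


def js_escape(text):
    """Approximation of JavaScript escape() that is good enough for ASCII input."""
    return quote(text, safe="@*_+-./")


pieces = [
    "true"[0],                       # (true+"")[0]
    ".",                             # "."
    (NATIVE % "fill")[3],            # ([]["fill"]+"")[3]
    to_base(101, 21)[1],             # (+(101))["to"+String["name"]](21)[1]
    "false"[1],                      # (false+"")[1]
    "true"[1],                       # (true+"")[1]
    js_escape("<i></i>")[2],         # escape(("")["italics"]())[2]
    ("true" + NATIVE % "fill")[10],  # (true+[]["fill"])[10]
    "undefined"[2],                  # (undefined+"")[2]
    "true"[3],                       # (true+"")[3]
    ("0" + NATIVE % "Array")[10],    # (+[]+Array)[10]
    "true"[0],                       # (true+"")[0]
]

decoded = "".join(pieces)
print(decoded)
```

Under these assumptions the pieces spell out t.charCodeAt, so the function evaluates t.charCodeAt(p), which matches the 'charCodeAt' of undefined error when t is missing.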

pawliczka commented 5 years ago

Ok. I finally got it. I will provide code tomorrow.

Krylanc3lo commented 5 years ago

Great! looking forward to it.

Thanks again Pawel!

VeNoMouS commented 5 years ago

@pawliczka have you got any pseudocode that we can implement in the meantime for those with projects that rely on bypassing cf?

pawliczka commented 5 years ago

To solve the problem with the undefined atob, you have to replace atob("ZG9jdW1l")....atob("aW5uZXJIVE1M"), which for me is:

atob("ZG9jdW1l")+(undefined+"")[1]+(true+"")[0]+(+(+!+[]+[+!+[]]+(!![]+[])[!+[]+!+[]+!+[]]+[!+[]+!+[]]+[+[]])+[])[+!+[]]+(false+[0]+String)[20]+(true+"")[3]+(true+"")[0]+"Element"+(+[]+Boolean)[10]+(NaN+[Infinity])[10]+"Id("+(+(20))["to"+String["name"]](21)+")."+atob("aW5uZXJIVE1M")

with the data under the element identified by the variable k (for me k = 'cf-dn-lZTYtMjTTnWU';):

<div style="display:none;visibility:hidden;" id="cf-dn-lZTYtMjTTnWU">+((!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+[])+(+!![])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![])+(!+[]+!![]+!![]+!![]+!![])+(+[])+(!+[]+!![]+!![]+!![])+(!+[]+!![])+(!+[]+!![]+!![]+!![])+(!+[]+!![]+!![]))/+((+!![]+[])+(+!![])+(+[])+(+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![])+(!+[]+!![]+!![]+!![])+(!+[]+!![]+!![]+!![]+!![])+(!+[]+!![]+!![]+!![]+!![]+!![])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))</div>

You also have to solve the problem with TypeError: Cannot read property 'charCodeAt' of undefined and with a.value. You can do it this way:

js = js.replace('a.value','a')
js = js.replace("; 121",'')
js = "console.log(require('vm').runInNewContext('var a; var t = \"%s\";%s', Object.create(null), {timeout: 5000}));" % (domain, js)

Now it should work fine. I think that cf provides the new challenge algorithm only to some users. Some domains are still using the old challenge algorithm, as in #212.
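
A minimal sketch of the lookup described above: find the id stored in k, then pull the text out of the hidden div with that id. It assumes the page contains a literal k = '...'; assignment and a plain div as in the example; the function name challenge_div_text is hypothetical.

```python
import re


def challenge_div_text(body):
    """Return the text of the hidden div whose id is stored in k, or None."""
    # Assumption: the challenge script contains a literal  k = '...';
    k_match = re.search(r"k = '([^']+)';", body)
    if k_match is None:
        return None

    # Match the div carrying that id, whatever other attributes it has.
    pattern = r'<div[^>]*\bid="%s"[^>]*>(.*?)</div>' % re.escape(k_match.group(1))
    div_match = re.search(pattern, body, re.DOTALL)
    return div_match.group(1) if div_match else None


sample = (
    "k = 'cf-dn-example';\n"
    '<div style="display:none;visibility:hidden;" id="cf-dn-example">+((!+[]))</div>'
)
print(challenge_div_text(sample))
```

The returned expression is what would be substituted for the atob(...)...atob(...) span before the script is evaluated.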

Krylanc3lo commented 5 years ago

In which file do you find the atob function?

VeNoMouS commented 5 years ago

For node, atob is available as a package: https://www.npmjs.com/package/atob. @Krylanc3lo personally I backport most of the changes and still use js2py, and avoid node at all costs.

Krylanc3lo commented 5 years ago

Thanks @VeNoMouS. Do you know what I have to update on the js2py side?

pawliczka commented 5 years ago

atob("ZG9jdW1l")+(undefined+"")[1]+(true+"")[0] = document

+(+(+!+[]+[+!+[]]+(!![]+[])[!+[]+!+[]+!+[]]+[!+[]+!+[]]+[+[]])+[])[+!+[]]+(false+[0]+String)[20]+(true+"")[3]+(true+"")[0]+"Element"+(+[]+Boolean)[10]+(NaN+[Infinity])[10]+"Id("+(+(20))["to"+String["name"]](21)+")." = .getElementById(k).

+atob("aW5uZXJIVE1M") = innerHTML

Together: document.getElementById(k).innerHTML

VeNoMouS commented 5 years ago

@Krylanc3lo looking into it myself

pawliczka commented 5 years ago

@VeNoMouS Could you please send me a diff or a link to your fork when you are done? I'm going to sleep now.

VeNoMouS commented 5 years ago

@pawliczka sweet as mate :)

VeNoMouS commented 5 years ago

lol this jsfuck is really annoying when trying to work out what it's attempting to do...

ghost commented 5 years ago

@VeNoMouS I just wrote some code to take the pain out of it. https://github.com/codemanki/cloudscraper/issues/170#issuecomment-478203909

VeNoMouS commented 5 years ago

@pro-src im just doing the same in python ;P nice job :)

ghost commented 5 years ago

@VeNoMouS Also here is a node based definition for atob.

function atob(str) {
  return Buffer.from(str, 'base64').toString('binary');
}

VeNoMouS commented 5 years ago

ah thanks, i ended up just replacing it with regex base64'd till i got it all working
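
A rough Python sketch of the "replace atob with a regex" idea, assuming the challenge script only calls atob with double-quoted base64 literals. ATOB_CALL and inline_atob are names made up for this illustration and are not part of cfscrape.

```python
import json
import re
from base64 import b64decode

# Assumption: atob is always called with a double-quoted base64 literal.
ATOB_CALL = re.compile(r'atob\("([A-Za-z0-9+/=]+)"\)')


def inline_atob(js):
    """Replace every atob("...") call with the decoded string literal."""
    def decode(match):
        # latin-1 mirrors atob's one-byte-per-character "binary" string.
        text = b64decode(match.group(1)).decode("latin-1")
        # json.dumps produces a quoted, escaped literal that is also valid JS.
        return json.dumps(text)

    return ATOB_CALL.sub(decode, js)


print(inline_atob('atob("ZG9jdW1l")+"nt."+atob("aW5uZXJIVE1M")'))
```

After this substitution the script no longer references atob, so an interpreter without that global (node or js2py) has nothing to complain about on that front.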

pawliczka commented 5 years ago

#206 ^^ @VeNoMouS could you please share your solution?

ghost commented 5 years ago

https://www.npmjs.com/package/cf-debug

VeNoMouS commented 5 years ago

so.... my rewrite produced this... it's a bit of a hack atm...

image

However, it breaks under js2py... I'm trying to work that out.

  File "/usr/local/lib/python2.7/dist-packages/js2py/base.py", line 1001, in callprop
    '%s is not a function' % cand.typeof())
js2py.internals.simplex.JsException: TypeError: 'undefined' is not a function

VeNoMouS commented 5 years ago

Ok ... i found one of the root causes of the "undefined" but still got another issue i think...

("")["italics"]() is str.italics()... js2py doesn't know how to handle it..

VeNoMouS commented 5 years ago

import logging
import random
import re
from pprint import pprint
from base64 import b64decode

from copy import deepcopy
from time import sleep

#from lib import js2py
import js2py
from lib.requests.sessions import Session

try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse

__version__ = "1.9.5"

# Originally written by https://github.com/Anorov/cloudflare-scrape
# Rewritten by VeNoMouS - <venom@gen-x.co.nz> for https://github.com/VeNoMouS/Sick-Beard - 24/3/2018 NZDT

DEFAULT_USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 7.0; Moto G (5) Build/NPPS25.137-93-8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_4 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11B554a Safari/9537.53",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
]

DEFAULT_USER_AGENT = random.choice(DEFAULT_USER_AGENTS)

BUG_REPORT = """\
Cloudflare may have changed their technique, or there may be a bug in the script.
"""

ANSWER_ACCEPT_ERROR = """\
The challenge answer was not properly accepted by Cloudflare. This can occur if \
the target website is under heavy load, or if Cloudflare is experiencing issues. You can
potentially resolve this by increasing the challenge answer delay (default: 8 seconds). \
For example: cfscrape.create_scraper(delay=15)
"""

class CloudflareScraper(Session):
    def __init__(self, *args, **kwargs):
        self.delay = kwargs.pop("delay", 8)
        super(CloudflareScraper, self).__init__(*args, **kwargs)

        if "requests" in self.headers["User-Agent"]:
            # Set a random User-Agent if no custom User-Agent has been set
            self.headers["User-Agent"] = DEFAULT_USER_AGENT

    def is_cloudflare_challenge(self, resp):
        return (
            resp.status_code == 503
            and resp.headers.get("Server", "").startswith("cloudflare")
            and b"jschl_vc" in resp.content
            and b"jschl_answer" in resp.content
        )

    def request(self, method, url, *args, **kwargs):
        resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)

        # Check if Cloudflare anti-bot is on
        if self.is_cloudflare_challenge(resp):
            resp = self.solve_cf_challenge(resp, **kwargs)

        return resp

    def solve_cf_challenge(self, resp, **original_kwargs):
        sleep(self.delay)  # Cloudflare requires a delay before solving the challenge

        body = resp.text

        # Raw string avoids invalid-escape warnings; rq is None if the hidden div is absent.
        rq = re.search(r'<div style="display:none;visibility:hidden;" id="(.*?)">(.*?)</div>', body, re.MULTILINE | re.DOTALL)

        body = re.sub(
            r'function\(p\){var p = eval\(eval\(atob\(".*?"\)\+\(undefined\+""\)\[1\]\+\(true\+""\)\[0\]\+\(\+\(\+!\+\[\]\+\[\+!\+\[\]\]\+\(!!\[\]\+\[\]\)\[!\+\[\]\+!\+\[\]\+!\+\[\]\]\+\[!\+\[\]\+!\+\[\]\]\+\[\+\[\]\]\)\+\[\]\)\[\+!\+\[\]\]\+\(false\+\[0\]\+String\)\[20\]\+\(true\+""\)\[3\]\+\(true\+""\)\[0\]\+"Element"\+\(\+\[\]\+Boolean\)\[10\]\+\(NaN\+\[Infinity\]\)\[10\]\+"Id\("\+\(\+\(20\)\)\["to"\+String\["name"\]\]\(21\)\+"\)."\+atob\(".*?"\)\)\); return \+\(p\)}\(\);',
            "{};".format(rq.group(2)),
            body
        )

        parsed_url = urlparse(resp.url)
        domain = parsed_url.netloc

        submit_url = "%s://%s/cdn-cgi/l/chk_jschl" % (parsed_url.scheme, domain)

        cloudflare_kwargs = deepcopy(original_kwargs)
        params = cloudflare_kwargs.setdefault("params", {})
        headers = cloudflare_kwargs.setdefault("headers", {})
        headers["Referer"] = resp.url

        try:
            params["jschl_vc"] = re.search(r'name="jschl_vc" value="(\w+)"', body).group(1)
            params["pass"] = re.search(r'name="pass" value="(.+?)"', body).group(1)
            params["s"] = re.search(r'name="s"\svalue="(?P<s_value>[^"]+)', body).group('s_value')

        except Exception as e:
            # Something is wrong with the page.
            # This may indicate Cloudflare has changed their anti-bot
            # technique. If you see this and are running the latest version,
            # please open a GitHub issue so I can update the code accordingly.
            # Format the exception itself: e.message does not exist on Python 3.
            raise ValueError("Unable to parse Cloudflare anti-bots page: %s %s" % (e, BUG_REPORT))

        # Solve the Javascript challenge
        params["jschl_answer"] = self.solve_challenge(body, domain)
        pprint(params)

        # Requests transforms any request into a GET after a redirect,
        # so the redirect has to be handled manually here to allow for
        # performing other types of requests even as the first request.
        method = resp.request.method
        cloudflare_kwargs["allow_redirects"] = False
        redirect = self.request(method, submit_url, **cloudflare_kwargs)
        pprint(redirect.content)
        #exit()

        redirect_location = urlparse(redirect.headers["Location"])
        if not redirect_location.netloc:
            redirect_url = "%s://%s%s" % (parsed_url.scheme, domain, redirect_location.path)
            return self.request(method, redirect_url, **original_kwargs)
        return self.request(method, redirect.headers["Location"], **original_kwargs)

    def solve_challenge(self, body, domain):
        try:
            js = re.search(r"setTimeout\(function\(\){\s+(var "
                        r"s,t,o,p,b,r,e,a,k,i,n,g,f.+?\r?\n[\s\S]+?a\.value =.+?)\r?\n", body).group(1)

        except Exception:
            raise ValueError("Unable to identify Cloudflare IUAM Javascript on website. %s" % BUG_REPORT)

        js = re.sub(r"a\.value = ((.+).toFixed\(10\))?", r"\1", js)
        js = re.sub(r"\s{3,}[a-z](?: = |\.).+", "", js).replace("t.length", str(len(domain)))

        js = js.replace('; 121', '')

        js = js.replace('function(p){return eval((true+"")[0]+"."+([]["fill"]+"")[3]+(+(101))["to"+String["name"]](21)[1]+(false+"")[1]+(true+"")[1]+Function("return escape")()(("")["italics"]())[2]+(true+[]["fill"])[10]+(undefined+"")[2]+(true+"")[3]+(+[]+Array)[10]+(true+"")[0]+"("+p+")")}', 't.charCodeAt')

        # Strip characters that could be used to exit the string context
        # These characters are not currently used in Cloudflare's arithmetic snippet
        js = re.sub(r"[\n\\']", "", js)

        if "toFixed" not in js:
            raise ValueError("Error parsing Cloudflare IUAM Javascript challenge. %s" % BUG_REPORT)

        try:
            js = "a = {}; t = \"" + domain + "\";" + js
            result = js2py.eval_js(js)

        except Exception:
            logging.error("Error executing Cloudflare IUAM Javascript. %s" % BUG_REPORT)
            raise

        try:
            float(result)
        except Exception:
            raise ValueError("Cloudflare IUAM challenge returned unexpected answer. %s" % BUG_REPORT)

        return result

    @classmethod
    def create_scraper(cls, sess=None, **kwargs):
        """
        Convenience function for creating a ready-to-go CloudflareScraper object.
        """
        scraper = cls(**kwargs)

        if sess:
            attrs = ["auth", "cert", "cookies", "headers", "hooks", "params", "proxies", "data"]
            for attr in attrs:
                val = getattr(sess, attr, None)
                if val:
                    setattr(scraper, attr, val)

        return scraper

    ## Functions for integrating cloudflare-scrape with other applications and scripts

    @classmethod
    def get_tokens(cls, url, user_agent=None, **kwargs):
        scraper = cls.create_scraper()
        if user_agent:
            scraper.headers["User-Agent"] = user_agent

        try:
            resp = scraper.get(url, **kwargs)
            resp.raise_for_status()
        except Exception as e:
            logging.error("'%s' returned an error. Could not collect tokens." % url)
            raise

        domain = urlparse(resp.url).netloc
        cookie_domain = None

        for d in scraper.cookies.list_domains():
            if d.startswith(".") and d in ("." + domain):
                cookie_domain = d
                break
        else:
            raise ValueError("Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM (\"I'm Under Attack Mode\") enabled?")

        return ({
                    "__cfduid": scraper.cookies.get("__cfduid", "", domain=cookie_domain),
                    "cf_clearance": scraper.cookies.get("cf_clearance", "", domain=cookie_domain)
                },
                scraper.headers["User-Agent"]
               )

    @classmethod
    def get_cookie_string(cls, url, user_agent=None, **kwargs):
        """
        Convenience function for building a Cookie HTTP header value.
        """
        tokens, user_agent = cls.get_tokens(url, user_agent=user_agent, **kwargs)
        return "; ".join("=".join(pair) for pair in tokens.items()), user_agent

create_scraper = CloudflareScraper.create_scraper
get_tokens = CloudflareScraper.get_tokens
get_cookie_string = CloudflareScraper.get_cookie_string

I dunno... it seems like it works... but something is wrong...

pawliczka commented 5 years ago

I'm using #206 impl and I'm constantly getting KeyError: 'location'. With your impl the same :( Something else has changed I think

pawliczka commented 5 years ago

We are getting 403 code: image

VeNoMouS commented 5 years ago

@pawliczka yea confirmed, 403 for me as well.

VeNoMouS commented 5 years ago

I also noticed my s param is always longer, i.e.

3a6002246ad63c4993313cb0399bdbb8d0e9b45b-1553942508-1800-AbDVH86ld1XqRlLVE9OWGYQOVasTx6qOsfFLmhzyZnkx+QSWtR/E4MrwizZGjZW9QnofW4wm0DzHcJVZQh1U/ZRaq35yTt/2nkpRKwwbgo5erVnZ9xN+JWP4QLj7SKG76S2TQ3GMNP0x27IOkvOCiYQ=

than say burp..

3273ae847b6f60cb064e7e226833bb895e0a27aa-1553940805-1800-AVa7UnzBT0LN9tEsGhdyuYaJkEn1iQQAXJQeBZ3rcvm2gL8EUBuPGRkFbwGBIwdhhrs4ngAeCZPnudHrEUagggBAdI9BDvLXMme9lksX1Q3DcTkkPneTDg554HRjJ3cbvQ==

pawliczka commented 5 years ago

I think that this is something related to the '+' at the end of s. But I do not know what the + means :(( +JWP4QLj7SKG76S2TQ3GMNP0x27IOkvOCiYQ=

VeNoMouS commented 5 years ago

Ignore that long vs short and the +; I just did a fresh burp and got

12abc9621d577b480398b15fe6984c47533e560d-1553942936-1800-Afn+s585Y44v3vwSCBFlWbiLQIUPcCuY/JWwuUrdbWXMh8FtkF38FMVRcQ6fjM3xBxt6TWb0ap+nWvU1AhAQiTMZuutumksS6ScSEChw2xJo8x9efxR5jxeQH6KcY0anXA==

pawliczka commented 5 years ago

We are lost😒

VeNoMouS commented 5 years ago

@pawliczka i dumped out my burp response into a file and injected it into the cfscrape... this is my burp

image

this is my params payload in cfscrape


{'jschl_answer': '-10.8784096873',
 'jschl_vc': '0dd277f3ee7d26fd9fc79497b5d6a8d7',
 'pass': '1553942940.979-Xdc4gdP8aN',
 's': '12abc9621d577b480398b15fe6984c47533e560d-1553942936-1800-Afn+s585Y44v3vwSCBFlWbiLQIUPcCuY/JWwuUrdbWXMh8FtkF38FMVRcQ6fjM3xBxt6TWb0ap+nWvU1AhAQiTMZuutumksS6ScSEChw2xJo8x9efxR5jxeQH6KcY0anXA=='}

it's identical as far as i can see..

pawliczka commented 5 years ago

@VeNoMouS ok I see. But you still have 403?

VeNoMouS commented 5 years ago

ok @pawliczka got it.. the parameters have to be in specific order...

        try:
            #params["jschl_vc"] = re.search(r'name="jschl_vc" value="(\w+)"', body).group(1)
            #params["pass"] = re.search(r'name="pass" value="(.+?)"', body).group(1)
            #params["s"] = re.search(r'name="s"\svalue="(?P<s_value>[^"]+)', body).group('s_value')
            submit_url = '{}?s={}&jschl_vc={}&pass={}&jschl_answer={}'.format(
                submit_url,
                re.search(r'name="s"\svalue="(?P<s_value>[^"]+)', body).group('s_value'),
                re.search(r'name="jschl_vc" value="(\w+)"', body).group(1),
                re.search(r'name="pass" value="(.+?)"', body).group(1),
                self.solve_challenge(body, domain)
            )

resulted in

{'Content-Length': '159', 'Server': 'cloudflare', 'Connection': 'keep-alive', 'Location': 'https://ww5.justdubs.me/css/style.css', 'Date': 'Sat, 30 Mar 2019 11:30:08 GMT', 'CF-RAY': '4bf9bff27ce5a41d-AKL', 'Content-Type': 'text/html', 'X-Frame-Options': 'SAMEORIGIN'}

pawliczka commented 5 years ago

OMG! Good job

VeNoMouS commented 5 years ago

I dunno, something is still broken for me... I've been working on this too long, it's 1am. It looks like it auths and gives me a location, but it keeps looping between 302 and 503... If you don't resolve it tonight I'll try to pick it back up tomorrow, but I've had enough for tonight.

pawliczka commented 5 years ago

I replaced params with an OrderedDictionary. And now I get an instant reCAPTCHA 😆
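
For reference, a minimal sketch of building the submit URL with the parameters in the order used above (s, jschl_vc, pass, jschl_answer). It assumes Python 3 and standard percent-encoding of the values; whether Cloudflare accepts the result is not established by this thread. The function name build_submit_url and the sample values are hypothetical.

```python
from collections import OrderedDict
from urllib.parse import urlencode


def build_submit_url(submit_url, s, jschl_vc, password, answer):
    """Append the challenge parameters to submit_url in a fixed order."""
    params = OrderedDict([
        ("s", s),
        ("jschl_vc", jschl_vc),
        ("pass", password),
        ("jschl_answer", answer),
    ])
    # urlencode keeps the insertion order and percent-encodes '+', '/' and '='.
    return "%s?%s" % (submit_url, urlencode(params))


print(build_submit_url(
    "https://example.com/cdn-cgi/l/chk_jschl",
    "abc+def/ghi==", "0dd277f3", "1553942940.979-Xdc4gdP8aN", "-10.8784096873",
))
```

Building the query string by hand like this sidesteps any reordering that might happen when a plain dict is passed as params.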

VeNoMouS commented 5 years ago

I did try that, but when i posted, it didnt look in order... shrug, least its going for everyone :)

Krylanc3lo commented 5 years ago

Thanks for helping. By using the code shared by @VeNoMouS and the params part, I get a 302 error as well.

@pawliczka, does the OrderedDictionary part fix everything?

pawliczka commented 5 years ago

@Krylanc3lo nothing. Always captcha

Krylanc3lo commented 5 years ago

OK thanks

ghost commented 5 years ago

@VeNoMouS

String.prototype.italics = function () {
  return '<i>' + this + '</i>'
};

var empty = "";
console.log(empty.italics(), "xyz".italics()); // "<i></i> <i>xyz</i>"

ghost commented 5 years ago

@pawliczka

I think that this is something related to the '+' at the end of s. But I do not know what the + means :(( +JWP4QLj7SKG76S2TQ3GMNP0x27IOkvOCiYQ=

Base64, as the name suggests, is a base-64 number system that uses 64 digits. The "+" is the second-to-last digit. If we treat the chars strictly as digits, the decimal (base 10) value of "+" is 62. The "=" is used for padding. See: https://stackoverflow.com/questions/6916805/why-does-a-base64-encoded-string-have-an-sign-at-the-end

Digits: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
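
A small Python check of the digit values and the padding, using only the standard library. The name B64_DIGITS is made up for this illustration.

```python
import string
from base64 import b64encode

# The standard base64 alphabet, in digit order 0..63.
B64_DIGITS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

plus_value = B64_DIGITS.index("+")   # value of the "+" digit
slash_value = B64_DIGITS.index("/")  # value of the "/" digit

# One input byte fills only two digits, so two "=" pad the group to four.
padded = b64encode(b"a")

print(plus_value, slash_value, padded)
```

So a "+" or "/" inside the s value is ordinary data, and a trailing "=" only marks padding.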

VeNoMouS commented 5 years ago

I’m just on a road trip with my gf today, but I will pick this back up later tonight and investigate further past where I got up to last night :)

sudovijay commented 5 years ago

Even with the same query strings and headers, it just doesn't give clearance, only a reCAPTCHA, even if we pass the same user agent or referrer. It always ends with a 302 or 403.