Anorov / cloudflare-scrape

A Python module to bypass Cloudflare's anti-bot page.
MIT License

Cloudflare Update. #7

Closed (zacharyrs closed 9 years ago)

zacharyrs commented 9 years ago

I'm currently having issues using your script, the same ones StevenVeshkini had. From what I've gathered, there seem to be eight 'setTimeout' calls on the page, which confuses your script. I think the one your script wants is the eighth, but as a scripting noob I have no idea how to implement the correct search.

Thanks in advance, plasmaboltrs
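
One quick way to check the count described above is to list where each 'setTimeout' call sits in the page source. This is only a minimal sketch and not part of cfscrape: the helper name find_settimeouts and the sample page string are made up for illustration, and a real challenge page will look different.

```python
import re

def find_settimeouts(page):
    """Return the character offset of every 'setTimeout(' call in page."""
    return [m.start() for m in re.finditer(r"setTimeout\(", page)]

# Hypothetical page source with three calls (not real Cloudflare output).
page = (
    "setTimeout(function(){}, 1);\n"
    "var x;\n"
    "setTimeout(function(){}, 2);\n"
    "setTimeout(function(){ a.value = 1; }, 3);"
)
offsets = find_settimeouts(page)
print(len(offsets))   # how many setTimeout calls were found
print(offsets[-1])    # offset of the last one
```

If the count is greater than one, a regex that starts at the first 'setTimeout' may pick up the wrong block.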

zacharyrs commented 9 years ago

Resolved it. Well, I patched my copy of the script and it seems to work. Here's the code, in case it works for anyone else.

import re
import PyV8
from urlparse import urlparse
import requests
from requests.adapters import HTTPAdapter

DEFAULT_USER_AGENT = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36")

class CloudflareAdapter(HTTPAdapter):
    def send(self, request, **kwargs):
        domain = request.url.split("/")[2]
        resp = super(CloudflareAdapter, self).send(request, **kwargs)

        # Check if we already solved a challenge
        if request._cookies.get("cf_clearance", domain="." + domain):
            return resp

        # Check if Cloudflare anti-bot is on
        if "a = document.getElementById('jschl-answer');" in resp.content:
            return self.solve_cf_challenge(resp, request.headers, **kwargs)

        # Otherwise, no Cloudflare anti-bot detected
        return resp

    def add_headers(self, request, **kwargs):
        # Spoof Chrome on Linux if no custom User-Agent has been set
        if "requests" in request.headers["User-Agent"]:
            request.headers["User-Agent"] = DEFAULT_USER_AGENT

    def solve_cf_challenge(self, resp, headers, **kwargs):
        headers = headers.copy()
        url = resp.url
        parsed = urlparse(url)
        domain = parsed.netloc
        page = resp.content
        kwargs.pop("params", None)  # Don't pass on params

        try:
            # Extract the arithmetic operation
            challenge = re.search(r'name="jschl_vc" value="(\w+)"', page).group(1)
            builder = re.search(r"setTimeout.+?\r?\n([\s\S]+?a\.value =.+?)\r?\n", page).group(1)
            builder = '\n'.join(builder.split('\n')[12:])  # Here's what I changed: skip the first 12 captured lines.
            builder = re.sub(r"a\.value =(.+?) \+ .+?;", r"\1", builder)
            builder = re.sub(r"\s{3,}[a-z](?: = |\.).+", "", builder)
            #print(builder)

        except AttributeError:
            # Something is wrong with the page. This may indicate Cloudflare has changed their
            # anti-bot technique. If you see this and are running the latest version,
            # please open a GitHub issue so I can update the code accordingly.
            raise IOError("Unable to parse Cloudflare anti-bots page. Try upgrading cfscrape, or "
                          "submit a bug report if you are running the latest version.")

        # Lock must be added explicitly, because PyV8 bypasses the GIL
        with PyV8.JSLocker():
            with PyV8.JSContext() as ctxt:
                # Safely evaluate the Javascript expression
                answer = str(int(ctxt.eval(builder)) + len(domain))

        params = {"jschl_vc": challenge, "jschl_answer": answer}
        submit_url = "%s://%s/cdn-cgi/l/chk_jschl" % (parsed.scheme, domain)
        headers["Referer"] = url

        return requests.get(submit_url, params=params, headers=headers, **kwargs)

def create_scraper(session=None):
    """
    Convenience function for creating a ready-to-go requests.Session object.
    You may optionally pass in an existing Session to mount the CloudflareAdapter to it.
    """
    sess = session or requests.session()
    adapter = CloudflareAdapter()
    sess.mount("http://", adapter)
    sess.mount("https://", adapter)
    return sess

The part I changed is marked with a comment in the code. Guess it was simple enough for me after all.
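
The hard-coded [12:] slice depends on exactly how many lines the page emits before the wanted block, so it may break if the page layout shifts. One possible alternative, assuming the wanted 'setTimeout' is always the last one on the page (as it appeared to be here), is to start the search from the last occurrence. The helper name extract_builder and the sample page are hypothetical; the regular expression is the one already used in the script.

```python
import re

# Same pattern the script uses to capture the challenge-building lines.
BUILDER_RE = re.compile(r"setTimeout.+?\r?\n([\s\S]+?a\.value =.+?)\r?\n")

def extract_builder(page):
    """Apply the builder regex starting at the last 'setTimeout' in page."""
    start = page.rfind("setTimeout")
    if start == -1:
        raise ValueError("no setTimeout found in page")
    match = BUILDER_RE.search(page, start)
    if match is None:
        raise ValueError("no 'a.value =' line after the last setTimeout")
    return match.group(1)

# Made-up page: one decoy setTimeout followed by the one we want.
page = (
    "setTimeout(function(){\n"
    "  var decoy = 1;\n"
    "}, 100);\n"
    "setTimeout(function(){\n"
    "  var t = 2;\n"
    "  a.value = 3+4 + t.length;\n"
    "}, 5850);\n"
)
builder = extract_builder(page)
print(builder)
```

On this sample, searching from the start of the page would capture the decoy lines as well; anchoring on the last occurrence skips them without counting lines.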

ne0ark commented 9 years ago

You could send the patch upstream.

Anorov commented 9 years ago

For some reason, the few Cloudflare-protected pages I tested on are still showing the old page. I only see 1 setTimeout. Could you give me a link to a site exhibiting this behavior, as well as a Pastebin or Gist containing the HTML source?

Anorov commented 9 years ago

Resolved with latest commit.

zacharyrs commented 9 years ago

Thanks