Anorov / cloudflare-scrape

A Python module to bypass Cloudflare's anti-bot page.
MIT License
3.33k stars, 455 forks

Captcha issues #235

Closed Anorov closed 5 years ago

Anorov commented 5 years ago

The latest version of cfscrape should not encounter captchas, unless you're using Tor or another IP that Cloudflare has blacklisted. If you're getting a captcha error, first please run pip install -U cfscrape and try again. If you're still getting an error, please leave a comment.


Please put all captcha challenge-related issues here.

Please run the following to determine the OpenSSL version compiled with your Python binary and include the output in your comment:

$ python3 -c 'import ssl; print(ssl.OPENSSL_VERSION)'
OpenSSL 1.1.1b  26 Feb 2019

(Or python instead of python3 if running on Python 2.)
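If a shell one-liner is inconvenient, the same details can be collected from inside Python. This is a minimal sketch using only the standard `ssl` module; the `describe_openssl` helper name is made up for illustration and is not part of cfscrape.

```python
import ssl


def describe_openssl():
    """Collect the OpenSSL details that are useful in a captcha report."""
    return {
        # Human-readable version string, the same value the one-liner prints
        "version": ssl.OPENSSL_VERSION,
        # Numeric tuple, easier to compare than the string
        "version_info": ssl.OPENSSL_VERSION_INFO,
        # Whether this Python build can negotiate TLS 1.3 at all
        "has_tls13": ssl.HAS_TLSv1_3,
    }


if __name__ == "__main__":
    for key, value in describe_openssl().items():
        print(key, "=", value)
```

Pasting all three values into a comment should make it easier to compare reports between machines.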

Anorov commented 5 years ago
$ python3 -c 'import ssl; print(ssl.OPENSSL_VERSION)'
OpenSSL 1.1.1b  26 Feb 2019

@pro-src Despite having OpenSSL 1.1.1, I'm getting a captcha every time I try to do a scraper.get("https://pro-src.com"). I had the same issue with an older version of OpenSSL. I'm on a normal residential connection, and never experienced any captcha issues using cloudflare-scrape in the past.

It appears Cloudflare's recently started doing much more aggressive anomaly and bot detection. They may be checking for discrepancies between the network traffic (TCP, SSL) fingerprints and the ostensible user-agent's known legitimate fingerprints, among other things.

When I have some more free time, I'll dig into this as well.

ghost commented 5 years ago

@Anorov @lukele and anybody else who is experiencing this issue.

I'm unable to reproduce this so I'm going to need you guys to generate some reports with this script.

This must be run from within the cloned cloudflare-scrape repo if you haven't installed cfscrape with pip.

git clone https://gist.github.com/pro-src/17654ec3f949b0b17bd1a4aa1b4136b9 temp
cp temp/report.py report.py
python report.py

My report: http://dpaste.com/38GSGJM Updated: http://dpaste.com/0RAENRJ

lukele commented 5 years ago

@pro-src thanks, just ran it. Seeing

$ python3 report.py 
Already reported ID detected.

If I disable mounting of the CustomAdapter I'm seeing the CaptchaError again.

ghost commented 5 years ago

Unfortunately, that means your report is exactly the same as mine and nothing unique was identified. The CustomAdapter checks whether the problem occurs when using TLSv1.1, which it doesn't. I'll need to update the script to use ssllabs since it has better inspections. The clientHello extensions are still the greatest suspect.

lukele commented 5 years ago

Hmm... seeing that the user-agent is still set randomly, we should use a single one for these tests, to eliminate any differences. It might not be related to that, but can't hurt.

Just realized, of course you thought of that.

ghost commented 5 years ago

@lukele I've updated the script to address the issues that I mentioned but I expect the same result. If we get the same result, I'll update it to use ssllabs since it has better inspections.

Typical output if not unique: https://gist.github.com/pro-src/5e603aee2fc8d183624be6d3fda2b7eb

lukele commented 5 years ago

Sure looks like same result. Only difference is

Cloudflare responded with CAPTCHA under normal conditions

which is to be expected.

lukele commented 5 years ago

@pro-src I noticed that my latest Chrome version, which supports TLS 1.3, also uses HTTP/2 for some connections (Cloudflare ESNI Checker - https://73af10a0-12b8-44bb-a685-6814f3c71e76.encryptedsni.com/cdn-cgi/trace). So in order to mimic Chrome as closely as possible, I've mounted an HTTP/2 adapter (from hyper) for requests in a test script, and lo and behold, no more captcha challenge. The cipher list, however, seems to remain the same.

from collections import OrderedDict

import cfscrape
from hyper.contrib import HTTP20Adapter

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
# Ordered so that the headers are sent in a fixed, Chrome-like sequence
headers = OrderedDict(
    (
        ("Host", None),
        ("Connection", "keep-alive"),
        ("Upgrade-Insecure-Requests", "1"),
        ("User-Agent", user_agent),
        (
            "Accept",
            "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        ),
        ("Accept-Encoding", "gzip, deflate"),
        ("Accept-Language", "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"),
    )
)

scraper = cfscrape.create_scraper()
# Route all https:// requests through hyper's HTTP/2 adapter
scraper.mount('https://', HTTP20Adapter())
scraper.get("https://pro-src.com", headers=headers)

Not sure what to make of it yet.

ghost commented 5 years ago

:thinking: Me neither. If that was solely the problem, wouldn't we expect everybody to be getting a CAPTCHA? It could be a reasonable workaround for anybody who does have this problem though. Does the adapter fall back to HTTP 1.1 when HTTP 2 isn't supported? Does switching the adapter on and off whenever solving the challenge work to bypass the CAPTCHA?

For example:

from requests.adapters import HTTPAdapter
custom = HTTPAdapter()
original = scraper.get_adapter('https://')
# These should be the same by default
assert scraper.get_adapter('http://') is original
scraper.mount('https://', custom)
scraper.mount('http://', custom)
assert scraper.get_adapter('https://') is custom
assert scraper.get_adapter('http://') is custom
# Switch back
# scraper.mount('https://', original)
# scraper.mount('http://', original)
# I'm only showing an alternative to calling `mount` here
scraper.adapters.update({ 'https://': original, 'http://': original })
lukele commented 5 years ago

Before looking into that, two questions:

1.) Do you have pyopenssl installed?
2.) What is the output of the following code?

import urllib3.contrib.pyopenssl
print(urllib3.contrib.pyopenssl)
ghost commented 5 years ago

Python 2 and 3

>>> import urllib3.contrib.pyopenssl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/urllib3/contrib/pyopenssl.py", line 46, in <module>
    import OpenSSL.SSL
ModuleNotFoundError: No module named 'OpenSSL'
lukele commented 5 years ago

Ah ok, thanks. So that's the same as well.

ghost commented 5 years ago

The script has been updated to check ssllabs which provides protocol details such as the signature algorithms that are being used. I've updated my report for comparison: http://dpaste.com/0RAENRJ

While I could be wrong, I think that if the test yields the same results for ssllabs, then it's likely not TLS/SSL causing the problem, or the IPs that I'm using are exempt from the checks. If the results are the same and it is something to do with TLS/SSL, IDK that we can narrow it down without recompiling OpenSSL. So, I'm really hoping that something shows...

Also if you edit the script to remove the known hashes, you'll generate a full report even if it's not unique.
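The report script itself is not reproduced in this thread, so the following is only a guess at the general idea behind "known hashes": reduce an ordered cipher list to a short digest so that two machines can be compared without pasting the full list. The function names are hypothetical, and SHA-256 is an arbitrary choice; the thread does not say which digest the script uses.

```python
import hashlib
import ssl


def cipher_fingerprint(cipher_names):
    """Return a SHA-256 hex digest of an ordered list of cipher names."""
    joined = ":".join(cipher_names)
    return hashlib.sha256(joined.encode("ascii")).hexdigest()


def local_cipher_names():
    """Names of the ciphers enabled in Python's default client context."""
    ctx = ssl.create_default_context()
    return [cipher["name"] for cipher in ctx.get_ciphers()]


if __name__ == "__main__":
    names = local_cipher_names()
    print(len(names), "ciphers, fingerprint", cipher_fingerprint(names))
```

Because the names are joined in order, two lists with the same ciphers in a different order produce different digests, which matters if the ordering is part of what is being fingerprinted.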

lukele commented 5 years ago

I am in fact suspecting that you are whitelisted somehow. But you mentioned you performed tests from different hosts, do I recall that correctly?

hyper uses the default ciphers from the ssl module. If I replace urllib3's ciphers with those from hyper (so the default ciphers from the ssl module), I'm not seeing the captcha.

If I remove ECDHE-RSA-AES256-GCM-SHA384 from urllib3's cipher list, I don't see the captcha any more... 🤷‍♂

lukele commented 5 years ago

Ran your report script. There's a very subtle difference: SHA1 in the signature algorithms (SHA1/ECDSA, SHA1/RSA, SHA1/DSA).

http://dpaste.com/0ZVTNAQ

ghost commented 5 years ago

Yep, I'm tunneling through heroku to test as well.

> If I remove ECDHE-RSA-AES256-GCM-SHA384 from urllib3's cipher list, I don't see the captcha any more... :man_shrugging:

A single cipher removed from the list fixes this. :thinking: Maybe it isn't the particular cipher but the list as a whole that Cloudflare is flagging?

ECDHE-RSA-AES256-GCM-SHA384 is used in Chrome as well. You'll see TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 in this screenshot:

Screenshot ![www ssllabs com_ssltest_viewMyClient html](https://user-images.githubusercontent.com/34285059/57589001-fd24f780-74e2-11e9-9ca6-d679bf61a96b.png)

I think maybe we should just brute force it. I mean, that would identify how this is working wouldn't it? lol
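One way to read "brute force" here is to remove one cipher at a time and repeat the request. The sketch below assumes a caller-supplied `probe` function (hypothetical, not part of cfscrape) that makes one request with the given cipher list and returns True when a CAPTCHA was served; how it builds the request is deliberately left out.

```python
def find_trigger_ciphers(ciphers, probe):
    """Return the ciphers whose individual removal makes the CAPTCHA go away.

    `probe(cipher_list)` is a hypothetical callable that performs one request
    with the given ciphers and returns True when a CAPTCHA was served.
    """
    if not probe(ciphers):
        # The full list does not trigger a CAPTCHA, so there is nothing to find.
        return []

    triggers = []
    for candidate in ciphers:
        remaining = [name for name in ciphers if name != candidate]
        if not probe(remaining):
            triggers.append(candidate)
    return triggers


if __name__ == "__main__":
    # Fake probe for demonstration only: pretends the CAPTCHA appears
    # whenever both of these ciphers are offered together.
    def fake_probe(cipher_list):
        return "AES128-SHA" in cipher_list and "AES256-SHA" in cipher_list

    sample = ["ECDHE-RSA-AES256-GCM-SHA384", "AES256-SHA", "AES128-SHA"]
    print(find_trigger_ciphers(sample, fake_probe))
```

If several unrelated ciphers come back as triggers, that would point to the list as a whole being flagged and not one particular cipher.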

ghost commented 5 years ago

@lukele I'll whip up something later on if you want to run it for us? :P

lukele commented 5 years ago

> A single cipher removed from the list fixes this. 🤔 Maybe it isn't the particular cipher but the list as a whole that Cloudflare is flagging?

That's what I was thinking earlier: that urllib3 has a signature in that sense. But removing just any cipher doesn't help. For example, removing any of the first 4 (TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256:ECDHE-ECDSA-AES256-GCM-SHA384) doesn't change a thing.

> ECDHE-RSA-AES256-GCM-SHA384 is used in Chrome as well.

Haha, I checked the same thing when I found the cipher :)

lukele commented 5 years ago

I'll try to remove ciphers from the end of the list, which are weaker ciphers anyway. I'd still love to know a) why you are not seeing the same results and b) what the real reason is.

ghost commented 5 years ago

Because Cloudflare loves me :rofl: Seriously though, I want to know as well.

ghost commented 5 years ago

> Ran your report script. there's a very subtle difference. SHA1 in signature algorithms: SHA1/ECDSA, SHA1/RSA, SHA1/DSA

I just saw this, thanks.

lukele commented 5 years ago

Alright, *** it. Removing the AES128-SHA cipher from the list would be enough on my system. Should we just add an adapter that removes this cipher and see how many still report captchas?

ghost commented 5 years ago

@lukele Let's see if this solves it, SHA1 is insecure anyway.

import cfscrape
import urllib3

from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context, DEFAULT_CIPHERS

# Append an exclusion so that every SHA1-based cipher suite is dropped
DEFAULT_CIPHERS += ':!SHA1'
# urllib3.util.ssl_.DEFAULT_CIPHERS = DEFAULT_CIPHERS

class CustomAdapter(HTTPAdapter):
    """Transport adapter whose TLS context uses the modified cipher string."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = create_urllib3_context(ciphers=DEFAULT_CIPHERS)
        super(CustomAdapter, self).init_poolmanager(*args, ssl_context=ctx, **kwargs)

scraper = cfscrape.create_scraper()
# Only https:// traffic goes through the custom adapter
scraper.mount('https://', CustomAdapter())
print(scraper.get('https://pro-src.com').content)
lukele commented 5 years ago

Sure does!

ghost commented 5 years ago

> Alright, *** it. Removing the AES128-SHA cipher from the list would be enough on my system. Should we just add an adapter that removes this cipher and see how many still report captchas?

If nothing else avails, I think we should make a switch. I'm not sure about the on/off by default....

ghost commented 5 years ago

@lukele I'm going to modify the script so we can see exactly how this affects the report.

lukele commented 5 years ago

Could you share your cipher list once again, i.e. the output of the following script?

import ssl

from urllib3.util.ssl_ import create_urllib3_context

# Build the same context urllib3 would use and list its enabled cipher names
ctx = create_urllib3_context(ssl.PROTOCOL_SSLv23)
print([c['name'] for c in ctx.get_ciphers()])
ghost commented 5 years ago

I just used pprint.pprint instead of print. I think the difference in the signature extension is based on CPU. It would be nice if we could identify the CPUs that cause this problem as well.

Ciphers

```py
['TLS_AES_256_GCM_SHA384',
 'TLS_CHACHA20_POLY1305_SHA256',
 'TLS_AES_128_GCM_SHA256',
 'ECDHE-ECDSA-AES256-GCM-SHA384',
 'ECDHE-RSA-AES256-GCM-SHA384',
 'ECDHE-ECDSA-AES128-GCM-SHA256',
 'ECDHE-RSA-AES128-GCM-SHA256',
 'ECDHE-ECDSA-CHACHA20-POLY1305',
 'ECDHE-RSA-CHACHA20-POLY1305',
 'DHE-DSS-AES256-GCM-SHA384',
 'DHE-RSA-AES256-GCM-SHA384',
 'DHE-DSS-AES128-GCM-SHA256',
 'DHE-RSA-AES128-GCM-SHA256',
 'DHE-RSA-CHACHA20-POLY1305',
 'ECDHE-ECDSA-AES256-CCM8',
 'ECDHE-ECDSA-AES256-CCM',
 'ECDHE-ECDSA-AES256-SHA384',
 'ECDHE-RSA-AES256-SHA384',
 'ECDHE-ECDSA-AES256-SHA',
 'ECDHE-RSA-AES256-SHA',
 'DHE-RSA-AES256-CCM8',
 'DHE-RSA-AES256-CCM',
 'DHE-RSA-AES256-SHA256',
 'DHE-DSS-AES256-SHA256',
 'DHE-RSA-AES256-SHA',
 'DHE-DSS-AES256-SHA',
 'ECDHE-ECDSA-AES128-CCM8',
 'ECDHE-ECDSA-AES128-CCM',
 'ECDHE-ECDSA-AES128-SHA256',
 'ECDHE-RSA-AES128-SHA256',
 'ECDHE-ECDSA-AES128-SHA',
 'ECDHE-RSA-AES128-SHA',
 'DHE-RSA-AES128-CCM8',
 'DHE-RSA-AES128-CCM',
 'DHE-RSA-AES128-SHA256',
 'DHE-DSS-AES128-SHA256',
 'DHE-RSA-AES128-SHA',
 'DHE-DSS-AES128-SHA',
 'AES256-GCM-SHA384',
 'AES128-GCM-SHA256',
 'AES256-CCM8',
 'AES256-CCM',
 'AES128-CCM8',
 'AES128-CCM',
 'AES256-SHA256',
 'AES128-SHA256',
 'AES256-SHA',
 'AES128-SHA']
```
lukele commented 5 years ago

Interesting, for some reason it appears that my system is sending the SHA1 options and yours is not, even though the same cipher set is given.

ghost commented 5 years ago

I think openssl defaults are based on CPU, don't quote me though. :smiley:

lukele commented 5 years ago

Ah interesting, I have a Core i5 8th generation (Amber Lake-Y)

lukele commented 5 years ago

It looks like OpenSSL can be built without SHA1. That would explain it as well. Are you on a Linux system?

ghost commented 5 years ago

Yup, always on *nix. I never use anything else. I updated the script and it changed my cipher list: http://dpaste.com/3BHAPM7 I'm not sure what the diff is yet but the diff was only reported by the socket.

lukele commented 5 years ago

What are you seeing for

openssl version
openssl ciphers

Version should be the same (OpenSSL 1.1.1b 26 Feb 2019) but I reckon ciphers might be different
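Comparing two long `openssl ciphers` outputs by eye is error-prone, so a small helper can do it. This is a sketch only; `diff_cipher_strings` is a made-up name, and it assumes both inputs are colon-separated strings in the format that command prints.

```python
def diff_cipher_strings(mine, theirs):
    """Compare two colon-separated cipher strings, preserving order."""
    mine_list = [c for c in mine.strip().split(":") if c]
    theirs_list = [c for c in theirs.strip().split(":") if c]
    return {
        # Ciphers present on one side but missing on the other
        "only_mine": [c for c in mine_list if c not in theirs_list],
        "only_theirs": [c for c in theirs_list if c not in mine_list],
        # True only when both sides list the same ciphers in the same order
        "same_order": mine_list == theirs_list,
    }


if __name__ == "__main__":
    result = diff_cipher_strings(
        "AES256-SHA:AES128-SHA",
        "AES256-SHA:PSK-AES128-CBC-SHA",
    )
    print(result)
```

The `same_order` flag is reported separately because two machines can enable the same ciphers while still offering them in a different preference order.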

ghost commented 5 years ago
Ciphers TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:DHE-RSA-AES256-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:RSA-PSK-AES256-GCM-SHA384:DHE-PSK-AES256-GCM-SHA384:RSA-PSK-CHACHA20-POLY1305:DHE-PSK-CHACHA20-POLY1305:ECDHE-PSK-CHACHA20-POLY1305:AES256-GCM-SHA384:PSK-AES256-GCM-SHA384:PSK-CHACHA20-POLY1305:RSA-PSK-AES128-GCM-SHA256:DHE-PSK-AES128-GCM-SHA256:AES128-GCM-SHA256:PSK-AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:ECDHE-PSK-AES256-CBC-SHA384:ECDHE-PSK-AES256-CBC-SHA:SRP-RSA-AES-256-CBC-SHA:SRP-AES-256-CBC-SHA:RSA-PSK-AES256-CBC-SHA384:DHE-PSK-AES256-CBC-SHA384:RSA-PSK-AES256-CBC-SHA:DHE-PSK-AES256-CBC-SHA:AES256-SHA:PSK-AES256-CBC-SHA384:PSK-AES256-CBC-SHA:ECDHE-PSK-AES128-CBC-SHA256:ECDHE-PSK-AES128-CBC-SHA:SRP-RSA-AES-128-CBC-SHA:SRP-AES-128-CBC-SHA:RSA-PSK-AES128-CBC-SHA256:DHE-PSK-AES128-CBC-SHA256:RSA-PSK-AES128-CBC-SHA:DHE-PSK-AES128-CBC-SHA:AES128-SHA:PSK-AES128-CBC-SHA256:PSK-AES128-CBC-SHA
lukele commented 5 years ago

Ok, so ciphers are the same as mine.

The report also shows no difference, but disabling SHA1 with !SHA1 solves the captcha problem (for me). Certainly a workaround we should be able to live with, even if it modifies the cipher suite.

ghost commented 5 years ago

Alright, so the diff is that all of the SHA1 cipher suites are removed, but only as reported by the socket?

('ECDHE-ECDSA-AES256-SHA', 'TLSv1.0', 256)
('ECDHE-RSA-AES256-SHA', 'TLSv1.0', 256)
('DHE-RSA-AES256-SHA', 'SSLv3', 256)
('DHE-DSS-AES256-SHA', 'SSLv3', 256)
('ECDHE-ECDSA-AES128-SHA', 'TLSv1.0', 128)
('ECDHE-RSA-AES128-SHA', 'TLSv1.0', 128)
('DHE-RSA-AES128-SHA', 'SSLv3', 128)
('DHE-DSS-AES128-SHA', 'SSLv3', 128)
('AES256-SHA', 'SSLv3', 256)
('AES128-SHA', 'SSLv3', 128)

This doesn't affect the ciphers that are being shared when using openssl 1.1.1b so what to do? Remove only AES128-SHA or all SHA1?

ghost commented 5 years ago

@lukele The ssllabs report didn't change when using the updated script? Nvm, I failed to update it properly, give me one sec... lol

lukele commented 5 years ago

Only the most recent version includes the Shared ciphers as reported by the socket information, which the previous versions unfortunately didn't. Signature algorithms as reported by ssllabs are identical in all reports.

ghost commented 5 years ago

Okay, I fixed the script. Sorry about that. Here is my updated report: http://dpaste.com/0D2047F You can copy and paste my hashes into the script's known list if you want to confirm a match.

ghost commented 5 years ago

@lukele Actually !AES128-SHA will match multiple cipher suites but a few less than !SHA1. @Anorov Thoughts?
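Which suites each exclusion actually removes can be checked locally instead of guessed. The sketch below uses only the standard `ssl` module; the helper names and the `ALL` base string are assumptions chosen for illustration, and the result depends on the OpenSSL build that Python is linked against.

```python
import ssl


def enabled_ciphers(cipher_string):
    """Names of the ciphers OpenSSL enables for the given cipher string."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.set_ciphers(cipher_string)
    return [cipher["name"] for cipher in ctx.get_ciphers()]


def removed_by(exclusion, base="ALL"):
    """Ciphers that disappear from `base` when `exclusion` is appended."""
    before = enabled_ciphers(base)
    after = set(enabled_ciphers(base + ":" + exclusion))
    return [name for name in before if name not in after]


if __name__ == "__main__":
    print("!AES128-SHA removes:", removed_by("!AES128-SHA"))
    print("!SHA1 removes:", removed_by("!SHA1"))
```

Running this on both machines and comparing the two printed lists shows how much narrower one exclusion is than the other.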

lukele commented 5 years ago

I'd go with least invasive and only remove AES128-SHA at first. We can still add to the list later on. So basically such an adapter should do:

from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

import cfscrape

class CustomCiphers(HTTPAdapter):
    def __init__(self, ciphers_to_remove, *args, **kwargs):
        # Must be set first: HTTPAdapter.__init__ calls init_poolmanager
        self.ciphers_to_remove = ciphers_to_remove
        super(CustomCiphers, self).__init__(*args, **kwargs)

    def init_poolmanager(self, *args, **kwargs):
        ctx = create_urllib3_context()
        # Keep every enabled cipher except the ones explicitly excluded
        ciphers = [cipher['name'] for cipher in ctx.get_ciphers() if cipher['name'] not in self.ciphers_to_remove]
        ctx.set_ciphers(":".join(ciphers))
        super(CustomCiphers, self).init_poolmanager(*args, ssl_context=ctx, **kwargs)

scraper = cfscrape.create_scraper()
scraper.mount('https://', CustomCiphers(['AES128-SHA']))
lukele commented 5 years ago

This is my latest report: http://dpaste.com/1ST48R1.txt The only difference is still in the signature algorithms. I have added your hashes.

ghost commented 5 years ago

Well that's odd. Same hash: 402014b899136c3fed09cd745dc01355 :man_shrugging:

lukele commented 5 years ago

Ah ok, that hash wasn't included in your latest report. That is indeed odd...

ghost commented 5 years ago

My sigs hash didn't change. I was expecting yours to change though... Neither of them changed?

ghost commented 5 years ago

Do you want to send a PR? It might be a 100% fix.

lukele commented 5 years ago

Working on it at the moment. It might be overkill, but how about mounting the adapter on initialization, but unmounting it, once we have the cookie?

ghost commented 5 years ago

Sounds great to me!

lukele commented 5 years ago

Alright, great. I'll send you a request for feedback once I have the pull request up. Looking forward to seeing if it solves the captcha issue for @Anorov too.