Anorov / cloudflare-scrape

A Python module to bypass Cloudflare's anti-bot page.
MIT License
3.34k stars 456 forks

Captcha issues #235

Closed Anorov closed 5 years ago

Anorov commented 5 years ago

The latest version of cfscrape should not encounter captchas, unless you're using Tor or another IP that Cloudflare has blacklisted. If you're getting a captcha error, first please run pip install -U cfscrape and try again. If you're still getting an error, please leave a comment.


Please put all captcha challenge-related issues here.

Please run the following to determine the OpenSSL version compiled with your Python binary and include the output in your comment:

$ python3 -c 'import ssl; print(ssl.OPENSSL_VERSION)'
OpenSSL 1.1.1b  26 Feb 2019

(Or python instead of python3 if running on Python 2.)
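If you want a bit more detail than the one-liner gives, a minimal sketch along these lines may help. The helper name `openssl_summary` is made up for illustration, and the 1.1.1 threshold reflects the point made later in this thread that TLSv1.3 first shipped in a stable OpenSSL release with 1.1.1.

```python
import ssl


def openssl_summary():
    """Return the linked OpenSSL version and whether TLS 1.3 is available."""
    return {
        "version": ssl.OPENSSL_VERSION,
        "version_info": ssl.OPENSSL_VERSION_INFO[:3],
        # HAS_TLSv1_3 was added in Python 3.7; fall back to False on older interpreters.
        "has_tls13": getattr(ssl, "HAS_TLSv1_3", False),
        # Tuple comparison: True when the linked OpenSSL is 1.1.1 or newer.
        "at_least_1_1_1": ssl.OPENSSL_VERSION_INFO >= (1, 1, 1),
    }


print(openssl_summary())
```

Including this output in a comment shows at a glance whether the interpreter can negotiate TLSv1.3 at all.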

lukele commented 5 years ago

@pro-src, @Anorov PR https://github.com/Anorov/cloudflare-scrape/pull/242 is up and should hopefully fix the captcha problem. Would love some feedback. Also if you have a better idea for the adapter name, I'm all ears.

ghost commented 5 years ago

I actually think a couple of TLS-related problems are causing the CAPTCHA: the absence of TLSv1.3 ciphers (i.e. old versions of openssl) and the current issue with some SHA1 ciphers. I think we should address both. Addressing the former increases security and avoids a CAPTCHA in some instances; the latter resolves a CAPTCHA in some instances even when using the latest openssl.

The problem here is that "some instances" is not very conclusive. We need more feedback. Also, it would be great if we could identify the individual problematic ciphers. So far we've been unable to do that in a way that makes sense.

Guessing

Hopefully, we can avoid guessing at the problem like was done here: https://github.com/VeNoMouS/cloudscraper/commit/6f84c4c99f5a21a1cf8d1e49fa2f5d41c337ad7f

> Add cipher suite back in but adjust so tls 1.2 doesn't use tls 1.3 ciphers.

The ciphers in that commit are TLSv1.2 ciphers and extensions. There is nothing TLSv1.3 specific about them. OpenSSL treats TLSv1.3 ciphers differently and thus they should be prefixed accordingly i.e. `TLS13-CHACHA20-POLY1305-SHA256`.

```py
if hasattr(ssl, 'PROTOCOL_TLSv1_3'):
    ciphers.insert(0, ['GREASE_3A', 'GREASE_6A', 'AES128-GCM-SHA256', 'AES256-GCM-SHA256', 'AES256-GCM-SHA384', 'CHACHA20-POLY1305-SHA256'])
```

Thus the change actually reads as *"adjust so tls 1.2 doesn't use tls 1.2 ciphers"*, which is clearly wrong. It is obvious from the docs as well: https://www.openssl.org/docs/man1.0.2/man1/ciphers.html#TLS-v1.2-cipher-suites

Until then, I think !SHA1 is our best candidate since it doesn't appear to cause any problems. I'll continue to look into this.

ghost commented 5 years ago

@lukele I do think that we've narrowed it down to https://github.com/Anorov/cloudflare-scrape/issues/235#issuecomment-491641027 between ourselves. I wonder if this could be as simple as some SSLv3 option.

Do you mind trying to remove these TLSv1.0 ciphers?

('ECDHE-ECDSA-AES256-SHA', 'TLSv1.0', 256)
('ECDHE-RSA-AES256-SHA', 'TLSv1.0', 256)
('ECDHE-ECDSA-AES128-SHA', 'TLSv1.0', 128)
('ECDHE-RSA-AES128-SHA', 'TLSv1.0', 128)

DEFAULT_CIPHERS += ':!ECDHE-ECDSA-AES256-SHA:!ECDHE-RSA-AES256-SHA:!ECDHE-ECDSA-AES128-SHA:!ECDHE-RSA-AES128-SHA'

I'll fiddle with it in a moment, as I'm not sure whether or not to include the TLS prefix.
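To check locally what such an exclusion string does, something like the following sketch could be used. It relies only on the standard `ssl` module; the `SUSPECT` list and the `enabled_ciphers` helper are names invented here, and OpenSSL's own `DEFAULT` list stands in for urllib3's `DEFAULT_CIPHERS`, so the exact ciphers printed will differ from a real cfscrape session.

```python
import ssl

# The four cipher names listed above.
SUSPECT = [
    "ECDHE-ECDSA-AES256-SHA",
    "ECDHE-RSA-AES256-SHA",
    "ECDHE-ECDSA-AES128-SHA",
    "ECDHE-RSA-AES128-SHA",
]


def enabled_ciphers(cipher_string):
    """Return (name, protocol, bits) for every cipher the string enables locally."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.set_ciphers(cipher_string)
    return [(c["name"], c["protocol"], c["strength_bits"]) for c in ctx.get_ciphers()]


# Build the same kind of exclusion suffix as above: "!NAME:!NAME:..."
exclusions = ":".join("!" + name for name in SUSPECT)

before = enabled_ciphers("DEFAULT")
after = enabled_ciphers("DEFAULT:" + exclusions)

print("before:", [c for c in before if c[0] in SUSPECT])
print("after: ", [c for c in after if c[0] in SUSPECT])
```

Whether any of the four show up in the "before" list depends on the local OpenSSL build; the "after" list should contain none of them.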

lukele commented 5 years ago

@pro-src Sure, will try. With what versions of python are you seeing the problem and what sites are you testing against?

The ciphers listed are also covered with :!SHA1, right? Or does that still include too many options?

ghost commented 5 years ago

I've only tested against my domain and https://ssllabs.com/ssltest/viewMyClient.html. The latter has the handshake error on python 2 and 3.

> The ciphers listed are also covered with :!SHA1, right? Or does that still include too many options?

Yes to both of those questions. Hopefully, we can be more specific without causing problems with SSLv3 but for right now, it works.

ghost commented 5 years ago

@lukele

The following works to remove the TLSv1.0 ciphers of recent discussion: DEFAULT_CIPHERS += ':!ECDHE+SHA' and is more specific. Easiest way to test that is to modify the report.py.

Here's mine: http://dpaste.com/1NJSY4B
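As a rough way to see how much narrower `!ECDHE+SHA` is than `!SHA1`, the two can be compared with the standard `ssl` module. This is only a sketch: `cipher_names` is a hypothetical helper, and OpenSSL's `DEFAULT` list is used in place of urllib3's `DEFAULT_CIPHERS`.

```python
import ssl


def cipher_names(cipher_string):
    """Return the set of cipher names that the given OpenSSL cipher string enables."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.set_ciphers(cipher_string)
    return {c["name"] for c in ctx.get_ciphers()}


default = cipher_names("DEFAULT")
without_sha1 = cipher_names("DEFAULT:!SHA1")
without_ecdhe_sha = cipher_names("DEFAULT:!ECDHE+SHA")

# Everything removed by the narrower rule should also be removed by the broader one.
print("removed by !SHA1:     ", sorted(default - without_sha1))
print("removed by !ECDHE+SHA:", sorted(default - without_ecdhe_sha))
```

The second list should be a subset of the first, since `ECDHE+SHA` only matches ciphers that use both ECDHE key exchange and a SHA1 MAC.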

lukele commented 5 years ago

Do you have a working proxy by any chance to test my changes? The problem is that we can't enable/disable the adapter on demand if we're relying on init_poolmanager and proxy_manager_for as these are invoked when the adapter is mounted.

Instead I've moved the logic into def get_connection(self, url, proxies=None):

def get_connection(self, url, proxies=None):
    # Method on the CaptchaProvokingCiphersRemover adapter.
    conn = super(CaptchaProvokingCiphersRemover, self).get_connection(url, proxies)
    if self.is_enabled:
        print("Insert custom SSL context")
        # Swap in the SSL context that omits the problematic ciphers.
        conn.conn_kw['ssl_context'] = self.context_without_problematic_ciphers()
    else:
        print("Use default SSL context.")

    return conn

It does work for non-proxied connections, but I have yet to test it with a proxy.
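For comparison, the mount-time variant mentioned above (overriding init_poolmanager and proxy_manager_for) might look roughly like the sketch below. This is not the code from the pull request: the class name `CipherFilterAdapter` and the `CIPHERS` string are hypothetical, and it assumes Python 3 with requests and urllib3 installed.

```python
from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

# Hypothetical cipher string; "DEFAULT" is OpenSSL's own default list.
CIPHERS = "DEFAULT:!ECDHE+SHA"


class CipherFilterAdapter(HTTPAdapter):
    """Hypothetical adapter that hands urllib3 an SSLContext with a custom cipher list."""

    def __init__(self, ciphers=CIPHERS, **kwargs):
        # Must be set first: HTTPAdapter.__init__ calls init_poolmanager().
        self.ciphers = ciphers
        super().__init__(**kwargs)

    def _make_context(self):
        return create_urllib3_context(ciphers=self.ciphers)

    def init_poolmanager(self, *args, **kwargs):
        # The context is fixed here, when the adapter is created.
        kwargs["ssl_context"] = self._make_context()
        return super().init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        # Same context for connections that go through a proxy.
        kwargs["ssl_context"] = self._make_context()
        return super().proxy_manager_for(*args, **kwargs)


session = Session()
session.mount("https://", CipherFilterAdapter())
```

Because the context is chosen when the pool manager is built, this form cannot be switched on and off per request, which is the limitation described above.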

lukele commented 5 years ago

Re. your report.py script: I just noticed that you modify the DEFAULT_CIPHERS list at the very beginning, and I'm now seeing "No CAPTCHA encountered under normal conditions.", which technically is not correct. Does the report still contain the important information, though?

lukele commented 5 years ago

This is my report: http://dpaste.com/22Y6P4Q

ghost commented 5 years ago

I find that to be an odd side effect; it should still contain the important information. If you look at the lines around L113 of https://gist.github.com/pro-src/17654ec3f949b0b17bd1a4aa1b4136b9 you'll see that the adapters aren't being mounted. I went ahead and modified it a tiny bit to avoid confusion.

ghost commented 5 years ago

@lukele No CAPTCHA when using DEFAULT_CIPHERS += ':!ECDHE+SHA'? Just to be clear on what was tested.

I sent you an email with instructions and proxy credentials. You can also use proxychains for a better test, just configure proxychains to use 127.0.0.1:1080 instead of the default.

brew install proxychains-ng
proxychains4 -q python testing.py

lukele commented 5 years ago

If you modify DEFAULT_CIPHERS directly, every ssl_context instance will be affected, due to the nature of Python variables.

lukele commented 5 years ago

The report was created using this version of report.py https://gist.github.com/pro-src/17654ec3f949b0b17bd1a4aa1b4136b9/4d7ba5c8593ef23ea6f4405cb855670d9d1a3d1d which modified the DEFAULT_CIPHERS directly. So even with no adapter in place, the SHA1 ciphers would have been eliminated (http://dpaste.com/22Y6P4Q). I just realized they were also created with a patched version of cfscrape, so never mind.

Newest report with your latest version non-patched cfscrape: http://dpaste.com/3EZ6VVE

ghost commented 5 years ago

Ah, I see, but DEFAULT_CIPHERS was never being modified directly, since strings in python are immutable and no assignment to `urllib3.util.ssl_` takes place.

import urllib3
from urllib3.util.ssl_ import DEFAULT_CIPHERS

DEFAULT_CIPHERS += 'foobar'
print(urllib3.util.ssl_.DEFAULT_CIPHERS == DEFAULT_CIPHERS) # prints False

Did it trigger a CAPTCHA? I probably should modify it to include those details in the saved report rather than just in the shell. Sorry about that.

lukele commented 5 years ago

My bad, you are absolutely correct. I confused it with the use of

urllib3.util.ssl_.DEFAULT_CIPHERS += 'foobar'

Yes, it did trigger a captcha
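The difference between the two forms can be reproduced without urllib3. In this sketch `fake_ssl_` is a hypothetical stand-in module created on the fly, and the cipher strings are placeholders.

```python
import types

# Hypothetical stand-in for urllib3.util.ssl_, so the example runs without urllib3.
fake_ssl_ = types.ModuleType("fake_ssl_")
fake_ssl_.DEFAULT_CIPHERS = "ECDH+AESGCM:ECDH+CHACHA20"

# Equivalent to: from fake_ssl_ import DEFAULT_CIPHERS
DEFAULT_CIPHERS = fake_ssl_.DEFAULT_CIPHERS

# += on a str builds a new string and rebinds only the local name.
DEFAULT_CIPHERS += ":!ECDHE+SHA"
local_only = fake_ssl_.DEFAULT_CIPHERS  # the module attribute is untouched

# Assigning through the module attribute changes what every later reader sees.
fake_ssl_.DEFAULT_CIPHERS += ":!ECDHE+SHA"
module_wide = fake_ssl_.DEFAULT_CIPHERS

print(local_only)
print(module_wide)
```

Only the second form affects code that later reads the attribute from the module, which is why a report script using the first form leaves the defaults alone.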

Shell output

```
Checking GET request for https://pro-src.com DEFAULT
Checking GET request for https://pro-src.com DEFAULT
Checking GET request for https://pro-src.com DEFAULT
Checking GET request for https://pro-src.com DEFAULT
Cloudflare responded with CAPTCHA under normal conditions
Checking to see which ciphers are shared as reported by https://howsmyssl.com
Nothing unique was reported by https://howsmyssl.com
Checking to see which ciphers are shared as reported by ssllabs
The shared ciphers reported by ssllabs are not unique.
The protocols details reported by ssllabs are not unique.
Unique signature algorithms were detected by ssllabs.
The named groups reported by ssllabs are not unique.
Checking GET request for https://pro-src.com TLSv1.1
Checking GET request for https://pro-src.com TLSv1.1
No CAPTCHA encountered when using TLSv1.1
Checking GET request for https://pro-src.com DEFAULT !SHA1
Checking GET request for https://pro-src.com DEFAULT !SHA1
No CAPTCHA encountered when using !SHA1
The report was saved locally as "report.md"
The report is valid Github flavored markdown, you may copy and paste it.
The dpaste link (Expires in 10 days): http://dpaste.com/3EZ6VVE
```

ghost commented 5 years ago

:thinking: But you modified it to replace !SHA1 with !ECDHE+SHA correct?

> Checking GET request for https://pro-src.com DEFAULT !SHA1
> Checking GET request for https://pro-src.com DEFAULT !SHA1
> No CAPTCHA encountered when using !SHA1

So it didn't trigger a CAPTCHA when using !ECDHE+SHA after all since !SHA1 really means !ECDHE+SHA in this case?

lukele commented 5 years ago

Ah no, my bad. This was based on your original version.

Following report is with !ECDHE+SHA

MODIFIED_CIPHERS = DEFAULT_CIPHERS + ':!ECDHE+SHA'

Report: http://dpaste.com/2Z2DRXK

report.py output

```
Checking GET request for https://pro-src.com DEFAULT
Checking GET request for https://pro-src.com DEFAULT
Cloudflare responded with CAPTCHA under normal conditions
Checking to see which ciphers are shared as reported by https://howsmyssl.com
Checking to see which ciphers are shared as reported by ssllabs
The protocols details reported by ssllabs are not unique.
Unique signature algorithms were detected by ssllabs.
The named groups reported by ssllabs are not unique.
Checking GET request for https://pro-src.com TLSv1.1
Checking GET request for https://pro-src.com TLSv1.1
No CAPTCHA encountered when using TLSv1.1
Checking GET request for https://pro-src.com DEFAULT !SHA1
Unique cipher list was shared with the server.
Checking GET request for https://pro-src.com DEFAULT !SHA1
No CAPTCHA encountered when using !SHA1
The report was saved locally as "report.md"
The report is valid Github flavored markdown, you may copy and paste it.
The dpaste link (Expires in 10 days): http://dpaste.com/2Z2DRXK
```

So that confirms it: no captcha with !ECDHE+SHA.

ghost commented 5 years ago

Cool, so this seems to be the most specific we've had it yet. Cloudflare doesn't seem to like the absence of TLSv1.3 in some cases and the inclusion of TLSv1.0 in others. Seems like it just prefers secure settings, which isn't a bad thing.

lukele commented 5 years ago

Hehe, yeah. Certainly a trade off we should be able to live with :) Considering that you never saw the captcha in the first place (or did you?), how did you figure out the "inclusion of TLSv1.0" part?

ghost commented 5 years ago

Yeah, I never received a CAPTCHA.

> how did you figure out the "inclusion of TLSv1.0" part?

!ECDHE+SHA is shorthand for removing the TLSv1.0 ciphers listed in this comment: https://github.com/Anorov/cloudflare-scrape/issues/235#issuecomment-491876966

lukele commented 5 years ago

Aah, the SSLv3 error, now I recall.

lukele commented 5 years ago

So, finally, the newest version is up. It even seems to work with openssl < 1.1.1 and python2.7.

Tested with the following versions:

Python 2.7.15
OpenSSL 1.0.2r  26 Feb 2019
Python 3.7.3
OpenSSL 1.1.1b  26 Feb 2019

ghost commented 5 years ago

This addresses all presently known issues and potentially prevents CAPTCHA with openssl <= 1.1.0.

from urllib3.util.ssl_ import DEFAULT_CIPHERS
import ssl

TLS13_CIPHERS = ":".join([
    "TLS13-AES-256-GCM-SHA384",
    "TLS13-CHACHA20-POLY1305-SHA256",
    "TLS13-AES-128-GCM-SHA256"           
])

# Adjust the defaults to match those of more recent openssl versions
if ssl.OPENSSL_VERSION_NUMBER < 0x10101000 and "TLS13" not in DEFAULT_CIPHERS:
    DEFAULT_CIPHERS = TLS13_CIPHERS + ":" + DEFAULT_CIPHERS

# This removes a few problematic TLSv1.0 ciphers
DEFAULT_CIPHERS += ":!ECDHE+SHA"

# This is how a user could disable it
import cfscrape
cfscrape.DEFAULT_CIPHERS = None

~~But I still have a couple of things that I want to look at before recommending this.~~ Recommended.

lukele commented 5 years ago

Hmm... what's the reasoning behind adding the TLSv1.3 ciphers? For older versions of urllib3? Do openssl versions < 1.1.1 support TLSv1.3 ciphers?

ghost commented 5 years ago

@lukele I believe so, just the default configuration improved. https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_cipher_list.html https://github.com/codemanki/cloudscraper/pull/212

lukele commented 5 years ago

Ah ok, just found this: https://wiki.openssl.org/index.php/TLS1.3

With my latest commits, a user could disable the custom ciphers using


from requests.adapters import HTTPAdapter
import cfscrape
scraper = cfscrape.create_scraper()
scraper.mount("https://", HTTPAdapter())

or we could make this configurable via create_scraper()

ghost commented 5 years ago

I'm currently -1 on adding keyword arguments for this purpose.

ghost commented 5 years ago

Actually, I still need to determine if the TLS13 prefix is understood only by openssl >= 1.1.1. If so, then the prefix should be (what I think is) the more universal way of specifying the same: TLS.

aka TLS13 prefix VS. TLS prefix.

lukastribus commented 5 years ago

I don't get what you are saying about TLSv1.3.

The API to set ciphers for TLSv1.3 is different (SSL_CTX_set_ciphersuites()) than TLS <= v1.2 (SSL_CTX_set_cipher_list()). By restricting TLS <= v1.2 ciphers via SSL_CTX_set_cipher_list(), there is no impact whatsoever on TLSv1.3, it will keep using OpenSSL defaults.

There are no TLS13 prefixed ciphers and there is no such thing as TLS13-AES-256-GCM-SHA384 in OpenSSL.

As for urllib3: https://github.com/urllib3/urllib3/blob/master/src/urllib3/util/ssl_.py#L94

> NOTE: TLS 1.3 cipher suites are managed through a different interface not exposed by CPython (yet!) and are enabled by default if they're available.
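One way to observe this from Python is to restrict the TLS <= v1.2 list and then look at what the context still reports. This is a minimal sketch using only the standard `ssl` module; the cipher string `ECDHE+AESGCM` is just an arbitrary narrow selection chosen for illustration.

```python
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
# Restrict the TLS <= 1.2 list to a narrow selection.
ctx.set_ciphers("ECDHE+AESGCM")

# Group the ciphers that remain enabled by the protocol version they belong to.
by_protocol = {}
for cipher in ctx.get_ciphers():
    by_protocol.setdefault(cipher["protocol"], []).append(cipher["name"])

for protocol, names in sorted(by_protocol.items()):
    print(protocol, names)
```

On a build linked against OpenSSL 1.1.1 or newer, the TLSv1.3 suites would normally still be listed alongside the restricted TLSv1.2 ones, under their `TLS_`-prefixed names.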

ghost commented 5 years ago

The prefix is taken from the DEFAULT_CIPHERS found in urllib3 and it seems that you get exactly what I'm saying as you seem to have just clarified exactly what I needed to determine. :smiley:

I didn't find that prefix in the openssl source either but since it ignores unknown entries in the cipher list control string, I haven't been exactly sure.

The question still kinda remains though. If cpython prefers that prefix, should we use it?

lukastribus commented 5 years ago

There is no point. Those bogus TLS13 ciphers have been removed from urllib3:

https://github.com/urllib3/urllib3/commit/1e9ab5aee042ff0158d0f443bc600ef3a2e7bf9a#diff-7c9a38cd64066636d0e73a2449a28640L86

ghost commented 5 years ago

I'd have to check the cpython code base to determine that. The point is to enable the use of TLSv1.3 in versions of openssl prior to v1.1.1. I don't think those ciphers were ever bogus. They may have become redundant with the latest version of openssl, but not bogus. It should be determined how the prefix affects or doesn't affect the usage of TLSv1.3 in openssl. Regardless, the TLS prefix is good to use here unless cpython doesn't handle the cipher list control string in the way I would assume. For example, Node.js handles the list and calls the appropriate function to handle TLSv1.3 ciphers or other TLS ciphers respectively. In other words, I think you're making too many assumptions.

lukastribus commented 5 years ago

Here's the cpython change:

https://github.com/python/cpython/commit/e8eb6cb7920ded66abc5d284319a8539bdc2bae3#diff-e144a9cacd10921d4dae0aeac0300a6fL3493

OpenSSL supports TLSv1.3 since 1.1.1, not before. The TLS13 prefixed ciphers are a relic of the OpenSSL development within the 1.1.1 development tree, before TLSv1.3 ever hit a stable OpenSSL release.

ghost commented 5 years ago

> OpenSSL supports TLSv1.3 since 1.1.1, not before. The TLS13 prefixed ciphers are a relic of the OpenSSL development within the 1.1.1 development tree, before TLSv1.3 ever hit a stable OpenSSL release.

That would make sense, do you mind sharing the source of that information?

lukastribus commented 5 years ago

See: https://bugs.python.org/issue33570 https://github.com/openssl/openssl/pull/5392

ghost commented 5 years ago

I get what you're saying but I've observed that adding the TLSv1.3 ciphers to the control string has some effect in versions prior to v1.1.1. The effect was observed in Node.js, with the ciphers added, a user no longer received a CAPTCHA. See https://github.com/codemanki/cloudscraper/issues/211

So unless the older openssl source has been reviewed, I want to say it's an assumption.

ghost commented 5 years ago

I've glanced over the links that you shared. I don't see how that proves that there is no support for TLSv1.3 prior to v1.1.1. I appreciate you sharing the information either way!

ghost commented 5 years ago

So I do believe that we've gone full circle and landed back at the original question. The openssl source will have to be checked (again) unless somebody can provide the information.

lukele commented 5 years ago

From https://wiki.openssl.org/index.php/TLS1.3

> The OpenSSL git master branch (and the 1.1.1-pre9 beta version) contain our development TLSv1.3 code which is based on the final version of RFC8446 and can be used for testing purposes (i.e. it is not for production use). Earlier beta versions of OpenSSL 1.1.1 implemented draft versions of the standard.

This at least sounds like it

lukastribus commented 5 years ago

I didn't realize that you meant the former part of my statement; I thought you were doubting the second part of it.

This is not an assumption.

Here are some links: https://www.openssl.org/blog/blog/2018/09/11/release111/

> The headline new feature is TLSv1.3.

https://wiki.openssl.org/index.php/TLS1.3

> The OpenSSL 1.1.1 release includes support for TLSv1.3. The release is binary and API compatible with OpenSSL 1.1.0. In theory, if your application supports OpenSSL 1.1.0, then all you need to do to upgrade is to drop in the new version of OpenSSL and you will automatically start being able to use TLSv1.3.

https://www.openssl.org/blog/blog/2017/05/04/tlsv1.3/

> The forthcoming OpenSSL 1.1.1 release will include support for TLSv1.3.

RFC8446 was released in August 2018. OpenSSL 1.1.0 was released in August 2016, 2 years before. In 2018 different TLSv1.3 drafts were in the wild. This is not something that would be backported to OpenSSL 1.1.0 stable.

ghost commented 5 years ago

There is an easier way besides googling to determine this... Just add the ciphers and test with openssl 1.1.0 using ssllabs...

lukastribus commented 5 years ago

There is no doubt here at all.

ghost commented 5 years ago

Okay, personally, I still want to work a few things out. I do have some doubt. If you read this issue, https://github.com/codemanki/cloudscraper/issues/211#issuecomment-488061663, maybe you'll better understand my point of view. I'll settle for an actual test or reviewing the codebase of openssl. Thanks for your contribution to this issue.

lukele commented 5 years ago

Since it appears that OpenSSL 1.1.0 did have some kind of partial TLS1.3 support, I wonder what's to be gained in specifically trying to enable TLS1.3 with such versions? Why not leave default ciphers/cipher suite as it is, and only remove the ones causing actual problems?

ghost commented 5 years ago

@lukastribus I have a question about this https://github.com/Anorov/cloudflare-scrape/issues/235#issuecomment-492020494

Since that change is only about a unit test, was that the only TLSv1.3 change in cpython?

ghost commented 5 years ago

> Since it appears that OpenSSL 1.1.0 did have some kind of partial TLS1.3 support, I wonder what's to be gained in specifically trying to enable TLS1.3 with such versions? Why not leave default ciphers/cipher suite as it is, and only remove the ones causing actual problems?

Avoiding the CAPTCHA in some cases when using openssl < 1.1.1

lukele commented 5 years ago

Do we know of any such case for cfscrape at the moment? Isn't it possible that, in Node's case, Cloudflare also checks for a specific cipher list which is known to be used in Node (as we briefly suspected for urllib3), and that's why a slight modification of the cipher list helps?

1.0.2r also didn't show me a captcha.

ghost commented 5 years ago

> Do we know of any such case for cfscrape at the moment?

Not particularly but then we haven't had anybody to test this yet.

As is known, I haven't been able to reproduce this at all. What do I think about this?

> This addresses all presently known issues and potentially prevents CAPTCHA with openssl <= 1.1.0.

Potentially prevents is all I've actually ever said about adding those ciphers. I do agree that there could be some other explanation. I think we need more feedback, specifically somebody to test with openssl prior to v1.1.1 who can normally reproduce the CAPTCHA.

ghost commented 5 years ago

> 1.0.2r also didn't show me a captcha.

When using https://github.com/Anorov/cloudflare-scrape/issues/235#issuecomment-491997405 or how?

lukele commented 5 years ago

Scratch that. Just realized I do see a captcha in python2 with openssl 1.0.2r. In addition, there's a typo in my current pull request so the ciphers are not really removed (they are removed despite the typo), yet I don't see a captcha on pro-src.com.