Anorov / cloudflare-scrape

A Python module to bypass Cloudflare's anti-bot page.
MIT License

Captcha issues #235

Closed Anorov closed 5 years ago

Anorov commented 5 years ago

The latest version of cfscrape should not encounter captchas, unless you're using Tor or another IP that Cloudflare has blacklisted. If you're getting a captcha error, first please run pip install -U cfscrape and try again. If you're still getting an error, please leave a comment.


Please put all captcha challenge-related issues here.

Please run the following to determine the OpenSSL version compiled with your Python binary and include the output in your comment:

$ python3 -c 'import ssl; print(ssl.OPENSSL_VERSION)'
OpenSSL 1.1.1b  26 Feb 2019

(Or python instead of python3 if running on Python 2.)
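If it helps when writing up a comment, the same check can be scripted. This is only a sketch: `openssl_summary` is a hypothetical helper name, and the 1.1.1 threshold is used because that is the OpenSSL release that introduced TLSv1.3, not because cfscrape documents it as a requirement.

```python
import ssl


def openssl_summary():
    """Collect details about the OpenSSL build linked to this Python binary."""
    version_info = ssl.OPENSSL_VERSION_INFO[:3]
    return {
        "version": ssl.OPENSSL_VERSION,
        "version_info": version_info,
        # HAS_TLSv1_3 is missing on older Python releases, so fall back to False.
        "has_tlsv1_3": getattr(ssl, "HAS_TLSv1_3", False),
        # Caveat: LibreSSL builds report their own numbering here, so this
        # comparison is only meaningful for genuine OpenSSL builds.
        "at_least_1_1_1": version_info >= (1, 1, 1),
    }


if __name__ == "__main__":
    for key, value in openssl_summary().items():
        print("{}: {}".format(key, value))
```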

ghost commented 5 years ago

@lukele I know you've done a lot of testing already, but do you mind trying with and without the TLSv1.3 ciphers? If there isn't any difference, then I think it's safe to drop them unless proven otherwise.

lukele commented 5 years ago

I was close to losing it here, since I remembered that I did have a version which worked with python2 and openssl <= 1.1.1.

:!ECDHE+SHA doesn't work
:!SHA1 doesn't work
:!AES128-SHA does work

So the following cipher suite modification works for me with Python 2.7.15/OpenSSL 1.0.2r and Python 3.7.3/OpenSSL 1.1.1b

:!ECDHE+SHA:!AES128-SHA
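To see what this suffix actually removes on a given machine, something like the following could be used. It is a sketch under assumptions: it appends the suffix to OpenSSL's `DEFAULT` list rather than to whatever list cfscrape or urllib3 configure, and `cipher_names` is a hypothetical helper. The output differs between OpenSSL builds, so no expected output is shown.

```python
import ssl

# The exclusions from the comment above.
SUFFIX = ":!ECDHE+SHA:!AES128-SHA"


def cipher_names(cipher_string):
    """Return the cipher names a client context enables for cipher_string."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.set_ciphers(cipher_string)
    return [cipher["name"] for cipher in ctx.get_ciphers()]


before = cipher_names("DEFAULT")
after = cipher_names("DEFAULT" + SUFFIX)

# Ciphers present in the plain DEFAULT list but dropped by the suffix.
removed = sorted(set(before) - set(after))
print("removed:", removed)
```

Requires Python 3.6+ for `SSLContext.get_ciphers()`.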

lukele commented 5 years ago

@pro-src This is the list without TLSv1.3, correct?

ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5:!ECDHE+SHA

ghost commented 5 years ago

Yup, I think so.

lukele commented 5 years ago

That list ends with a captcha. Adding !AES128-SHA fixes it again.

lukele commented 5 years ago

Since it appears that OpenSSL 1.1.0 did have some kind of partial TLS1.3 support

Wrong. There is absolutely no TLSv1.3 support, not even partial, in OpenSSL 1.1.0.

Yes, I believe I misunderstood this. The only reference I could find was: https://securityinaction.wordpress.com/2016/09/13/openssl-1-1-0-adds-partial-tls-1-3-support/

Regardless, in my limited testing, Python versions linked against OpenSSL < 1.1.1 seem to work now as well, by removing AES128-SHA.

ghost commented 5 years ago

@lukastribus

I don't know anything about nodejs, but it seems to me that you just configured a large cipher list that happened to contain TLSv1.3 ciphers, and because it works, you assumed the TLSv1.3 ciphers did the job. But there is more in that configuration than TLSv1.3 ciphers and you did not make the opposite test: the same list without TLSv1.3 ciphers.

Response

***You assumed that I assumed.*** The list is the default ciphers for the exact version of Node.js that it was tested with except the TLSv1.3 ciphers were appended to the bottom of the list. The test included a version check to ensure that the OP didn't use the wrong version of Node.js since he had two versions installed on his system. Node.js compiles with openssl so it was a very clear test.

https://nodejs.org/download/release/v10.8.0/node-v10.8.0-linux-x64.tar.gz

`require('tls').getCiphers().map(s => s.toUpperCase())`

Furthermore, I am still able to reproduce this entirely. I get a CAPTCHA without adding those ciphers on Node.js 10.8.0. i.e. If I remove the change:

`require('cloudscraper').defaultParams.agentOptions = {};`

Lastly, adding those ciphers causes no harm, not that I'm saying that we should add them at this point...

Update: Since being prompted to re-investigate, I've noticed that there is in fact a different explanation to be given that isn't obvious. I do not wish to discuss it here. It could be discussed at https://github.com/codemanki/cloudscraper/issues

I understand you guys have this impression, it's wrong. Here's my last attempt at convincing you:

Why? Did you not read this comment? https://github.com/Anorov/cloudflare-scrape/issues/235#issuecomment-492035224

I have the impression that you're stuck arguing about those ciphers. I don't believe that you've actually helped solve the current problem at all. Please focus on the issue at hand.

ghost commented 5 years ago

@lukele The PR looks great and is a minimal solution. :tada:

ghost commented 5 years ago

I'm still curious about that SSLv3 handshake error. The adapter can be toggled on and off so I'm not really worried about it, just curious.

lukele commented 5 years ago

Are you still seeing that error?

ghost commented 5 years ago

I just tested against ssllabs on python 2 and 3: <Response [200]> (No errors)

ghost commented 5 years ago

@lukele I updated the script, in case you want to generate a full report so we'll have something to reference. All of the hashes were removed.

Note that by normal conditions, it actually means your PR.

Here's mine for reference: https://gist.github.com/pro-src/c1c464e394e4d25b81b16a1991b72316

lukele commented 5 years ago

@pro-src should I run the script with python3 and python2?

ghost commented 5 years ago

Maybe. I want all the information that we can get but I imagine that the results will be the same so long as the same package versions are used. If you add the hashes from the first run, you'll save us from having to compare the results of the second.

lukele commented 5 years ago

Unfortunately python2 doesn't support SSLSocket.shared_ciphers() and I couldn't quickly find a way to get to that information.

This is my report from python3: https://gist.github.com/lukele/8d00ad380bb70e27dc43f8e7f3d57472

The hashes for "DEFAULT ciphers hash" and "Signature algorithms hash" don't match.

Also, why does your report have

https://pro-src.com TLSv1.1 ciphers hash: 51f39199142f9474451bf94a47f70a46

but my report has

https://pro-src.com DEFAULT :!ECDHE+SHA:!AES128-SHA ciphers hash: 59fac59572e9abb2cdf770e0debfdfa3

ghost commented 5 years ago

Thanks for generating the report.

There is only one way to check the ciphers when sending requests to https://pro-src.com, whereas there are two ways when requests are made to https://howsmyssl.com and https://www.ssllabs.com. The hashes are appended to the list of known hashes as soon as they're encountered to avoid duplicates, even when none are known initially. There are more requests listed in the report than cipher lists because some duplicates were filtered out.

Unique from previous requests for me: https://pro-src.com TLSv1.1 ciphers hash: 51f39199142f9474451bf94a47f70a46

Unique from previous requests for you: https://pro-src.com DEFAULT :!ECDHE+SHA:!AES128-SHA ciphers hash: 59fac59572e9abb2cdf770e0debfdfa3

Those two types of requests are the last two made by the script. I hope that makes sense.
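To make the de-duplication described above concrete, here is a rough sketch. It is not the actual report script: the helper names, the use of MD5 (guessed only from the 32-character hex digests in the reports), and the way the cipher names are joined before hashing are all assumptions.

```python
import hashlib


def ciphers_hash(names):
    """Hash an ordered list of cipher names into a short hex digest."""
    joined = ",".join(names)
    return hashlib.md5(joined.encode("ascii")).hexdigest()


def record(label, names, known_hashes, report):
    """Add an entry to the report only if this cipher list was not seen before.

    Returns True when the entry was new, False when it was filtered out.
    """
    digest = ciphers_hash(names)
    if digest in known_hashes:
        return False
    # Remember the hash right away so later requests with the same
    # cipher list are treated as duplicates.
    known_hashes.add(digest)
    report.append((label, digest))
    return True
```

With this shape, a request whose cipher list matches an earlier one never shows up with its own hash line, which is why two reports can end with differently labelled entries.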

lukele commented 5 years ago

Ah ok, yes. That explains it. Should I generate a report in python2 without the shared ciphers information, or does that render it useless?

ghost commented 5 years ago

If you modify the script, you can still generate a report without the get_ciphers method for howsmyssl and ssllabs.
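One way such a modification might look, as a sketch only: `shared_cipher_names` is a hypothetical helper, not part of the report script, and it simply skips the socket-level cipher information when the method is missing.

```python
def shared_cipher_names(sock):
    """Return the cipher names reported by the socket, or None if unavailable."""
    getter = getattr(sock, "shared_ciphers", None)
    if getter is None:
        # Python 2's SSLSocket has no shared_ciphers() method.
        return None
    shared = getter()
    if not shared:
        # shared_ciphers() can also return None, e.g. before a handshake.
        return None
    # Each entry is a (name, protocol version, secret bits) tuple.
    return [name for name, _protocol, _bits in shared]
```

The caller can then emit the howsmyssl and ssllabs sections as usual and leave out the socket-level hash whenever this returns None.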

Also, we can draw a conclusion from your most recent report. The ciphers that are being reported by the socket when using DEFAULT were not unique from the ones reported when using TLSv1.1. I'm guessing that's because of the openssl version and cfscrape from the PR.

lukele commented 5 years ago

Report for python2: https://gist.github.com/lukele/935f48080f56b8221863832debecd320

ghost commented 5 years ago

@lukele Why do you think that none of the hashes matched the python3 report?

Nvm, you're using a different version of openssl for python2. Good results. Now you've got me interested to see the results when using the same version of openssl. :smiley:

lukele commented 5 years ago

Yep, python2 is linked to openssl 1.0.2r. Sorry, I believed this was known from previous tests. I currently don't have a python2 version linked against openssl 1.1.1.

That's gonna take some time.

ghost commented 5 years ago

Yeh, I forgot. I think we have enough information for now if you don't want to bother with it.

ghost commented 5 years ago

My OpenSSL is OpenSSL 1.1.0i 14 Aug 2018. Please help me: how do I update OpenSSL to 1.1.1? I installed it from https://slproweb.com/products/Win32OpenSSL.html, but it doesn't work in Python :(

Anorov commented 5 years ago

Thanks for the contributions. I'm no longer getting captchas with the latest revision.

Why do you think it's better to not enable the adapter by default?

And not too important, but to get around the awkward naming, maybe we could remove some of the references to captchas in the code and just call it something like a CloudflareConformingCipherAdapter, or maybe CloudflareConformingAdapter? It'd be more generalized; Cloudflare's a pretty innovative and fast-growing company, so I wouldn't be surprised if they throw up hurdles in the future that don't involve captchas.

ghost commented 5 years ago

Why do you think it's better to not enable the adapter by default?

I'm +1 for turning it on by default. You never know whether or not somebody is gonna attempt to scrape some old or ill-configured server which demands the use of the removed TLSv1.0 ciphers. At one point there was an issue with SSLv3 ciphers being removed, but that's no longer an issue. So, having it enabled by default would actually be a security feature, although it potentially causes request failures where urllib3 would otherwise succeed. It's a highly unlikely scenario though. It's usually the other way around: the client (for example IE6) only has old ciphers and newer servers have dropped those insecure ciphers, so a handshake error occurs.

CloudflareConformingAdapter

Naming :thinking: I agree on all points but I still don't really like it much either. It's the word count and length that I keep raising an eyebrow at. How about one of these: `ConnectionAdapter`, `ContextAdapter`, `SessionAdapter`, `ClientAdapter`, `RequestAdapter`, `TransportAdapter`, `SecureAdapter` or `HTTPSAdapter`?

<BaseAdapter> -> <HTTPAdapter> -> <???>

lukele commented 5 years ago

Length doesn't bother me; I much prefer to have a name reflecting what it does. CipherCustomizationAdapter or simply CustomCiphersAdapter?

The idea of not turning it on by default was to only change the ciphers when absolutely necessary (only the initial request to Cloudflare would require it, I found in my testing) and stick to urllib3's/openssl's defaults otherwise.

I'm not against enabling it by default, however.

ghost commented 5 years ago

Naming :thinking: I think so long as the name generally describes what it does then having comments to elaborate on the portions of code that aren't clear will suffice. This being a library, if somebody begins to depend on the existence of the adapter class, there could be an issue with the name not matching functionality added at a later time. If you do rename it to match the new functionality, you'll have a breaking change. Although, it's not really an issue if it's not exposed in a public way.

https://www.python.org/dev/peps/pep-0008/#overriding-principle

Something that might be of help is to determine the usage vs. the implementation and what is the corresponding convention. Also, whether or not we really want this to be a public class. If not, we could move the code out of `__init__.py` and into a `cfscrape.py` in order to use `__init__.py` to control what is exposed.

I'm not opposed to any of the suggestions. ¯\_(ツ)_/¯

lukele commented 5 years ago

Since this is such a small class, there's no harm in calling it CustomCiphers; that also sounds reasonable in combination with .mount.
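For readers following the naming discussion, here is roughly what such an adapter looks like when used with `.mount`. This is a sketch, not the code from the PR: the class name is just one of the candidates mentioned above, `build_session` is a hypothetical helper, and basing the list on OpenSSL's `DEFAULT` plus the two exclusions is an assumption.

```python
import ssl

import requests
from requests.adapters import HTTPAdapter

# Assumed cipher string: OpenSSL defaults minus the ciphers found to
# trigger the captcha in the testing described in this thread.
CIPHERS = "DEFAULT:!ECDHE+SHA:!AES128-SHA"


class CustomCiphersAdapter(HTTPAdapter):
    """HTTPAdapter that hands urllib3 an SSLContext with a custom cipher list."""

    def __init__(self, ciphers=CIPHERS, **kwargs):
        # Build the context first: HTTPAdapter.__init__ calls init_poolmanager.
        self.ssl_context = ssl.create_default_context()
        self.ssl_context.set_ciphers(ciphers)
        super(CustomCiphersAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        kwargs["ssl_context"] = self.ssl_context
        return super(CustomCiphersAdapter, self).init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        kwargs["ssl_context"] = self.ssl_context
        return super(CustomCiphersAdapter, self).proxy_manager_for(*args, **kwargs)


def build_session():
    """Return a requests session that uses the adapter for all HTTPS URLs."""
    session = requests.Session()
    session.mount("https://", CustomCiphersAdapter())
    return session
```

Mounting only on `https://` leaves plain HTTP on the stock adapter. Mounting on a narrower prefix would be one way to apply the custom ciphers only where they are needed.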

Anorov commented 5 years ago

Ok, let's enable it by default. I can make the change if you guys want me to; otherwise please feel free to make the change yourself.

As for the name, I agree with all of the points above. How about just CloudflareAdapter? Or if you think there's very little chance anyone would ever muck around with it, I'm +1 for CustomCiphers or CipherOverride or something like that.