Anorov / cloudflare-scrape

A Python module to bypass Cloudflare's anti-bot page.
MIT License
3.38k stars · 459 forks

Header order is lost when passing a session to create_scraper #247

Closed ghost closed 5 years ago

ghost commented 5 years ago

The headers copied from a normal requests Session instance are not ordered, i.e. isinstance(session.headers, OrderedDict) is False. If that session is passed to cfscrape.create_scraper(sess=requests.Session()), the returned scraper will not have its headers attribute defined properly, since it is overridden.

https://github.com/Anorov/cloudflare-scrape/blob/2ffeb22b78d64b2b8007a8521ac276b13c0ac306/cfscrape/__init__.py#L336-L358

ghost commented 5 years ago

If we update scraper.headers instead of replacing the attribute, we still get requests' default headers in place of our own. Since headers can be passed as a keyword argument or on a per-request basis, I think they shouldn't be modified outside of the constructor by this helper function.

scraper.headers.update(headers) is another option. At least that way the user is explicitly overriding the defaults instead of this breaking silently.
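To make the difference concrete, here is a minimal stdlib-only sketch (FakeScraper and its header values are illustrative stand-ins, not cfscrape's actual API) contrasting attribute replacement with an in-place update():

```python
from collections import OrderedDict

# Hypothetical stand-in for a scraper session whose default headers
# must keep a specific order (names are illustrative only).
class FakeScraper:
    def __init__(self):
        self.headers = OrderedDict([
            ("User-Agent", "scraper/1.0"),
            ("Accept", "*/*"),
            ("Accept-Encoding", "gzip"),
        ])

user_headers = {"User-Agent": "custom/2.0", "X-Token": "abc"}

# Replacing the attribute discards the carefully ordered defaults.
replaced = FakeScraper()
replaced.headers = user_headers
assert not isinstance(replaced.headers, OrderedDict)

# update() keeps the OrderedDict and its key order, only overriding
# the values the user explicitly supplied.
updated = FakeScraper()
updated.headers.update(user_headers)
assert isinstance(updated.headers, OrderedDict)
assert list(updated.headers) == ["User-Agent", "Accept", "Accept-Encoding", "X-Token"]
assert updated.headers["User-Agent"] == "custom/2.0"
```

With update(), a user override lands in the position the default already occupied, and only genuinely new keys are appended at the end.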

ghost commented 5 years ago

I was thinking something like this:

```py
@classmethod
def create_scraper(cls, sess=None, **kwargs):
    """
    Convenience function for creating a ready-to-go CloudflareScraper object.
    """
    if sess:
        if hasattr(sess, 'headers'):
            # Skip this if headers == requests default headers
            kwargs.setdefault('headers', sess.headers)

    scraper = cls(**kwargs)

    if sess:
        exclude = ('headers',)
        attrs = (x for x in scraper.__attrs__ if x not in exclude)

        for x in attrs:
            if hasattr(sess, x):
                setattr(scraper, x, getattr(sess, x))

    return scraper
```

@Anorov @lukele What do you think?

lukele commented 5 years ago

@pro-src I really like this approach, since there's no need to further modify __init__ to handle the custom headers passed in. I wonder what a developer's expectations would be if they passed headers as a kwarg to create_scraper and sess at the same time. Should headers from kwargs and headers from sess be merged?

ghost commented 5 years ago

It's a good point, but I think the alternative I had in mind is worse: not attempting to copy the headers from the session at all. I see this as avoiding a common pitfall that would result in CAPTCHAs for a lot of users, since copying headers from a previous session that actually works with Cloudflare would be very uncommon. I'm guessing that 90% of the time we'd be copying requests' default headers.

lukele commented 5 years ago

Yeah, it would be interesting to know how many people really pass in sess. I really don't see much of a benefit to it. So I'm for either dropping sess, or using this solution.

ghost commented 5 years ago

I think any solution here is going to require a minor version bump but if you drop that argument entirely, you'll need a major version bump.

lukele commented 5 years ago

True. So let's stick with your solution for now.

ghost commented 5 years ago

Question though, what would be the point of having a create_scraper function without the sess argument? Just API sugar?

lukele commented 5 years ago

That was my understanding of it at least, but I guess I never questioned it, since it's the way it is used in the readme.

ghost commented 5 years ago

It seems to be a source of confusion for non-advanced users who have never seen the library's code. I know of at least a couple of cases where people thought they had to create a session using requests and pass it to create_scraper to make the scraper instance use a session. They fail to realize that the scraper instance is a session instance. I had to point that out to VeNoMouS lol.

lukele commented 5 years ago

Yeah, I remember VeNoMouS telling me you had to do that, and I didn't even understand what he meant at first. It's quite clearly mentioned in the README though:

> CloudflareScraper works identically to a Requests Session object.

ghost commented 5 years ago

I think adding docstrings and generating some docs for readthedocs.org would be a great improvement. There isn't all that much to comment on; I'd just want to link to the requests docs properly.

ghost commented 5 years ago

I've made up my mind on how I think this should be done. The headers should be merged in order to retain their order, and the sess argument should be incompatible with other arguments. If other arguments such as headers/cookies/params/data are allowed alongside the sess argument, then they should be merged.

There is a helper function to aid in merging those attributes: https://github.com/kennethreitz/requests/blob/a79a63390bc963e5924021086744e53585566307/requests/sessions.py#L49-L77

But at least cookies would require special handling, so I'm voting to disallow extra arguments alongside the sess argument. Whether or not to change the current behavior to make use of Session.__attrs__ is a completely different issue. I don't plan on including that in a PR.
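For reference, a rough stdlib-only sketch of what that requests helper does (the real merge_setting lives at the link above; this simplification skips requests' to_key_val_list normalization, so it is not a drop-in replacement):

```python
from collections import OrderedDict
from collections.abc import Mapping

def merge_setting(request_setting, session_setting, dict_class=OrderedDict):
    """Sketch of requests' merge_setting: session values form the base,
    request values override them, and keys explicitly set to None in the
    request are removed from the merged result."""
    if session_setting is None:
        return request_setting
    if request_setting is None:
        return session_setting

    # Only merge when both sides are mappings; otherwise the
    # request-level setting simply wins.
    if not (isinstance(session_setting, Mapping) and isinstance(request_setting, Mapping)):
        return request_setting

    merged = dict_class(session_setting)
    merged.update(request_setting)

    # A None value means "delete this session default for this request".
    for key in [k for k, v in merged.items() if v is None]:
        del merged[key]
    return merged

session_headers = OrderedDict([("User-Agent", "scraper/1.0"), ("Accept", "*/*")])
request_headers = {"Accept": None, "X-Token": "abc"}
merged = merge_setting(request_headers, session_headers)
# "Accept" is dropped, "X-Token" is appended, order is otherwise preserved.
assert list(merged.items()) == [("User-Agent", "scraper/1.0"), ("X-Token", "abc")]
```

Note that ordinary key merging preserves order fine; it's the cookie jars (and their per-domain/per-path semantics) that this kind of mapping merge can't express, which is the special handling mentioned above.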

```py
from requests.utils import default_headers as requests_headers

@classmethod
def create_scraper(cls, sess=None, **kwargs):
    """
    Convenience function for creating a ready-to-go CloudflareScraper object.
    """
    if not sess:
        return cls(**kwargs)

    # sess is a named parameter, so any remaining kwargs are extras.
    if kwargs:
        raise ValueError('Passing arguments with "sess" isn\'t currently supported')

    scraper = cls()

    headers = getattr(sess, 'headers', None)
    if headers and headers != requests_headers():
        scraper.headers.update(headers)

    exclude = ('headers',)
    attrs = (x for x in scraper.__attrs__ if x not in exclude)

    for x in attrs:
        if hasattr(sess, x):
            setattr(scraper, x, getattr(sess, x))

    return scraper
```

I think we can release this with a minor bump since passing sess with keyword arguments doesn't currently make sense and didn't really work for 99% of use cases anyway.

Alternatively, we can simply keep everything the same and only address the headers which might be the best possible option as of right now.

@Anorov @lukele