If we update `scraper.headers` instead of replacing the attribute, we still get requests' default headers in place of our own. I think, since headers can be passed as a keyword argument or on a per-request basis, that the headers shouldn't be modified outside of the constructor by this helper function.

`scraper.headers.update(headers)` is another option. At least that way the user is explicitly overriding the defaults instead of this breaking silently.
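The difference between replacing the attribute and updating it in place can be sketched like this (header values are made up for illustration):

```python
from collections import OrderedDict

# The scraper's own default headers (illustrative values).
defaults = OrderedDict([
    ("User-Agent", "Mozilla/5.0"),
    ("Accept", "text/html"),
])
custom = {"Referer": "https://example.com/"}

# Replacing the attribute silently drops every default:
replaced = custom
assert "User-Agent" not in replaced

# Updating merges the user's headers over the defaults, so any
# override is explicit and the untouched defaults survive:
merged = OrderedDict(defaults)
merged.update(custom)
assert merged["User-Agent"] == "Mozilla/5.0"
assert merged["Referer"] == "https://example.com/"
```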
I was thinking something like this:
@Anorov @lukele What do you think?
@pro-src I really like this approach, since there's no need to further modify `__init__` to handle the custom headers passed in. I wonder what the developer's expectations would be were they to pass in `headers` as a kwarg to `create_scraper` and `sess` at the same time. Should `headers` from kwargs and `headers` from `sess` be merged?
It's a good point, but I think the alternative that I had in mind is worse, that alternative being not attempting to copy the headers from the session at all. I see this as avoiding a common pitfall that would result in CAPTCHAs for a lot of users, since copying headers from a previous session that actually works with Cloudflare would be very uncommon. I'm guessing that 90% of the time we'd be copying requests' default headers.
Yeah, it would be interesting to know how many people really pass in `sess`. I really don't see much of a benefit to it. So I'm for either dropping `sess`, or using this solution.
I think any solution here is going to require a minor version bump, but if you drop that argument entirely, you'll need a major version bump.
True. So let's stick with your solution for now.
Question though: what would be the point of having a `create_scraper` function without the `sess` argument? Just API sugar?
That was my understanding of it at least, but I guess I never questioned it, since it's the way it is used in the readme.
It seems to be a source of confusion for non-advanced users who have never seen the library's code. I know of at least a couple of cases where people thought they had to create a session using `requests` and pass it to `create_scraper` to make the scraper instance use a session. They fail to realize that the scraper instance *is* a session instance. I had to point that out to VeNoMouS lol.
Yeah, I remember VeNoMouS telling me you had to do that, and I didn't even understand what he meant at first. It's quite clearly mentioned in the README though:

> `CloudflareScraper` works identically to a Requests `Session` object.
I think adding docstrings and generating some docs for readthedocs.org would be a great improvement. There isn't all that much to comment on; we'd just want to link to the requests docs properly.
I've made up my mind on how I think this should be done. The headers should be merged in order to retain their ordering, and the `sess` argument should be incompatible with the other arguments. If other arguments such as `headers`/`cookies`/`params`/`data` are allowed alongside the `sess` argument, then they should be merged.
There is a helper function to aid in merging those attributes: https://github.com/kennethreitz/requests/blob/a79a63390bc963e5924021086744e53585566307/requests/sessions.py#L49-L77
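For reference, that helper behaves roughly like the standalone sketch below (a simplified reimplementation for illustration, not the actual requests code): request-level settings win over session-level ones, and a key explicitly set to `None` removes the session default.

```python
from collections import OrderedDict

def merge_setting(request_setting, session_setting, dict_class=OrderedDict):
    """Simplified take on the merge helper in requests' sessions.py."""
    if session_setting is None:
        return request_setting
    if request_setting is None:
        return session_setting
    # Session-level settings form the base; request-level ones override.
    merged = dict_class(session_setting)
    merged.update(request_setting)
    # Keys set to None remove the corresponding session default.
    for key in [k for k, v in merged.items() if v is None]:
        del merged[key]
    return merged

session_headers = OrderedDict([("User-Agent", "python-requests"), ("Accept", "*/*")])
result = merge_setting({"User-Agent": "my-scraper/1.0", "Accept": None}, session_headers)
print(result)  # the request-level UA wins, and Accept is dropped entirely
```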
But at least cookies would require special handling. I'm voting to disallow extra arguments with the `sess` argument. Whether or not to change the current behavior to make use of `Session.__attrs__` is a completely different issue; I don't plan on including that in a PR.
I think we can release this with a minor bump, since passing `sess` with keyword arguments doesn't currently make sense and didn't really work for 99% of use cases anyway.
Alternatively, we can simply keep everything the same and only address the headers, which might be the best possible option as of right now.
@Anorov @lukele

The headers copied from a normal `requests` `Session` instance are not ordered, i.e. `isinstance(session.headers, OrderedDict)` is `False`. If that session is passed to `cfscrape.create_scraper(sess=requests.Session())`, the scraper returned will not have its `headers` attribute defined properly, since it's overridden:

https://github.com/Anorov/cloudflare-scrape/blob/2ffeb22b78d64b2b8007a8521ac276b13c0ac306/cfscrape/__init__.py#L336-L358
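The failure mode can be demonstrated without cfscrape itself; the plain dict below stands in for a session's case-insensitive headers, and the `OrderedDict` for the scraper's browser-like defaults (all names and values illustrative):

```python
from collections import OrderedDict

# Stand-in for a plain requests.Session's headers.
plain_session_headers = {
    "User-Agent": "python-requests/2.x",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
}

# The scraper's deliberately ordered, browser-like defaults.
scraper_defaults = OrderedDict([
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("Accept", "text/html,application/xhtml+xml"),
    ("Accept-Encoding", "gzip, deflate"),
])

# What create_scraper(sess=...) effectively does today: the attribute
# is replaced wholesale, discarding the defaults and their ordering.
overridden = plain_session_headers
assert not isinstance(overridden, OrderedDict)

# The merge being proposed: update the ordered defaults in place, so
# the OrderedDict type and its key order survive the copy.
fixed = OrderedDict(scraper_defaults)
fixed.update(plain_session_headers)
assert isinstance(fixed, OrderedDict)
assert list(fixed) == ["User-Agent", "Accept", "Accept-Encoding"]
```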