Open Mr0grog opened 10 months ago
Two httpx comments:
Given that you plan to move through urllib3, one option is to hold at urllib3 and evaluate httpx after 1.0. They seem to be close.
Good to know! I've used it for a few little things, but not on anything big or long enough to have a sense of how (un)stable the API is.
OK, this gets all the tests passing for urllib3 v2, but more messy work has to be done for v1, since we have to reach in and wrangle internals.
It is also horrifyingly messy and ugly. Random bits of code are stuffed all over the place, quite a lot is copied or cribbed from requests, and there are # XXX:
comments all over the place for questions and other bits that must be resolved before merging. This has also gotten me thinking that the right approach is halfway between this and #23; more on this at bottom.
OK, so the big highlights here are:
WaybackSession
is now its own thing instead of a subclass of requests.Session
. I think we did this originally so someone could substitute in their own session with whatever weird proxy configuration, etc. they might want, but then we broke that capability with all the messy rate limiting and timing stuff we threw in there and then depended upon elsewhere. This change needed to happen even if we didnât drop requests.
request()
and send()
function, but thereâs no meaningful distinction anymore. They should probably be merged.requests.HTTPAdapter
, that just translates a request from our code down to whatever HTTP library we are using for transport. All the logic around rate limiting, retries, etc. stays up in WaybackSession
and the adapter/transport/whatever just has to convert a method + absolute url + body + timeout? into a request on the underlying HTTP library. (At this point, it might make sense to move all the stuff that remains part of WaybackSession
directly into WaybackClient
.)_http.py
module or something. Itâs way too much and way too messy for _client.py
, and itâs more meaningful to group it than to drop a huge pile of stuff in _utils.py
.I wound up making an InternalHttpResponse
class that is very similar to requests.Response
. I was hoping to avoid something like that, but realized early on that there was so much complicated stuff around reading/caching/not caching/draining/releasing the urllib3 response object that it needed to be there. Then it became a hook for other conveniences, and slowly accreted a lot of junk.
I couldnât find any mocking equivalent to requests-mock except urllib3-mock which hasnât been touched in 8 years. It sounds like itâs got some bugs, which is no surprise since urllib3 has changed a lot in that timeframe, including a major version update that totally changed a lot of internals. Instead, I made a really crappy copy of just the parts of requests-mock
that we use: https://github.com/edgi-govdata-archiving/wayback/blob/eeadae13aa14773b5c231cd50e9fb324fa47b8e3/wayback/tests/test_client.py#L226-L296 Itâs terrible, but it works. (Obviously it needs to move out of that module and into support.py
or something similar.)
Lots of complicated bits around exception handling that probably need careful testing. I think a bunch of these are covered by tests, but probably not all of them, as many are in fact hard to test for (under what weird conditions does a given HTTP library throw which weird, low-level exceptions from sockets and such?). I think I was fairly careful, but left a bunch of # XXX:
notes in places that I want to revisit with a closer eye.
There is, again, just a lot of of added code here, which adds lots of risk to this change. I expected that, but itâs still a lot when you look at it.
â ïž I discovered a spot where our current rate limiting is not correct that needs fixing, although I havenât addressed it yet in this PR. When search()
makes requests, it lets the HTTP library follow redirects. Since that happens on a lower level, redirects donât get throttled (luckily I donât think this should generally happen in practice for those particular requests, but still, should fix). This is not an issue for Mementos, where we have a whole mess of redirect logic because redirects on those just do not work in quite the way you expect normal web directs to.
Really this means we need to add some more logic in WaybackSession.send()
for following redirects, and we should never tell the HTTP library we are using (urllib3 or requests, or httpx or whatever) to follow a redirect.
Concerns about structure. So I mentioned above that all this stuff implied to me that the right approach is something between this and what I was doing in #23. Basically, we have duplicated a tremendous amount of special sauce from requests so that we can get rid of it, but taking on the burden of maintaining all that doesnât feel good. But the way I tried to magically make different sessions behind the scenes in #23, but also trade weird internals between them is also not good. Is there a way to address both of these? Maybe!
I noted above that WaybackSession
inheriting from requests.Session
is a real problem in our API, and that we probably need another layer of abstraction similar to requests.HTTPAdapter
that just sends the HTTP request with whatever low level library it wants. With that model, we could maybe have an adapter that does something like #23, adding all the right locking and so on to make reading from it safe. Itâd still be hacky, but would isolate the mess a little more cleanly than #23 does, would keep us from duplicating a lot of utilities and serialization logic and so on from requests, and gives us a clean upgrade path to httpx or any other HTTP library. That might be a bit space-brain, though. Very likely is more complicated and more work in practice than what Iâm imagining in my head.
Paging @danielballan in case you have any thoughts or feedback.
Tests actually fully pass, yay. Now it this just needs to iterate forward and clean things up while keeping them passing.
This does feel like too much custom HTTP code for an application library.
In httpx, the Transport abstraction, sitting between the Client/Session abstraction and the connection pool, is the right place to put caching and rate limiting logic. (It also helps with testing, passing messages to an ASGI/WSGI server instead of a socket.) Recreating a light version of that Transport pattern here may possibly simplify things, and further lay track for a future refactor.
I wonder if this is a signal that going straight to httpx is the way to get thread safety without taking on too much maintenance burden. The breakages I have seen have been incidental and easy to fixâkeyword argument name changes and comparable things.
In httpx, the Transport abstraction, sitting between the Client/Session abstraction and the connection pool, is the right place to put caching and rate limiting logic. (It also helps with testing, passing messages to an ASGI/WSGI server instead of a socket.) Recreating a light version of that Transport pattern here may possibly simplify things, and further lay track for a future refactor.
Yeah, I think my concern here is that because rate limiting is so important, we need to pull it out a level above the equivalent of httpxâs transport (or requests's transport, same idea there, really), which also means pulling out redirects and retries. So we need two levels? Or to put that logic in a common part of the transport, and tell you to only override some sort of implementation function on the base transport class, or whatever. Or all that stuff goes up into the client directly, instead of between the client and the transport.
But anyway, yeah, I think we are roughly on the same idea here.
I wonder if this is a signal that going straight to httpx is the way to get thread safety without taking on too much maintenance burden. The breakages I have seen have been incidental and easy to fixâkeyword argument name changes and comparable things.
Makes sense. Iâm not excited about having a higher release cadence here just to account for that, but it is what it is. We are also in pre-1.0-for-way-too-long territory, so I might just be the pot calling the kettle blackâŠ
Anyway, Iâm going to try and do a slight reorg/cleanup here (move the mocking tooling into a module, move the HTTP stuff into a module) just so the changes are easier to see and mentally organize.
But then I will probably put this on pause and back up to make a separate PR that just moves to more of a Client â Session â Transport model (maybe Client and Session get combined?) and then we can choose whether to rebuild this on top of that, rebuild #23 on top of that, or implement httpx on top of that, since it seems like the best way to accomplish any of those things. đ©
đ§ Work in Progress! đ§
The big goal of the upcoming release is thread safety, which means removing requests (it is explicitly not guaranteed to be thread-safe, and it doesnât sound like the current maintainers ever plan to make that guarantee). See #58 for more details and discussion here.
Thereâs a lot to clean up here. We have so many complicated workarounds for things that need unwinding! Requests does a lot of nice things that we now have to do ourselves! etc. I expect this to be messy and take a little bit; this branch will be failing tests pretty hard for now. Itâs also the holidays and I will be traveling next week, so weâll see what happens here.
My current approach here is:
Remove requests and use urllib3 directly (requests is basically a wrapper around urllib3). This is going to mean adding a lot of little utilities and/or carefully balancing what we need to do for safety in our particular use case (requests does a whole lot of useful things that we will no longer have access to).
Once that more-or-less works, briefly investigate switching over to httpx. Httpx is an entirely different stack, and therefore has totally different Exceptions, edge cases, etc. so I am bit worried about the safety concerns with switching directly to it.
Decide whether to go forward with Httpx now, or clean up our urllib3 usage and stick with that for this release. Another possibility here is merging this PR as urllib3 and then immediately opening another for httpx, and not cutting a release until both are done.
Fixes #58. Fixes #82. Fixes #106. Supersedes #23.