Remove requests dependency

Mr0grog commented 10 months ago

🚧 Work in Progress! 🚧

The big goal of the upcoming release is thread safety, which means removing requests (it is explicitly not guaranteed to be thread-safe, and it doesn’t sound like the current maintainers ever plan to make that guarantee). See #58 for more details and discussion here.

There’s a lot to clean up here. We have so many complicated workarounds for things that need unwinding! Requests does a lot of nice things that we now have to do ourselves! etc. I expect this to be messy and take a little bit; this branch will be failing tests pretty hard for now. It’s also the holidays and I will be traveling next week, so we’ll see what happens here.

My current approach here is:

Remove requests and use urllib3 directly (requests is basically a wrapper around urllib3). This is going to mean adding a lot of little utilities and/or carefully balancing what we need to do for safety in our particular use case (requests does a whole lot of useful things that we will no longer have access to).
Once that more-or-less works, briefly investigate switching over to httpx. Httpx is an entirely different stack, and therefore has totally different Exceptions, edge cases, etc. so I am bit worried about the safety concerns with switching directly to it.
Decide whether to go forward with Httpx now, or clean up our urllib3 usage and stick with that for this release. Another possibility here is merging this PR as urllib3 and then immediately opening another for httpx, and not cutting a release until both are done.

Fixes #58. Fixes #82. Fixes #106. Supersedes #23.

danielballan commented 10 months ago

Two httpx comments:

Three years into working with httpx in my primary work project, I like it.
They are pre-1.0, and they do still break things.

Given that you plan to move through urllib3, one option is to hold at urllib3 and evaluate httpx after 1.0. They seem to be close.

Mr0grog commented 10 months ago

Good to know! I've used it for a few little things, but not on anything big or long enough to have a sense of how (un)stable the API is.

Mr0grog commented 10 months ago

OK, this gets all the tests passing for urllib3 v2, but more messy work has to be done for v1, since we have to reach in and wrangle internals.

It is also horrifyingly messy and ugly. Random bits of code are stuffed all over the place, quite a lot is copied or cribbed from requests, and there are # XXX: comments all over the place for questions and other bits that must be resolved before merging. This has also gotten me thinking that the right approach is halfway between this and #23; more on this at bottom.

OK, so the big highlights here are:

WaybackSession is now its own thing instead of a subclass of requests.Session. I think we did this originally so someone could substitute in their own session with whatever weird proxy configuration, etc. they might want, but then we broke that capability with all the messy rate limiting and timing stuff we threw in there and then depended upon elsewhere. This change needed to happen even if we didn’t drop requests.
- There is still both a request() and send() function, but there’s no meaningful distinction anymore. They should probably be merged.
- For future changes, e.g. switching to httpx, we probably need another layer of abstraction, like an requests.HTTPAdapter, that just translates a request from our code down to whatever HTTP library we are using for transport. All the logic around rate limiting, retries, etc. stays up in WaybackSession and the adapter/transport/whatever just has to convert a method + absolute url + body + timeout? into a request on the underlying HTTP library. (At this point, it might make sense to move all the stuff that remains part of WaybackSession directly into WaybackClient.)
- All the stuff in here probably needs to move into a separate _http.py module or something. It’s way too much and way too messy for _client.py, and it’s more meaningful to group it than to drop a huge pile of stuff in _utils.py.
I wound up making an InternalHttpResponse class that is very similar to requests.Response. I was hoping to avoid something like that, but realized early on that there was so much complicated stuff around reading/caching/not caching/draining/releasing the urllib3 response object that it needed to be there. Then it became a hook for other conveniences, and slowly accreted a lot of junk.
- This class is very un-threadsafe. It needs a bunch of locking added.
- There are probably a few convenience things we can back out of here, and maybe some things where I did caching of various dynamic properties that don’t really need to be cached.
I couldn’t find any mocking equivalent to requests-mock except urllib3-mock which hasn’t been touched in 8 years. It sounds like it’s got some bugs, which is no surprise since urllib3 has changed a lot in that timeframe, including a major version update that totally changed a lot of internals. Instead, I made a really crappy copy of just the parts of requests-mock that we use: https://github.com/edgi-govdata-archiving/wayback/blob/eeadae13aa14773b5c231cd50e9fb324fa47b8e3/wayback/tests/test_client.py#L226-L296 It’s terrible, but it works. (Obviously it needs to move out of that module and into support.py or something similar.)
Lots of complicated bits around exception handling that probably need careful testing. I think a bunch of these are covered by tests, but probably not all of them, as many are in fact hard to test for (under what weird conditions does a given HTTP library throw which weird, low-level exceptions from sockets and such?). I think I was fairly careful, but left a bunch of # XXX: notes in places that I want to revisit with a closer eye.
There is, again, just a lot of of added code here, which adds lots of risk to this change. I expected that, but it’s still a lot when you look at it.
⚠️ I discovered a spot where our current rate limiting is not correct that needs fixing, although I haven’t addressed it yet in this PR. When search() makes requests, it lets the HTTP library follow redirects. Since that happens on a lower level, redirects don’t get throttled (luckily I don’t think this should generally happen in practice for those particular requests, but still, should fix). This is not an issue for Mementos, where we have a whole mess of redirect logic because redirects on those just do not work in quite the way you expect normal web directs to.

Really this means we need to add some more logic in WaybackSession.send() for following redirects, and we should never tell the HTTP library we are using (urllib3 or requests, or httpx or whatever) to follow a redirect.

Concerns about structure. So I mentioned above that all this stuff implied to me that the right approach is something between this and what I was doing in #23. Basically, we have duplicated a tremendous amount of special sauce from requests so that we can get rid of it, but taking on the burden of maintaining all that doesn’t feel good. But the way I tried to magically make different sessions behind the scenes in #23, but also trade weird internals between them is also not good. Is there a way to address both of these? Maybe!

I noted above that WaybackSession inheriting from requests.Session is a real problem in our API, and that we probably need another layer of abstraction similar to requests.HTTPAdapter that just sends the HTTP request with whatever low level library it wants. With that model, we could maybe have an adapter that does something like #23, adding all the right locking and so on to make reading from it safe. It’d still be hacky, but would isolate the mess a little more cleanly than #23 does, would keep us from duplicating a lot of utilities and serialization logic and so on from requests, and gives us a clean upgrade path to httpx or any other HTTP library. That might be a bit space-brain, though. Very likely is more complicated and more work in practice than what I’m imagining in my head.

Paging @danielballan in case you have any thoughts or feedback.

Mr0grog commented 10 months ago

Tests actually fully pass, yay. Now it this just needs to iterate forward and clean things up while keeping them passing.

danielballan commented 10 months ago

This does feel like too much custom HTTP code for an application library.

In httpx, the Transport abstraction, sitting between the Client/Session abstraction and the connection pool, is the right place to put caching and rate limiting logic. (It also helps with testing, passing messages to an ASGI/WSGI server instead of a socket.) Recreating a light version of that Transport pattern here may possibly simplify things, and further lay track for a future refactor.

I wonder if this is a signal that going straight to httpx is the way to get thread safety without taking on too much maintenance burden. The breakages I have seen have been incidental and easy to fix—keyword argument name changes and comparable things.

Mr0grog commented 10 months ago

In httpx, the Transport abstraction, sitting between the Client/Session abstraction and the connection pool, is the right place to put caching and rate limiting logic. (It also helps with testing, passing messages to an ASGI/WSGI server instead of a socket.) Recreating a light version of that Transport pattern here may possibly simplify things, and further lay track for a future refactor.

Yeah, I think my concern here is that because rate limiting is so important, we need to pull it out a level above the equivalent of httpx’s transport (or requests's transport, same idea there, really), which also means pulling out redirects and retries. So we need two levels? Or to put that logic in a common part of the transport, and tell you to only override some sort of implementation function on the base transport class, or whatever. Or all that stuff goes up into the client directly, instead of between the client and the transport.

But anyway, yeah, I think we are roughly on the same idea here.

I wonder if this is a signal that going straight to httpx is the way to get thread safety without taking on too much maintenance burden. The breakages I have seen have been incidental and easy to fix—keyword argument name changes and comparable things.

Makes sense. I’m not excited about having a higher release cadence here just to account for that, but it is what it is. We are also in pre-1.0-for-way-too-long territory, so I might just be the pot calling the kettle black…

Mr0grog commented 10 months ago

Anyway, I’m going to try and do a slight reorg/cleanup here (move the mocking tooling into a module, move the HTTP stuff into a module) just so the changes are easier to see and mentally organize.

But then I will probably put this on pause and back up to make a separate PR that just moves to more of a Client → Session → Transport model (maybe Client and Session get combined?) and then we can choose whether to rebuild this on top of that, rebuild #23 on top of that, or implement httpx on top of that, since it seems like the best way to accomplish any of those things. 😩

edgi-govdata-archiving / wayback

Remove requests dependency #152