edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Remove requests dependency #152

Open Mr0grog opened 10 months ago

Mr0grog commented 10 months ago

🚧 Work in Progress! 🚧

The big goal of the upcoming release is thread safety, which means removing requests (it is explicitly not guaranteed to be thread-safe, and it doesn’t sound like the current maintainers ever plan to make that guarantee). See #58 for more details and discussion.

There’s a lot to clean up here. We have so many complicated workarounds for things that need unwinding! Requests does a lot of nice things that we now have to do ourselves! etc. I expect this to be messy and take a little bit; this branch will be failing tests pretty hard for now. It’s also the holidays and I will be traveling next week, so we’ll see what happens here.

My current approach here is:

  1. Remove requests and use urllib3 directly (requests is basically a wrapper around urllib3). This is going to mean adding a lot of little utilities and/or carefully balancing what we need to do for safety in our particular use case (requests does a whole lot of useful things that we will no longer have access to).

  2. Once that more-or-less works, briefly investigate switching over to httpx. Httpx is an entirely different stack, and therefore has totally different exceptions, edge cases, etc., so I am a bit worried about the safety concerns with switching directly to it.

  3. Decide whether to go forward with httpx now, or clean up our urllib3 usage and stick with that for this release. Another possibility here is merging this PR as urllib3 and then immediately opening another for httpx, and not cutting a release until both are done.

Fixes #58. Fixes #82. Fixes #106. Supersedes #23.

danielballan commented 10 months ago

Two httpx comments:

Given that you plan to move through urllib3, one option is to hold at urllib3 and evaluate httpx after 1.0. They seem to be close.

Mr0grog commented 10 months ago

Good to know! I've used it for a few little things, but not on anything big or long enough to have a sense of how (un)stable the API is.

Mr0grog commented 10 months ago

OK, this gets all the tests passing for urllib3 v2, but more messy work has to be done for v1, since we have to reach in and wrangle internals.

It is also horrifyingly messy and ugly. Random bits of code are stuffed all over the place, quite a lot is copied or cribbed from requests, and there are # XXX: comments all over the place for questions and other bits that must be resolved before merging. This has also gotten me thinking that the right approach is halfway between this and #23; more on this at bottom.

OK, so the big highlights here are:


Concerns about structure. I mentioned above that all this stuff implied to me that the right approach is something between this and what I was doing in #23. Basically, we have duplicated a tremendous amount of special sauce from requests so that we can get rid of it, but taking on the burden of maintaining all that doesn’t feel good. But the way I tried in #23 to magically make different sessions behind the scenes, while also trading weird internals between them, is not good either. Is there a way to address both of these? Maybe!

I noted above that WaybackSession inheriting from requests.Session is a real problem in our API, and that we probably need another layer of abstraction similar to requests.HTTPAdapter that just sends the HTTP request with whatever low-level library it wants. With that model, we could maybe have an adapter that does something like #23, adding all the right locking and so on to make reading from it safe. It’d still be hacky, but it would isolate the mess a little more cleanly than #23 does, would keep us from duplicating a lot of utilities and serialization logic and so on from requests, and would give us a clean upgrade path to httpx or any other HTTP library. That might be a bit space-brain, though. It is very likely more complicated and more work in practice than what I’m imagining in my head.
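To make the adapter idea above concrete, here is a minimal sketch using only the standard library. Every name in it (`Request`, `Response`, `Adapter`, `EchoAdapter`, `LockingAdapter`, `Session`) is hypothetical and is not part of wayback, requests, or #23. It only locks around `send()`; a real adapter would probably also have to guard reading the response body, which this sketch sidesteps by returning the whole body at once.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Request:
    method: str
    url: str
    headers: dict = field(default_factory=dict)


@dataclass
class Response:
    status: int
    headers: dict
    body: bytes


class Adapter:
    """Hypothetical base class: turns one Request into one Response."""

    def send(self, request: Request) -> Response:
        raise NotImplementedError


class EchoAdapter(Adapter):
    """Stand-in backend for the demo. A real adapter would call
    requests, urllib3, httpx, or some other HTTP library here."""

    def send(self, request: Request) -> Response:
        return Response(200, {}, f"{request.method} {request.url}".encode())


class LockingAdapter(Adapter):
    """Serializes access to a wrapped adapter that is not thread-safe."""

    def __init__(self, inner: Adapter):
        self._inner = inner
        self._lock = threading.Lock()

    def send(self, request: Request) -> Response:
        # Only one thread at a time may use the wrapped adapter.
        with self._lock:
            return self._inner.send(request)


class Session:
    """Hypothetical session that does not inherit from any HTTP
    library's session class; it only delegates to an adapter."""

    def __init__(self, adapter: Adapter):
        self.adapter = adapter

    def get(self, url: str) -> Response:
        return self.adapter.send(Request("GET", url))


session = Session(LockingAdapter(EchoAdapter()))
response = session.get("https://web.archive.org/")
```

Swapping HTTP libraries would then mean writing a new `Adapter` subclass, with the session and its public API left alone.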

Paging @danielballan in case you have any thoughts or feedback.

Mr0grog commented 10 months ago

Tests actually fully pass, yay. Now this just needs to iterate forward and clean things up while keeping them passing.

danielballan commented 10 months ago

This does feel like too much custom HTTP code for an application library.

In httpx, the Transport abstraction, sitting between the Client/Session abstraction and the connection pool, is the right place to put caching and rate limiting logic. (It also helps with testing, passing messages to an ASGI/WSGI server instead of a socket.) Recreating a light version of that Transport pattern here may possibly simplify things, and further lay track for a future refactor.

I wonder if this is a signal that going straight to httpx is the way to get thread safety without taking on too much maintenance burden. The breakages I have seen have been incidental and easy to fix—keyword argument name changes and comparable things.

Mr0grog commented 10 months ago

> In httpx, the Transport abstraction, sitting between the Client/Session abstraction and the connection pool, is the right place to put caching and rate limiting logic. (It also helps with testing, passing messages to an ASGI/WSGI server instead of a socket.) Recreating a light version of that Transport pattern here may possibly simplify things, and further lay track for a future refactor.

Yeah, I think my concern here is that because rate limiting is so important, we need to pull it out a level above the equivalent of httpx’s transport (or requests's transport, same idea there, really), which also means pulling out redirects and retries. So we need two levels? Or to put that logic in a common part of the transport, and tell you to only override some sort of implementation function on the base transport class, or whatever. Or all that stuff goes up into the client directly, instead of between the client and the transport.
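A rough sketch of the "everything goes up into the client" option, assuming a rate limiter and a retry loop that both live above the transport so that every attempt, retries included, counts against the limit. `RateLimiter`, `FlakyTransport`, and `Client` are hypothetical names for illustration, not wayback or httpx APIs, and redirect handling is left out.

```python
import threading
import time


class RateLimiter:
    """Thread-safe limiter allowing at most one call per `interval` seconds."""

    def __init__(self, interval: float):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_time = 0.0

    def wait(self) -> None:
        # Holding the lock while sleeping makes other threads queue up,
        # which is the behavior we want from a shared rate limit.
        with self._lock:
            now = time.monotonic()
            delay = self._next_time - now
            if delay > 0:
                time.sleep(delay)
                now = time.monotonic()
            self._next_time = now + self.interval


class FlakyTransport:
    """Stand-in transport: returns 503 twice, then 200."""

    def __init__(self):
        self.calls = 0

    def send(self, url: str) -> int:
        self.calls += 1
        return 503 if self.calls < 3 else 200


class Client:
    """Hypothetical client that owns rate limiting and retries,
    so a transport only has to send a single request."""

    def __init__(self, transport, limiter: RateLimiter, retries: int = 3):
        self.transport = transport
        self.limiter = limiter
        self.retries = retries

    def get(self, url: str) -> int:
        status = 0
        for _ in range(self.retries + 1):
            self.limiter.wait()  # every attempt is rate limited
            status = self.transport.send(url)
            if status < 500:
                break
        return status


transport = FlakyTransport()
client = Client(transport, RateLimiter(interval=0.01))
status = client.get("https://web.archive.org/")
```

With this split, a custom transport cannot accidentally bypass the rate limit, at the cost of the client reimplementing retries itself.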

But anyway, yeah, I think we are roughly on the same idea here.

> I wonder if this is a signal that going straight to httpx is the way to get thread safety without taking on too much maintenance burden. The breakages I have seen have been incidental and easy to fix—keyword argument name changes and comparable things.

Makes sense. I’m not excited about having a higher release cadence here just to account for that, but it is what it is. We are also in pre-1.0-for-way-too-long territory, so I might just be the pot calling the kettle black.


Mr0grog commented 10 months ago

Anyway, I’m going to try and do a slight reorg/cleanup here (move the mocking tooling into a module, move the HTTP stuff into a module) just so the changes are easier to see and mentally organize.

But then I will probably put this on pause and back up to make a separate PR that just moves to more of a Client → Session → Transport model (maybe Client and Session get combined?) and then we can choose whether to rebuild this on top of that, rebuild #23 on top of that, or implement httpx on top of that, since it seems like the best way to accomplish any of those things. 😩