edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
67 stars 12 forks source link

Remove `requests` from public interfaces #157

Open Mr0grog opened 11 months ago

Mr0grog commented 11 months ago

WaybackSession was originally a subclass of requests.Session, and that turns out to have been a poor decision (see https://github.com/edgi-govdata-archiving/wayback/pull/152#issuecomment-1858800671). We wanted to make customizing HTTP behavior relatively straightforward, but using a subclass-able version of requests.Session wound up baking problematic parts of requests into our API. In particular, requests is not thread-safe and this setup makes it hard to wrap with thread-safe code. We also unintentionally broke the promise of easy subclassing when we added complex retry and rate limiting behavior.

This is a major, breaking change designed to largely remove requests from our public API:

Open question: I don’t think that WaybackSession should continue to exist. It makes the API complex and it lets the details of requests leak out a little bit through exception handling. Any feedback on this is welcome (cc @danielballan). Some ideas:

  1. Move its functionality directly into WaybackHttpAdapter.

    • Upside: this gets all the requests stuff — including exception handling — tucked away into one place.
    • Downside: Safely customizing WaybackHttpAdapter gets more complex because you have to work around all that machinery. It’s not as straightforward as just overriding the .request() method. Some ways to address that:
      • Add a WaybackHttpAdapter._request() method (not the underscore) that you can override if you just want to customize the sending of a single HTTP request.
      • Re-architect that functionality into some kind of service/helper object that you can use from a customized WaybackHttpAdapter subclass. I think this might be a lot more work than it's worth, though.
  2. Move its functionality into WaybackClient.

    • Upside: keeps all the things that are special about dealing with HTTP requests at a high level out of the lower-level adapter tooling. Keeps adapters as simple as possible.
    • Downside: Lets some exception handling that is specific to requests leak out into the rest of the package. Makes it harder to change that behavior. What if httpx or some other tool incorporates rate limiting in the future? It would be nice to let an adapter delegate those tasks to the lower-level tool it is acting as glue for.

I think I’m leaning towards some version of (1) here.

This paves the way towards fixing #58, but does not actually solve the issue (WaybackHttpAdapter is not yet thread-safe, and needs to use some of the tricks from #23).