encode / httpx

A next generation HTTP client for Python. 🦋
https://www.python-httpx.org/
BSD 3-Clause "New" or "Revised" License

Community discussion #78

Closed tomchristie closed 5 years ago

tomchristie commented 5 years ago

Although I've not yet done much in the way of documentation or branding this package up, it's quickly narrowing in on a slice of functionality that's lacking in the Python HTTP landscape.

Specifically:

I've also some thoughts on allowing the Request and Response models to provide werkzeug-like interfaces, allowing them to be used either client-side or server-side. One of the killer-apps of the new async+HTTP/2 functionality is allowing high-throughput proxy and gateway services to be easily built in Python. Having a "requests"-like package that can also use the models on the server side is something I may want to explore once all the other functionality is sufficiently nailed down.

Since "requests" is an essential & necessary part of the Python ecosystem, and since this package is aiming to be the next step on from that, I think it's worth opening up a bit of community discussion here, even if it's early days.

I'd originally started out expecting httpcore to be a silent-partner dependency of any requests3 package, but it progressed fairly quickly from there into "actually I've got a good handle on this, I think I need to implement this all the way through". My biggest questions now are around what are going to be the most valuable ways to deliver this work to the community.

Ownership, Funding & Maintenance

Given how critical a requests-like HTTP client is to the Python ecosystem as a whole I'd be amenable to community discussions around ownership & funding options.

I guess that I need to start out by documenting & pitching this package in its own right, releasing it under the same banner and model as all the other Encode work, and then take things from there if and when it starts to gain any adoption.

I'm open to ideas from the urllib3 or requests teams, if there are alternatives that need to be explored early on.

Requests

The functionality that this package is homing in on meets the requirements for the proposed "Requests III". Perhaps there's something to be explored there, if the requests team is interested, and if we can find a good community-focused arrangement around funding & ownership.

urllib3

The urllib3 team obvs. have a vast stack of real-world usage expertise that'd be important for us to make use of. There are bits of work that urllib3 does, that httpcore likely needs to do, including fuzziness around how content decoding actually ends up looking on the real, messy web. Or, for example, pulling early responses before necessarily having fully sent the outgoing request.

Something else that could well be valuable would be implementing a urllib3 dispatch class alongside the existing h11/h2/async dispatch. Any urllib3 dispatch class would still be built on top of the underlying async structure, but would dispatch the urllib3 calls within a threadpool.

Doing so would allow a couple of useful things, such as being able to isolate behavioral differences between the two implementations, or perhaps allowing a more gradual switchover for critical services that need to take a cautious approach to upgrading to a new HTTP client implementation.
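As a rough illustration of the threadpool idea, here's a minimal sketch of an async dispatcher that hands each call to blocking code running in a thread pool. The names `blocking_send` and `ThreadedDispatcher` are hypothetical and are not part of httpcore or urllib3; `blocking_send` is only a stand-in for whatever blocking urllib3 call would actually be made.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_send(method, url):
    """Hypothetical stand-in for a blocking urllib3-style request call."""
    time.sleep(0.01)  # simulate blocking network I/O
    return {"status_code": 200, "method": method, "url": url}


class ThreadedDispatcher:
    """Hypothetical async dispatcher that delegates to blocking code in a thread pool."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    async def send(self, method, url):
        # The event loop stays free while the blocking call runs in a worker thread.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._pool, blocking_send, method, url)

    def close(self):
        self._pool.shutdown(wait=True)


async def main():
    dispatcher = ThreadedDispatcher()
    try:
        # Two requests run concurrently, each in its own worker thread.
        return await asyncio.gather(
            dispatcher.send("GET", "https://example.org/a"),
            dispatcher.send("GET", "https://example.org/b"),
        )
    finally:
        dispatcher.close()


responses = asyncio.run(main())
```

The calling code only ever sees the async interface, which is what would make it possible to swap the two implementations when isolating behavioral differences.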

Trio, Curio

I think httpcore as currently delivered makes it fairly easy to deliver a trio-based concurrency backend. It's unclear to me if supporting that in the package itself is a good balance, or if it would be more maintainable to ensure that the trio team have the interfaces they need, but that any implementation there would live within their ecosystem.

(I'd probably tend towards the latter case there.)

Twisted

I guess that an HTTP/2 client would probably be useful to the Twisted team. I don't really know enough about Twisted's style of concurrency API to take a call on if there's work here that could end up being valuable to them.

HTTP/3

It'll be worth us keeping an eye on https://github.com/aiortc/aioquic

Having a QUIC implementation isn't the only thing that we'd need in order to add HTTP/3 support, but it is a really big first step.

We currently have connect/reader/writer interfaces. If we added QUIC support then we'd want our protocol interfaces to additionally support operations like "give me a new stream", and "set the flow control", "set the priority level".

For standard TCP-based HTTP/2 connections, "give me a new stream" would always just return the existing reader/writer pair. For QUIC connections it'd return a new reader/writer pair for a protocol-level stream.
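A minimal sketch of what that "give me a new stream" operation might look like, assuming a hypothetical `new_stream()` method. None of these class names exist in httpcore or aioquic, and the reader/writer objects are placeholders rather than real transports.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Stream:
    """A reader/writer pair, tagged with an illustrative stream id."""
    stream_id: int
    reader: object
    writer: object


class TCPStreamProvider:
    """TCP-based connection: every request shares the one reader/writer pair."""

    def __init__(self, reader, writer):
        self._stream = Stream(0, reader, writer)

    def new_stream(self):
        # HTTP/2 stream framing happens at the HTTP level, above this pair.
        return self._stream


class QUICStreamProvider:
    """QUIC-based connection: each call opens a new protocol-level stream."""

    def __init__(self, open_pair):
        self._open_pair = open_pair  # hypothetical callable returning (reader, writer)
        self._ids = count()          # illustrative ids, not real QUIC stream ids

    def new_stream(self):
        reader, writer = self._open_pair()
        return Stream(next(self._ids), reader, writer)


tcp = TCPStreamProvider(reader="tcp-reader", writer="tcp-writer")
quic = QUICStreamProvider(open_pair=lambda: (object(), object()))

tcp_a, tcp_b = tcp.new_stream(), tcp.new_stream()
quic_a, quic_b = quic.new_stream(), quic.new_stream()
```

Flow control and priority would presumably hang off the same interface, but they're left out here to keep the sketch small.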

This is getting way ahead of ourselves, but I think we've probably got a good basis here to be able to later support HTTP/3.

One big blocker would probably be whatever HTTP-level changes are required between HTTP/2 and HTTP/3. The differences between QPACK and HPACK are one case here, but there are likely also differences given that the stream framing in HTTP/2 is at the HTTP-level, whereas the stream framing in HTTP/3 is at the transport-level.

It's unclear to me if these differences are sufficiently incremental that they could fall into the scope of a future hyper/h2 package or not, or what the division of responsibilities would look like.

One important point to draw out here is that the growing complexities from HTTP/1.1, to HTTP/2, to HTTP/3, mean that the Python community is absolutely going to need to tackle work in this space as a team effort - the layers in the stack need expertise in various differing areas.

Certificates

Right now we're using certifi for certificate checking. Christian Heimes has been doing some work in this space around accessing interfaces to the Operating System's certificate store. I might try to collar him at PyLondinium.

Any other feedback?

I'm aware that much of this might look like it's a bit premature, but the work is pretty progressed, even if I've not yet started focusing on any branding and documentation around it.

Are there other invested areas of the Python community that I'm not yet considering here?

Where are the urllib3, trio, requests, aiohttp teams heading in their own work in this space? Is there good scope for collaboration, and how do you think that could/should work?

What else am I missing?

tomchristie commented 5 years ago

Other scattered thoughts:

tiran commented 5 years ago

I opened https://bugs.python.org/issue37048 to track QUIC related changes to the SSL module.

njsmith commented 5 years ago

CC'ing some folks where I'm not sure if they've seen this or not: @pquentin @RatanShreshtha @nateprewitt @shazow @asvetlov @dstufft

@tomchristie: this is super cool, and thanks for starting the conversation.

I'll start by summarizing what's happening with the async-urllib3 work and what we've been thinking about there, so we can start figuring out how these different initiatives relate.

The async-urllib3 fork

For the last few years, me & @pquentin & @RatanShreshtha have been slowly working on adding async support to urllib3 (also incorporating some older work by @lukasa). The repo and issue tracker is here, and the basic approach is described here: https://github.com/urllib3/urllib3/issues/1323

What we've done so far

What's left to do

In general, my feeling is that the core HTTP functionality here is really solid. I think I heard @lukasa say once that it's easy to write 90% of an HTTP client; the last 10% is where all the work is. (I guess this is true of everything, but even more so for HTTP.) The async-urllib3 branch doubtless has exciting new bugs we haven't found yet, but overall this is not a quick proof of concept, it's a serious attempt at a production library that handles almost all the edge cases I know about, including things that urllib3 has only figured out within the last few months. It even handles early server responses (which is a known problem with classic urllib3, and required multiple iterations to figure out how to make it supportable across multiple networking backends). Though, we do still need to figure out what to do about header casing – https://github.com/python-hyper/h11/issues/31.

There are a bunch of minor things we need to do (e.g. docs, asyncio backend), and also two major ones:

urllib3 vs async-urllib3 vs httpcore vs requests vs request3 vs idek

OK so that's what we've been working on and what issues we've found. What about the larger strategy? First, just to lay out my general assumptions:

If it's at all possible, our goal should be to converge on a single implementation of the core code for making HTTP requests, that almost everyone uses (either directly or via wrappers, as requests currently is). HTTP clients have endless edge cases, so the more eyeballs we have on a single library, the more we can all benefit from each other's experiences. Right now in our urllib3 branch, the Trio-specific code is ~2% of the total library (not counting tests, contrib, etc.). It's ridiculous that we can't share the other ~98%.

Right now urllib3 is kinda that, except that it doesn't handle async, hence the proliferation of async libraries.

Unfortunately urllib3 can't add async without at least some backcompat breakage, because of all the exposed internals. (The public API exposes that it's using http.client under the hood, it has dict-like interfaces that need to become async, etc.) And urllib3 is stupendously widely used, so our new library to rule them all is going to need a different name, and be parallel-installable to let people migrate gradually.

I still have hope that we can switch requests over to a new async-capable backend without breaking the world. The requests API is much smaller, and if we could pull it off this would (a) save a lot of migration work for people around the world, and (b) make the overall migration go much faster – which in turn means the folk here will get to (eventually) waste less energy on maintaining the old LTS releases of everything. In my perfect world, there's no requests3 package because we don't need it.

I don't have a strong opinion on Python 2 support right now. It's obviously getting less important every day. But the last stragglers are going to be projects like pip and botocore, which need an HTTP client, and would really like to have access to async support. Maybe they'll be happy with using different clients on py2 and py3 (and in pip's case, vendoring multiple clients)? I'm assuming requests itself will need to support py2 for another year+, and if py2 support is the difference between being able to switch requests vs having to convince everyone to migrate off requests, then that might be enough to make py2 support worth it. I don't really want to keep caring about py2, but my overriding goal is to minimize the number of HTTP libraries we all have to support, and if py2 makes a difference there I'm willing to hold my nose and do it. ...Depending on how hard it is to support py2, which we don't know yet either.

I'm not super interested in ASGI/WSGI integration – it's a neat feature that people will like, but not my main focus (and Trio will have the ability to mock out the network itself for testing, so you don't necessarily need this kind of support inside individual libraries). I do wonder how you'll provide an async API to WSGI apps or a sync API to ASGI apps, though?

I think talking about HTTP/2 is kinda premature, honestly. I looked at httpcore/dispatch/http2.py, and AFAICT it doesn't support outgoing flow control or PING handling (both of which are protocol violations), and it doesn't support multiplexing (which pretty much makes HTTP/2 support useless). And fixing these will require some substantial architectural changes, because they require background tasks and shared state across multiple connections. Which in turn will make it significantly more complicated to support multiple concurrency backends, and means you need to somehow disable HTTP/2 entirely when running in sync mode... it's a lot of extra complexity. I think we should be strategizing on the shortest path to something shippable, and HTTP/2 is not on the critical path for that. We definitely want to get there eventually, and we need to keep an eye on it to make sure we don't do anything that rules it out, but we don't want to get people excited about something that we can't deliver yet...

(BTW, we might also want to think about websocket client support eventually too – with HTTP/2 you can have HTTP and WS traffic over a single connection.)

Anyway. Looking at httpcore, my overall impression is ... surprisingly complementary to the async-urllib3 work? The async-urllib3 stuff is really strong on low-level protocol stuff, but the public API has a decade of accumulated cruft. httpcore feels like it's a few years away from handling all the gnarly edge cases, but the overall API and structure seem way more thought-through. I wonder if there's any way to combine forces on that basis?

theacodes commented 5 years ago

I believe beyond technical details we should establish a new PSF Work Group for HTTP to better acquire resources and funding to pay all of you to solve this problem.

There is no reason why we should be unorganized or alone in this. A PSF Work Group would allow us to better leverage fiscal sponsorship, governance, and cross-maintenance of projects.

From a technical perspective, my ideal world would be:

  1. Leave urllib3 and requests more or less alone. Every change is a breaking change. They have far too many users. Use tidelift to pay maintainers indefinitely.
  2. Establish shared, sans-io libraries that we can all build upon for http, http2, and http3/quic.
  3. Work together towards a brand new python-https project using the best from all of our experiences and the resources granted by the work group.
  4. Urllib3's value is in its exhaustive test suite - mine what we can and move into shared libraries and into the python-https project.
  5. Request's value is in its UX - borrow what makes sense.

tomchristie commented 5 years ago

I believe beyond technical details we should establish a new PSF Work Group for HTTP to better acquire resources and funding to pay all of you to solve this problem.

Seems very reasonable, yup. I'm not personally blocked by funding, since Encode's model is proving sufficient for my time at the moment, but certainly in terms of maintenance and long-term I think it's super important. I don't really know how the working groups function, but it'd likely be to everyone's benefit that whoever's heading up the governance aspects shouldn't also be the primary lead maintainer - keeping a clear division of responsibilities there is really helpful on both sides.

Establish shared, sans-io libraries that we can all build upon for http, http2, and http3/quic.

100%. That's the right level of separation, and I'd be in favour of that even if we were only working with thread-concurrency wrappers on top, since the Sans-I/O model is just that much more clear and testable.

Urllib3's value is in its exhaustive test suite - mine what we can and move into shared libraries

Indeed.

Request's value is in its UX - borrow what makes sense.

My personal take would probably be to lean strongly towards the importance of API compatibility w/ requests. Not everywhere, but sufficiently so that teams oughta be able to switch over painlessly. I'd tend to think that the user expectations, brand, and ecosystem of requests would mean that a "requests" v3 or a "requests3" release would be a huge advantage, but I'm also okay with exploring a non-requests brand naming. Either way it's a conversation that we can defer any hard decisions on for the time being, until we've got something release-ready.

I still have hope that we can switch requests over to a new async-capable backend without breaking the world.

Same. The work in this package is aiming towards that. There's a few differences in places, such as:

I think talking about HTTP/2 is kinda premature, honestly.

Sure. Agree it's not on the critical path, though I'm more bullish than yourself on achieving it. The implementation does handle stream multiplexing, and the connection pool takes account of HTTP/2 vs. HTTP/1.1 connections accordingly, tho yes - no per-stream flow control / ping support yet etc. The existing http/2 module weighs in at only 150 lines, since h2 does all the heavy lifting. It'd also be trivial to support being able to configure which protocol versions a client instance should attempt to use, so we could eg. start with HTTP/2 turned off by default, but available as an option.

I'm not super interested in ASGI/WSGI integration ... I do wonder how you'll provide an async API to WSGI apps or a sync API to ASGI apps, though?

Sure. I'm finding it important for one thing because it's thrashing out some more underlying functionality that is a critical requirement - the ability to write either async or sync dispatch classes, and have the client be able to bridge to them seamlessly. I'm working this through at the moment, and believe I have it nailed, tho it's more involved than the initial pass which was just "we need a sync client and an async client".
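For anyone wondering what bridging in both directions could look like, here's a minimal sketch. It is not how this package actually does it; every class name and the `send()` signature below are hypothetical, and the sync-over-async direction assumes no event loop is already running in the calling thread.

```python
import asyncio


class EchoSyncDispatcher:
    """Hypothetical sync dispatcher (think: a WSGI app or blocking backend)."""

    def send(self, request):
        return f"sync:{request}"


class EchoAsyncDispatcher:
    """Hypothetical async dispatcher (think: an ASGI app or async backend)."""

    async def send(self, request):
        await asyncio.sleep(0)  # yield to the event loop once
        return f"async:{request}"


class ThreadedAsyncAdapter:
    """Expose a sync dispatcher through an async interface via a worker thread."""

    def __init__(self, sync_dispatcher):
        self._inner = sync_dispatcher

    async def send(self, request):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._inner.send, request)


class BlockingSyncAdapter:
    """Expose an async dispatcher through a sync interface."""

    def __init__(self, async_dispatcher):
        self._inner = async_dispatcher

    def send(self, request):
        # asyncio.run() raises RuntimeError if an event loop is already running,
        # so this only works when called from plain synchronous code.
        return asyncio.run(self._inner.send(request))


from_async = BlockingSyncAdapter(EchoAsyncDispatcher()).send("GET /")
from_sync = asyncio.run(ThreadedAsyncAdapter(EchoSyncDispatcher()).send("GET /"))
```

The awkward part in practice is the second adapter: spinning up a loop per call is wasteful, which is part of why this ends up more involved than "a sync client and an async client".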

Anyway. Looking at httpcore, my overall impression is ... surprisingly complementary to the async-urllib3 work? The async-urllib3 stuff is really strong on low-level protocol stuff, but the public API has a decade of accumulated cruft. httpcore feels like it's a few years away from handling all the gnarly edge cases, but the overall API and structure seem way more thought-through. I wonder if there's any way to combine forces on that basis?

Yup yup.

Most obvious potential points of collaboration from my POV would be:

Anyways, lots of great stuff here, thanks all.

tomchristie commented 5 years ago

Okay, so I think we're far enough along the road here that I think it's time to plant a stake in the ground and say "yeah, this is the direction we're going".

There's still various technical aspects to work on. In particular, stuff like:

I've taken on some of the awkward bits, such as the "early response handling", which as @njsmith noted is really quite fiddly. (For example, to get timeouts right, you want to try both reading and writing concurrently, but initially starting with only enforcing write timeouts, and later switching over to only enforcing read timeouts once you've either sent the entire request, or have started getting an early response.)
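A minimal sketch of that timeout switching, using asyncio. This is an illustration under simplifying assumptions rather than the code in this package: `exchange`, `send` and `receive` are hypothetical names, the write timeout is applied per chunk, and "started getting a response" is modelled as `receive()` returning.

```python
import asyncio


async def exchange(chunks, send, receive, write_timeout, read_timeout):
    """Return (response, early), where early means the response arrived
    before the request body was fully sent."""

    async def write_all():
        for chunk in chunks:
            # The write timeout applies to each individual send.
            await asyncio.wait_for(send(chunk), write_timeout)

    read_task = asyncio.create_task(receive())   # no read timeout enforced yet
    write_task = asyncio.create_task(write_all())
    try:
        # Phase 1: read and write run concurrently, only write timeouts apply.
        done, _ = await asyncio.wait(
            {read_task, write_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if read_task in done:
            # Early response: stop caring about the rest of the request body.
            return read_task.result(), not write_task.done()
        write_task.result()  # re-raise any write timeout or error
        # Phase 2: request fully sent, so switch to enforcing the read timeout.
        return await asyncio.wait_for(read_task, read_timeout), False
    finally:
        read_task.cancel()
        write_task.cancel()
        await asyncio.gather(read_task, write_task, return_exceptions=True)


async def demo():
    async def slow_send(chunk):
        await asyncio.sleep(0.2)

    async def fast_send(chunk):
        await asyncio.sleep(0)

    async def reject_early():
        await asyncio.sleep(0.01)
        return "413 Payload Too Large"

    async def respond_late():
        await asyncio.sleep(0.05)
        return "200 OK"

    early = await exchange([b"a", b"b"], slow_send, reject_early, 1.0, 1.0)
    normal = await exchange([b"a", b"b"], fast_send, respond_late, 1.0, 1.0)
    return early, normal


early, normal = asyncio.run(demo())
```

In the first call the server rejects the upload before the body has gone out, so the remaining writes are cancelled; in the second the body is sent first and the read timeout only starts counting from that point.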

I'm also still wary of everything we're trying to take on here. Supporting HTTP/2, HTTP/3, seamless async+sync, multiple concurrency backends is a fair chunk of extra complexity on top of what the existing requests+urllib3 needs to deal with. The only way I can see of mitigating that is by really making sure that we're taking this on as a community endeavour. I really like @theacodes' suggestion of an HTTP working group there.

I'm also not precluding the possibilities that we could also lean more on the urllib3 work, by working on either or both of a threaded urllib3 implementation, using the Dispatcher interface, or an async urllib3-bleach-spike implementation, using the AsyncDispatcher interface. That doesn't look strictly necessary at the moment, but it'd be pretty wonderful if we could have two completely alternate dispatch implementations available to us.

There's a bunch of conversations that'd need to happen around eg. what GitHub organisation the project should be on, domain naming, docs branding, etc. but I think the first blocking thing that I'd really like to see happen is for this work to adopt the requests3 package on PyPI, so we can start cutting releases against that. Obvs. that's a big jump for the requests team to make, given that this is an entirely from-scratch implementation, but I think it's surely got to be the best next step forward.

tomchristie commented 5 years ago

Have pinged @kenneth-reitz on this.

sethmlarson commented 5 years ago

Personally I think we should leave urllib3 to its long-term support maintenance state instead of trying to revive it into the spotlight via a next-gen HTTP client library. When requests depended on urllib3 for dispatch we were essentially (and actually) treated as internal code. Why not just house that complexity within the client library itself rather than shelling it out and dealing with packaging synchronization problems? It'll certainly take more planning and careful design choice.

tomchristie commented 5 years ago

@sethmlarson - Agreed. I guess what I really meant to say there was that it might be helpful at some point if we had an old-school urllib3 dispatcher available as a third party package or whatever, so that we can more easily isolate any behavioral differences when dealing with any gnarly edge-case-ish behaviors.

tomchristie commented 5 years ago

I've slung together a requests3 branch, which demos how our existing docs would look if the project did take over the mantle of Requests III.

Docs build is here: http://www.encode.io/requests3-demo/

I'm conflicted about it, but I think that encode taking over ownership and responsibility for delivering requests3 is likely the best community outcome we could get, given the issues with the Requests III fundraiser. I think that'd actually represent a positive step forward for everyone.

The project is already what you’d expect from a requests v3 release, and is API compatible most of the way through, with some documented exceptions, and a few bits of work outstanding.

I guess the proposal looks something like this:

As painful as it is, I think that'd probably also need to come alongside some kind of reasonable statement on the over-promise and under-delivery of the Requests III fundraiser. Failing to deliver in itself isn't exactly the main issue, but being unable to be open & transparent about it is.

If that's not something that we can agree on then we'll just need to push on with this project in its current naming, which is fine, and will work out in the long term, tho we should expect much more gradual adoption. (Also there's no realistic way that "requests3" is going to actually land under that scenario.)

I'm happy to take feedback on any of the above, so long as folks keep in mind that it's a loaded topic, and pretty emotionally draining for all concerned.

💚

tomchristie commented 5 years ago

On reflection I'm less sure now that trying to pursue "requests3" for the sake of continuity is necessarily the best option.

A fresh project under the umbrella of python-http would be a cleaner approach, tho if we go that route, then I think this project doesn't have the right name yet.

Also open to the question of "should this live under encode, or should it live under python-http?".

That's related to a couple of other questions - eg. are we expecting to move dependencies, such as h2, h11 into the python-http organization? What are our plans on funding approaches?

Might raise some of this over on https://github.com/python-http/python-http.org instead.

sethmlarson commented 5 years ago

Member repositories don't necessarily have to live within the organization though they definitely can. The name "https" is available to us instead of something ending in 3.