Open MarkusSintonen opened 4 months ago
@MarkusSintonen Thanks so much for your work on this. Could we isolate the benchmarking into a separate PR?
Added here https://github.com/encode/httpcore/pull/923
Great! Also, does this mean we no longer need the anyio dependency for httpx[asyncio] installation on Python>=3.11?
Yes, that's right, once we also bring back the native asyncio based network to _backends.
Actually I fixed the code now so that anyio is not required in Python<3.11 either.
I'll wait for you to create a PR from your commit so I can track and review it separately.
This PR highlights to me that the approach to shielding here is wrong. We're using cancellation shielding solely to deal with shielding connection closes. Rather than use shielding, what we actually ought to be doing is...
And then...
This is a bit fiddly to get right and needs some careful attention to detail.
@tomchristie do you mean the previous context manager based shielding also had issues, requiring additional reordering of closing handlers? If it didn't have issues, then we could still use the anyio CancelScope just for shielding? But that means the anyio dependency can't be fully removed.
Or yeah, did you mean the non-IO-related state management should be extracted out from the async closing when shielding with asyncio.shield? That's probably right.
@MarkusSintonen I've attempted to answer my intent here through implementation. See #927. (Currently draft, and approach would need to be applied also to the HTTP/2 implementation, and the closing behaviour in the connection pool)
Big thanks! Now I understand, so yeah, it was indeed about separating the sync state management from the async IO part. I.e. the stream's aclose is the cancellable part, and the state management, being separate from the async flow, cannot be cancelled.
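To make that idea concrete, here is a minimal sketch (my own illustration, not the actual #927 implementation; the class and attribute names are made up) of keeping the state transition synchronous so a cancelled aclose() cannot leave the connection looking reusable:

```python
class Connection:
    """Sketch only: synchronous state management separated from the async close."""

    def __init__(self, stream):
        self._stream = stream
        self._closed = False

    async def aclose(self) -> None:
        if self._closed:
            return
        # Plain synchronous state update: completes even if the caller is
        # cancelled at the await below, so the pool never sees a half-closed
        # connection as reusable.
        self._closed = True
        # The actual IO close is the only cancellable part.
        await self._stream.aclose()
```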
Would you mind testing my latest PR branch (https://github.com/agronholm/anyio/pull/761)? In my own testing, it about halved the performance overhead of anyio.Lock on asyncio, and was actually faster than asyncio's own native Lock class.
Nice, thank you! I can try sometime next week.
@agronholm I ran the tests and it is indeed better with the fix. But it seems there is still some overhead in the implementation when comparing to asyncio:
Baseline (Anyio 4.4.0):
Anyio PR:
Asyncio:
(The above tests were run with the httpcore optimizations from other pending PRs applied, so that issues in httpcore itself are not counted.)
Have you identified where this overhead is coming from? Can we eliminate synchronization as the cause after that PR?
If you run the benchmarks after replacing the locks and semaphores in the asyncio implementation with the corresponding AnyIO primitives, that should shed some light on the issue.
It seems to be the anyio.Lock. When I just replace it with asyncio.Lock the issue disappears (with all the rest still using the anyio variants). (HTTP/2 also utilizes the Semaphore, but I haven't tested that here.)
In the AnyIO variant, the acquisition of an unowned lock would be slower because this operation contains a yield point, unlike in the asyncio variant. Would you mind testing against the AnyIO lock after commenting out the checkpoint calls around lines 1690-1697 in backends/_asyncio.py?
Indeed, it is happening because of the await AsyncIOBackend.cancel_shielded_checkpoint() (at line 1694).
Yeah, so the problem here is that this yield point is needed to maintain the same semantics as with Trio. This is Trio's Lock.acquire():
```python
@enable_ki_protection
async def acquire(self) -> None:
    """Acquire the lock, blocking if necessary."""
    await trio.lowlevel.checkpoint_if_cancelled()
    try:
        self.acquire_nowait()
    except trio.WouldBlock:
        # NOTE: it's important that the contended acquire path is just
        # "_lot.park()", because that's how Condition.wait() acquires the
        # lock as well.
        await self._lot.park()
    else:
        await trio.lowlevel.cancel_shielded_checkpoint()
```
If there wasn't a yield point, then this could become a busy-wait loop:
```python
async def loop(lock):
    while True:
        async with lock:
            print("I'm holding the lock")
```
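For illustration, here is a small runnable example (my own, not from the thread) showing why this matters: with plain asyncio primitives an uncontended acquire never yields, so a tight acquire/release loop never gives other tasks a chance to run.

```python
import asyncio

async def main() -> None:
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0)

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0)  # let the ticker task start running

    lock = asyncio.Lock()
    before = ticks
    for _ in range(10_000):
        async with lock:  # uncontended acquire/release: no yield point on asyncio
            pass

    # With asyncio.Lock the ticker never advances during the loop (prints 0);
    # a Trio/AnyIO lock would yield on every acquisition instead.
    print(f"ticker advanced {ticks - before} times during 10,000 lock cycles")
    task.cancel()

asyncio.run(main())
```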
So this is needed for the CancelScope logic in AnyIO? What if nothing is using a CancelScope in the caller stack? Is the penalty avoidable if the user is not actually using CancelScopes?
No, the checkpoint_if_cancelled() call handles CancelScope support. The cancel_shielded_checkpoint() there prevents the busy-wait loop, which will happen if you run the above loop with asyncio primitives, as they weren't designed with the right principles.
To summarize, asyncio's incorrect design makes it faster here.
Ok I see, thanks for the explanation! So in essence anyio.Lock can be used to work around the issue (do Python core devs know about it? :D). httpcore doesn't utilize the _synchronization.AsyncLock in such ways, so it seems it's safe to replace the anyio.Lock with asyncio.Lock to not pay the penalty.
Only in the asyncio backend. In the AnyIO backend, you would need to use AnyIO primitives, lest you render the whole backend pointless.
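As an illustration of that split (the names and the detection mechanism here are my assumptions, not httpcore's actual code), the synchronization layer could pick the primitive based on which backend is running:

```python
import asyncio
import sniffio  # assumed here as the way to detect the running async library

class AsyncLock:
    """Sketch: native asyncio.Lock on asyncio, anyio.Lock under trio."""

    def __init__(self) -> None:
        self._lock = None  # created lazily, once we know which backend runs

    def _setup(self) -> None:
        if sniffio.current_async_library() == "trio":
            import anyio
            self._lock = anyio.Lock()
        else:
            self._lock = asyncio.Lock()

    async def __aenter__(self) -> "AsyncLock":
        if self._lock is None:
            self._setup()
        await self._lock.acquire()
        return self

    async def __aexit__(self, *exc_info) -> None:
        self._lock.release()
```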
So in essence anyio.Lock can be used to work around the issue (do Python core devs know about it? :D).
Asyncio has so many design issues (the ill-fated uncancellation mechanism being the latest example) that at this point it's beyond saving. Even GvR (who's the original author of asyncio) told me to my face that he'd prefer people start using something else.
What is the status of this PR? Should it be reviewed+merged, closed, or something else?
Perhaps we should see how https://github.com/agronholm/anyio/pull/761 affects the real-world performance before jumping the gun?
Also posting here: I reran the benchmarks with the above fix from AnyIO, using fast_acquire=True for Lock (and Semaphore) in httpcore. The results look pretty good with the AnyIO fix and fast_acquire=True usage.
First, results with all optimization PRs applied (so that other httpcore issues are not counted in).
With fast_acquire=False:
With fast_acquire=True:
Lastly, results without the other optimization PRs applied (how it would look against httpcore master, with its other issues).
With fast_acquire=False:
With fast_acquire=True:
Should we keep AnyIO but instead use the fast_acquire=True mode in _synchronization.py? Cc @tomchristie
FYI, AnyIO 4.5.0 is out now with the fast_acquire option.
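For reference, a minimal sketch of what opting into fast_acquire could look like in a lock wrapper (hypothetical class; httpcore's actual _synchronization code may differ):

```python
import anyio  # requires AnyIO >= 4.5.0 for the fast_acquire option

class AsyncLock:
    def __init__(self) -> None:
        # fast_acquire=True skips the scheduling checkpoint on uncontended
        # acquisition; only safe when the lock is never spun on in a loop
        # that has no other await points.
        self._lock = anyio.Lock(fast_acquire=True)

    async def __aenter__(self) -> "AsyncLock":
        await self._lock.acquire()
        return self

    async def __aexit__(self, *exc_info) -> None:
        self._lock.release()
```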
Should we keep AnyIO but instead use the fast_acquire=True mode in _synchronization.py?
Exercise caution. What does the pull request for that look like? That'll be a useful highlighter.
Here is the PR https://github.com/encode/httpcore/pull/953/files
I see a slight issue there, not among the changes but still: AnyIO 4.5.0 requires trio >= 0.26.1 in its trio extra, but those requirements limit it to < 0.26.0.
Summary
First PR in a series of optimizations to improve the performance of httpcore and even reach the performance level of aiohttp. Related discussion: https://github.com/encode/httpx/issues/3215#issuecomment-2157885121
Previously:
With PR:
Request latency is more stable and the overall duration of the benchmark improves by 3.5x.
The difference diminishes when server latency increases over 100ms or so.
Checklist