Open MarkusSintonen opened 4 months ago
@MarkusSintonen Thanks so much for your work on this. Could we isolate the benchmarking into a separate PR?
Added here https://github.com/encode/httpcore/pull/923
Great! Also, does this mean we no longer need the anyio dependency for httpx[asyncio] installation on Python>=3.11?
Yes, that's right, once we also bring back the native asyncio based network to _backends.
Actually I fixed the code now so that anyio is not required in Python<3.11 either.
I'll wait for you to create a PR from your commit so I can track and review it separately.
This PR highlights to me that the approach to shielding here is wrong. We're using cancellation shielding solely to deal with shielding connection closes. Rather than use shielding, what we actually ought to be doing is...
And then...
This is a bit fiddly to get right and needs some careful attention to detail.
@tomchristie do you mean the previous context manager based shielding also had issues, requiring additional reordering of closing handlers? If it didn't have issues, then we could still use the anyio CancelScope just for shielding? But that means the anyio dependency can't be fully removed.
Or yeah, did you mean the non-IO-related state management should be extracted out from the async closing when shielding with asyncio.shield? That's probably right.
@MarkusSintonen I've attempted to answer my intent here through implementation. See #927. (Currently draft, and approach would need to be applied also to the HTTP/2 implementation, and the closing behaviour in the connection pool)
Big thanks! Now I understand, so yeah, it was indeed about separating the sync state management from the async IO part. I.e. the stream's aclose is the cancellable part, and the state management, being separate from the async flow, cannot be cancelled.
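To make that idea concrete, here is a minimal sketch (my own illustration, not the actual #927 implementation; the class and attribute names are made up) of keeping the state transition synchronous so a cancelled aclose() cannot leave the connection looking reusable:

```python
class Connection:
    """Sketch only: synchronous state management separated from the async close."""

    def __init__(self, stream):
        self._stream = stream
        self._closed = False

    async def aclose(self) -> None:
        if self._closed:
            return
        # Plain synchronous state update: completes even if the caller is
        # cancelled at the await below, so the pool never sees a half-closed
        # connection as reusable.
        self._closed = True
        # The actual IO close is the only cancellable part.
        await self._stream.aclose()
```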
Would you mind testing my latest PR branch (https://github.com/agronholm/anyio/pull/761)? In my own testing, it about halved the performance overhead of anyio.Lock on asyncio, and was actually faster than asyncio's own native Lock class.
Nice, thank you! I can try sometime next week.
@agronholm I ran the tests and it is indeed better with the fix. But it seems there is still some overhead in the implementation when comparing to asyncio:
Baseline (Anyio 4.4.0):
Anyio PR:
Asyncio:
(The above tests were run with the httpcore optimizations from other pending PRs applied, so that issues in httpcore itself are not counted.)
Have you identified where this overhead is coming from? Can we eliminate synchronization as the cause after that PR?
If you run the benchmarks after replacing the locks and semaphores in the asyncio implementation with the corresponding AnyIO primitives, that should shed some light on the issue.
It seems to be the anyio.Lock. When I just replace it with asyncio.Lock the issue disappears (with all the rest still using the anyio variants). (HTTP/2 also utilizes the Semaphore, but I haven't tested that here.)
In the AnyIO variant, the acquisition of an unowned lock would be slower because this operation contains a yield point, unlike in the asyncio variant. Would you mind testing against the AnyIO lock after commenting out the checkpoint calls around lines 1690-1697 in backends/_asyncio.py?
Indeed, it is happening because of the await AsyncIOBackend.cancel_shielded_checkpoint() (at line 1694).
Yeah, so the problem here is that this yield point is needed to maintain the same semantics as with Trio. This is Trio's Lock.acquire():
```python
@enable_ki_protection
async def acquire(self) -> None:
    """Acquire the lock, blocking if necessary."""
    await trio.lowlevel.checkpoint_if_cancelled()
    try:
        self.acquire_nowait()
    except trio.WouldBlock:
        # NOTE: it's important that the contended acquire path is just
        # "_lot.park()", because that's how Condition.wait() acquires the
        # lock as well.
        await self._lot.park()
    else:
        await trio.lowlevel.cancel_shielded_checkpoint()
```
If there wasn't a yield point, then this could become a busy-wait loop:
```python
async def loop(lock):
    while True:
        async with lock:
            print("I'm holding the lock")
```
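For illustration, here is a small runnable example (my own, not from the thread) showing why this matters: with plain asyncio primitives an uncontended acquire never yields, so a tight acquire/release loop never gives other tasks a chance to run.

```python
import asyncio

async def main() -> None:
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0)

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0)  # let the ticker task start running

    lock = asyncio.Lock()
    before = ticks
    for _ in range(10_000):
        async with lock:  # uncontended acquire/release: no yield point on asyncio
            pass

    # With asyncio.Lock the ticker never advances during the loop (prints 0);
    # a Trio/AnyIO lock would yield on every acquisition instead.
    print(f"ticker advanced {ticks - before} times during 10,000 lock cycles")
    task.cancel()

asyncio.run(main())
```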
So this is needed for the CancelScope logic in AnyIO? What if nothing is using a CancelScope in the caller stack? Is the penalty avoidable if the user is not actually using CancelScopes?
No, the checkpoint_if_cancelled() call handles CancelScope support. The cancel_shielded_checkpoint() there prevents the busy-wait loop, which will happen if you run the above loop with asyncio primitives, as they weren't designed with the right principles.
To summarize, asyncio's incorrect design makes it faster here.
Ok I see, thanks for the explanation! So in essence anyio.Lock can be used to work around the issue (do Python core devs know about it? :D). httpcore doesn't utilize the _synchronization.AsyncLock in such ways, so it seems it's safe to replace the anyio.Lock with asyncio.Lock to not pay the penalty.
Only in the asyncio backend. In the AnyIO backend, you would need to use AnyIO primitives, lest you render the whole backend pointless.
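As an illustration of that split (the names and the detection mechanism here are my assumptions, not httpcore's actual code), the synchronization layer could pick the primitive based on which backend is running:

```python
import asyncio
import sniffio  # assumed here as the way to detect the running async library

class AsyncLock:
    """Sketch: native asyncio.Lock on asyncio, anyio.Lock under trio."""

    def __init__(self) -> None:
        self._lock = None  # created lazily, once we know which backend runs

    def _setup(self) -> None:
        if sniffio.current_async_library() == "trio":
            import anyio
            self._lock = anyio.Lock()
        else:
            self._lock = asyncio.Lock()

    async def __aenter__(self) -> "AsyncLock":
        if self._lock is None:
            self._setup()
        await self._lock.acquire()
        return self

    async def __aexit__(self, *exc_info) -> None:
        self._lock.release()
```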
So in essence anyio.Lock can be used to work around the issue (do Python core devs know about it? :D).
Asyncio has so many design issues (the ill-fated uncancellation mechanism being the latest example) that at this point it's beyond saving. Even GvR (who's the original author of asyncio) told me to my face that he'd prefer people start using something else.
What is the status of this PR? Should it be reviewed+merged, closed, or something else?
Perhaps we should see how https://github.com/agronholm/anyio/pull/761 affects the real-world performance before jumping the gun?
Also posting here: I reran the benchmarks with the above fix from AnyIO, using fast_acquire=True for Lock (and Semaphore) in httpcore. The results look pretty good with the AnyIO fix and fast_acquire=True usage.
First, results with all optimization PRs applied (so that other httpcore issues are not counted in).
With fast_acquire=False:
With fast_acquire=True:
Lastly, results without the other optimization PRs applied (how it would look against httpcore master, with its other issues).
With fast_acquire=False:
With fast_acquire=True:
Should we keep AnyIO but instead use the fast_acquire=True mode in _synchronization.py? Cc @tomchristie
FYI, AnyIO 4.5.0 is out now with the fast_acquire option.
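For reference, a minimal sketch of what opting into fast_acquire could look like in a lock wrapper (hypothetical class; httpcore's actual _synchronization code may differ):

```python
import anyio  # requires AnyIO >= 4.5.0 for the fast_acquire option

class AsyncLock:
    def __init__(self) -> None:
        # fast_acquire=True skips the scheduling checkpoint on uncontended
        # acquisition; only safe when the lock is never spun on in a loop
        # that has no other await points.
        self._lock = anyio.Lock(fast_acquire=True)

    async def __aenter__(self) -> "AsyncLock":
        await self._lock.acquire()
        return self

    async def __aexit__(self, *exc_info) -> None:
        self._lock.release()
```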
Should we keep AnyIO but instead use the fast_acquire=True mode in _synchronization.py?
Exercise caution. What does the pull request for that look like? That'll be a useful highlighter.
Here is the PR https://github.com/encode/httpcore/pull/953/files
I see a slight issue there, not among the changes but still: AnyIO 4.5.0 requires trio >= 0.26.1 in its trio extra, but those requirements limit it to < 0.26.0.
Summary
First PR in a series of optimizations to improve the performance of httpcore and even reach the performance level of aiohttp. Related discussion: https://github.com/encode/httpx/issues/3215#issuecomment-2157885121
Previously:
With PR:
Request latency is more stable and the overall duration of the benchmark improves by 3.5x.
The difference diminishes when server latency increases over 100ms or so.
Checklist