Closed chrisguidry closed 1 day ago
Current state as of 2024-06-24:
I've fixed a number of client-side race conditions and gaps that caused `TaskWaiter`s to miss completion events, and now the next issue I'm tracking down has to do with what happens server-side during reconnections to the events websocket.
In my testing, I believe I've eliminated the source of the hangs by adding an additional websocket backfill stage to catch any stragglers. This hasn't been released to production yet, and I want to spend a little more time testing and thinking about optimizations before I close this.
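To illustrate the general idea of an extra backfill pass, here is a minimal sketch. It is not the server's actual implementation: the `Event` type, the `merge_backfill_and_live` function, and the shape of the inputs are all hypothetical. The assumption is that a second backfill query overlaps the window between the first backfill and the start of live streaming, and that events carry a unique id so duplicates can be dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; only the fields this sketch needs."""
    id: str
    occurred: float


def merge_backfill_and_live(first_backfill, buffered_live, second_backfill):
    """Combine two backfill passes with buffered live events.

    The second backfill pass is assumed to overlap the gap between the end
    of the first pass and the moment live streaming began, so it can pick up
    stragglers. Duplicates are removed by event id and the result is ordered
    by occurrence time.
    """
    seen = set()
    ordered = []
    candidates = [*first_backfill, *second_backfill, *buffered_live]
    for event in sorted(candidates, key=lambda e: e.occurred):
        if event.id not in seen:
            seen.add(event.id)
            ordered.append(event)
    return ordered


# Event "c" fell into the gap: the first backfill ended before it and the
# live stream started after it. Only the second backfill pass sees it.
first = [Event("a", 1.0), Event("b", 2.0)]
live = [Event("d", 4.0)]
second = [Event("b", 2.0), Event("c", 3.0)]
merged_ids = [e.id for e in merge_backfill_and_live(first, live, second)]
```

Deduplicating by id is what makes the overlapping second pass safe: it can be as generous as needed without delivering the same event twice.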
A counterpart to https://github.com/PrefectHQ/prefect/issues/14092; when running a load test suite against Cloud with deeply nested tasks:
We can see hangs when we start to get over 100 concurrent `distance` tasks that can't be accounted for by the deadlock phenomenon in #14092. Our theories at the moment are that these are missed events on the events websockets due to either: a) a race condition client-side between connecting to the socket and waiting for the first events, or b) a race condition server-side relating to the gap between backfilling events and catching up with real-time events.
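Theory (a) can be sketched in isolation. The `CompletionTracker` class below is a hypothetical stand-in and not Prefect's `TaskWaiter`; it only shows one common way to avoid losing an event that arrives before anyone is waiting for it: record every completion as it is observed, and have waiters consult that record before blocking.

```python
import asyncio


class CompletionTracker:
    """Hypothetical waiter that tolerates events arriving before the wait."""

    def __init__(self):
        self._completed = set()   # ids of tasks whose completion was observed
        self._waiters = {}        # task id -> asyncio.Event for blocked waiters

    def observe(self, task_id):
        # Record every completion, whether or not anyone is waiting yet.
        self._completed.add(task_id)
        waiter = self._waiters.get(task_id)
        if waiter is not None:
            waiter.set()

    async def wait_for(self, task_id, timeout=None):
        # Register interest first, then check the record, so a completion
        # observed before this call is not missed.
        waiter = self._waiters.setdefault(task_id, asyncio.Event())
        if task_id in self._completed:
            return True
        try:
            await asyncio.wait_for(waiter.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo():
    tracker = CompletionTracker()

    # The completion arrives before anyone waits on it.
    tracker.observe("early")
    early = await tracker.wait_for("early", timeout=0.1)

    # The waiter is already blocked when the completion arrives.
    pending = asyncio.create_task(tracker.wait_for("late", timeout=1))
    await asyncio.sleep(0)  # let the waiter start
    tracker.observe("late")
    late = await pending

    # No completion ever arrives, so the wait times out instead of hanging.
    missing = await tracker.wait_for("never", timeout=0.05)
    return early, late, missing


result = asyncio.run(demo())
```

A waiter that only listens for events delivered after it starts waiting would hang in the "early" case, which is the kind of missed event described above. The timeout is there so a genuinely missing event surfaces as a failure rather than an indefinite hang.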