PrefectHQ / prefect

Prefect is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines
https://prefect.io
Apache License 2.0

Task worker hangs when waiting for futures to resolve against Cloud #14098

Closed: chrisguidry closed this issue 1 day ago

chrisguidry commented 1 week ago

A counterpart to https://github.com/PrefectHQ/prefect/issues/14092: when running a load test suite against Cloud with deeply nested tasks like the following:

from pydantic import BaseModel

from prefect import task  # type: ignore
from prefect.task_worker import serve

@task
def add(x: float, y: float) -> float:
    return x + y

@task
def subtract(x: float, y: float) -> float:
    return x - y

@task
def square(x: float) -> float:
    return x**2

@task
def square_root(x: float) -> float:
    return x**0.5

class Point(BaseModel, frozen=True):
    x: float
    y: float

    def __str__(self) -> str:
        return f"({self.x}, {self.y})"

# Each distance() call fans out a tree of nested background tasks and blocks
# on the final result.
@task
def distance(a: Point, b: Point) -> float:
    return square_root.delay(
        add.delay(
            square.delay(subtract.delay(b.x, a.x)),
            square.delay(subtract.delay(b.y, a.y)),
        )
    ).result()

if __name__ == "__main__":
    from tasks import add, distance, square, square_root, subtract

    # Serve all five tasks from a single task worker, with no concurrency limit.
    serve(
        add,
        subtract,
        square,
        square_root,
        distance,
        limit=None,
        status_server_port=4422,
    )
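For context, the load itself is generated by a separate client submitting many distance calls concurrently, roughly along these lines (a minimal sketch; the count and coordinates are illustrative, and it assumes the script above is saved as tasks.py):

import random

from tasks import Point, distance

# Assumes PREFECT_API_URL / PREFECT_API_KEY point at the same Cloud workspace
# the task worker above is connected to.

def main(n: int = 150) -> None:
    # Kick off well over 100 concurrent distance tasks against the worker.
    futures = [
        distance.delay(
            Point(x=random.random(), y=random.random()),
            Point(x=random.random(), y=random.random()),
        )
        for _ in range(n)
    ]

    # Block until every nested pipeline of subtract/square/add/square_root
    # tasks resolves; this is where the hangs show up.
    for future in futures:
        print(future.result())

if __name__ == "__main__":
    main()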

Once we get over 100 concurrent distance tasks, we see hangs that can't be accounted for by the deadlock phenomenon in #14092. Our current theories are that these are missed events on the events websocket, due to either:

a) a client-side race condition between connecting to the websocket and waiting for the first events (a toy sketch of this case follows), or
b) a server-side race condition in the gap between backfilling historical events and catching up with real-time events
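As a rough illustration of (a), and not Prefect's actual client code, here is a toy asyncio sketch of how a completion event that fires before the subscription is fully established can leave the waiter blocked forever:

import asyncio

async def waiter(events: asyncio.Queue, subscribed: asyncio.Event) -> None:
    await asyncio.sleep(0.1)           # simulated connect/handshake latency
    subscribed.set()                   # only now is the subscription live
    print("saw:", await events.get())  # never reached in this scenario

async def emitter(events: asyncio.Queue, subscribed: asyncio.Event) -> None:
    # The task finishes immediately; with no live subscription yet, the
    # completion event is dropped instead of delivered.
    if subscribed.is_set():
        await events.put("task-run-completed")

async def main() -> None:
    events: asyncio.Queue = asyncio.Queue()
    subscribed = asyncio.Event()
    try:
        await asyncio.wait_for(
            asyncio.gather(waiter(events, subscribed), emitter(events, subscribed)),
            timeout=1.0,
        )
    except asyncio.TimeoutError:
        print("waiter hung: the event fired before its subscription was live")

asyncio.run(main())

The server-side variant (b) is the same window seen from the other end: events emitted between the backfill query and the live stream catching up are never replayed to the subscriber.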

chrisguidry commented 2 days ago

Current state as of 2024-06-24:

I've fixed a number of client-side race conditions and gaps that caused TaskWaiters to miss completion events; the next issue I'm tracking down is what happens server-side during reconnections to the events websocket.

chrisguidry commented 1 day ago

In my testing, I believe I've eliminated the source of the hangs by adding an additional websocket backfill stage to catch any stragglers. This hasn't been released to production yet, and I want to spend a little more time testing and thinking about optimizations before I close this.
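The general shape of that safeguard, sketched here with toy in-memory stand-ins rather than the actual events client, is to make one more pass over already-recorded events after the live subscription is established, so a completion that landed in the connect window is still observed:

import asyncio

async def wait_for_completion(recorded: list, live: asyncio.Queue) -> str:
    # 1. the live subscription is considered established from this point on
    # 2. additional backfill pass: catch any straggler completions recorded
    #    before or while the subscription was being set up
    if "task-run-completed" in recorded:
        return "task-run-completed (seen via backfill)"
    # 3. otherwise, wait on the live stream as before
    return await live.get()

async def main() -> None:
    # Toy stand-ins: events the server already recorded, and events pushed
    # over the live websocket after subscribing.
    recorded: list = []
    live: asyncio.Queue = asyncio.Queue()

    # The completion fired before the waiter ever subscribed, so it never
    # reaches the live queue; without the backfill pass this waiter would hang.
    recorded.append("task-run-completed")
    print(await asyncio.wait_for(wait_for_completion(recorded, live), timeout=1.0))

asyncio.run(main())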