This may ultimately be tied to several different issues within Prefect, but I think addressing one or two scenarios would cover the majority of our related issues.
Scenario 1: A task crashes while all others complete, but the run hangs. It seems the crashed task does not exit properly (no retries are made). A timeout could help end the task and allow Prefect to retry it, or at least let the broader flow recover and exit in a crashed state.
Incorporating a task timeout might resolve this.
This could also be tied to how we handle task futures in datasets.py.
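To make the futures point concrete, here is a minimal sketch of waiting on a batch of futures with a deadline, so that one task that never returns cannot block the whole run. It uses only the standard library's concurrent.futures rather than Prefect's own futures, and `collect_with_timeout` is a hypothetical helper name, not something that exists in datasets.py. Prefect's task decorator also accepts a `timeout_seconds` argument, which may be the simpler fix if the hang is inside the task itself.

```python
import concurrent.futures as cf


def collect_with_timeout(futures, timeout_s):
    """Wait at most timeout_s seconds for the given futures.

    Returns (results, errors, unfinished) so the caller can decide
    whether to retry, fail the run, or carry on with partial results.
    Hypothetical helper for illustration only.
    """
    done, not_done = cf.wait(futures, timeout=timeout_s)

    results, errors = [], []
    for fut in done:
        if fut.cancelled():
            errors.append(cf.CancelledError())
            continue
        exc = fut.exception()
        if exc is None:
            results.append(fut.result())
        else:
            # A crashed task shows up here instead of being silently lost.
            errors.append(exc)

    for fut in not_done:
        # cancel() only succeeds for work that has not started yet;
        # an already-running task keeps running and is reported as unfinished.
        fut.cancel()

    return results, errors, list(not_done)
```

The caller can then mark the run as failed when `errors` or `unfinished` is non-empty instead of waiting indefinitely. Note that `results` comes back in completion order, not submission order.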
Scenario 2: Issues with the scheduler/agent result in runs or tasks being "lost" and hanging. The direct causes are a bit nebulous and seem to range from network communication lapses, to the agent being overwhelmed by tasks running directly on it (via the Dask task runner rather than HPC), to simply very long-running jobs.
Adhering to best practices for deployments (avoiding heavy overlaps, only using HPC task runners, and handling errors better in the deployments themselves) may significantly reduce how often this scenario occurs.
Without a clear root cause, it might be worth exploring a way to detect runs that are stuck as a result of this scenario so that they can be cancelled and restarted.
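A rough sketch of what that detection could look like: flag any run that is still in an active state but has not reported an update within some threshold. Everything here is an assumption for illustration. `RunInfo`, `ACTIVE_STATES`, and `find_stuck_runs` are hypothetical names, and in practice the records would have to be fetched from the Prefect API, which is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed names for the states in which a run is expected to make progress.
ACTIVE_STATES = {"RUNNING", "PENDING"}


@dataclass
class RunInfo:
    """Hypothetical snapshot of a run; not a Prefect type."""
    run_id: str
    state: str
    last_update: datetime


def find_stuck_runs(runs, max_idle, now=None):
    """Return ids of active runs with no update for longer than max_idle."""
    now = now or datetime.now(timezone.utc)
    return [
        run.run_id
        for run in runs
        if run.state in ACTIVE_STATES and now - run.last_update > max_idle
    ]
```

The threshold would need to be generous enough that legitimately long-running jobs are not flagged, since those are one of the suspected triggers above.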