aiddata / geo-datasets

Scripts for preparing datasets in GeoQuery
http://geoquery.org
MIT License

Reliable timeout for Prefect tasks/deployments #159

Open sgoodm opened 1 year ago

sgoodm commented 1 year ago

This may ultimately trace back to several different issues within Prefect, but I think addressing the two scenarios below would cover the majority of our related problems.

Scenario 1: One task crashes while all others complete, but the run hangs. It seems the crashed task does not exit properly (no retries are attempted). A timeout could help end the task, allowing Prefect to retry it or at least recover the broader flow and exit in a crashed state.
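For reference, Prefect's `@task` decorator already accepts `timeout_seconds` and `retries`; whether that is enough in this scenario is the open question. As a rough sketch of what a hard timeout could look like independent of Prefect, the work can run in a child process that is killed once it exceeds its time budget. The function name `run_with_hard_timeout` and its status strings are hypothetical, not part of Prefect or this repo.

```python
import subprocess
import sys


def run_with_hard_timeout(code: str, timeout_s: float, retries: int = 0) -> str:
    """Run Python source in a child process, killing it if it exceeds timeout_s.

    Hypothetical helper: returns "completed" on the first successful attempt,
    otherwise the status of the last attempt ("failed" or "timed_out").
    """
    status = "failed"
    for _attempt in range(retries + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses, so a hung task cannot linger.
            result = subprocess.run([sys.executable, "-c", code], timeout=timeout_s)
        except subprocess.TimeoutExpired:
            status = "timed_out"
            continue
        if result.returncode == 0:
            return "completed"
        status = "failed"
    return status
```

Because the timeout is enforced by the parent on a separate process, it does not rely on the hung task cooperating, which is the property this scenario seems to need.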

Scenario 2: Issues with the scheduler/agent result in runs or tasks being "lost" and hanging. The direct causes are somewhat nebulous and seem to range from network communication lapses, to the agent being overwhelmed by tasks running directly on it (via the Dask task runner rather than on the HPC), to simply very long-running jobs.
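One possible way to catch "lost" runs, sketched under the assumption that something records when each run last reported progress, is a watchdog that flags any run that has been silent for too long. The `last_seen` mapping and the `find_stale_runs` helper are hypothetical; how those timestamps would be collected from Prefect is not addressed here.

```python
from datetime import datetime, timedelta


def find_stale_runs(
    last_seen: dict[str, datetime],
    now: datetime,
    max_silence: timedelta,
) -> list[str]:
    """Return the IDs of runs that have been silent for longer than max_silence.

    Hypothetical helper: last_seen maps a run ID to the time it last reported.
    """
    return sorted(
        run_id for run_id, seen in last_seen.items() if now - seen > max_silence
    )
```

The flagged runs could then be cancelled or marked as crashed by whatever cleanup step fits the deployment.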