dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Dagster Daemon run monitor not terminating timed-out runs after reporting them as failed #16535

Open ACMoretxj opened 1 year ago

ACMoretxj commented 1 year ago

Dagster version

1.3.13

What's the issue?

The run monitor in dagster-daemon (dagster/_daemon/monitoring/run_monitoring.py) forcibly marks runs in STARTING status as failed once they exceed the configured start timeout. However, the monitor_starting_run function only reports a failure event without terminating the failed run.

This mechanism works well until it is combined with an external resource management system such as Kubernetes. If I use the K8sRunLauncher from dagster-k8s and there aren't enough resources (or there are affinity/toleration issues), the run stays in STARTING status for a very long time, and the dangling Jobs/Pods remain even after the run monitor marks the run as failed.

What did you expect to happen?

The provisioned resources should be recycled after STARTING runs are reported as failed by the run monitor, i.e. run_launcher.terminate should be invoked before or after instance.report_run_failed(run, msg) in dagster/_daemon/monitoring/run_monitoring.py#monitor_starting_run.
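
A minimal sketch of the requested behavior, based only on the calls quoted above (this is an illustration of the request, not the actual Dagster source; the timeout check and logging details are elided):

```python
def monitor_starting_run(instance, run, logger):
    # Assume the run has already exceeded start_timeout_seconds at this point.
    msg = (
        f"Run {run.run_id} exceeded the maximum allowed time in STARTING status; "
        "marking it as failed."
    )
    logger.info(msg)
    instance.report_run_failed(run, msg)
    # Requested addition: also tear down whatever the launcher provisioned
    # (e.g. the pending Kubernetes Job/Pods), before or after reporting failure.
    instance.run_launcher.terminate(run.run_id)
```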

How to reproduce?

  1. Enable the run monitoring mechanism of the Dagster daemon;
  2. Use the K8sRunLauncher from dagster-k8s to launch jobs;
  3. Assign a very large resource request (or an unsatisfiable affinity) to the K8s Job (see the sketch below);

You'll then find that new runs stay in STARTING status, turn to FAILURE status after several minutes, and the corresponding Jobs/Pods remain in the Kubernetes cluster, pending forever.
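
For step 3, a hypothetical job definition that triggers the pending state (the op and job names are made up; the dagster-k8s/config tag is the standard way to attach per-job Kubernetes container config):

```python
from dagster import job, op


@op
def noop_op():
    pass


# The resource request is deliberately far larger than any node can satisfy,
# so the launched Kubernetes Job stays Pending and the Dagster run stays in
# STARTING until the run monitor marks it as failed.
@job(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"cpu": "512", "memory": "4000Gi"},
                }
            }
        }
    }
)
def oversized_request_job():
    noop_op()
```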

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

vDMG commented 3 months ago

I'm facing the exact same issue. A workaround would be to create a cronjob that frequently scans for and removes pending Kubernetes Jobs based on the start_timeout_seconds value. @ACMoretxj have you found a solution?
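
A rough sketch of such a cleanup script, assuming the launched Jobs follow the usual dagster-run-&lt;run_id&gt; naming convention; the namespace and timeout values below are placeholders and START_TIMEOUT is meant to mirror the start_timeout_seconds setting:

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

NAMESPACE = "dagster"                    # placeholder: namespace the runs are launched into
START_TIMEOUT = timedelta(seconds=600)   # placeholder: mirror start_timeout_seconds


def cleanup_stale_run_jobs():
    config.load_incluster_config()  # use load_kube_config() when running outside the cluster
    batch = client.BatchV1Api()
    now = datetime.now(timezone.utc)
    for k8s_job in batch.list_namespaced_job(NAMESPACE).items:
        name = k8s_job.metadata.name or ""
        created = k8s_job.metadata.creation_timestamp
        still_pending = not k8s_job.status.succeeded and not k8s_job.status.failed
        if name.startswith("dagster-run-") and still_pending and now - created > START_TIMEOUT:
            # Delete the stale Job and propagate the deletion to its Pods.
            batch.delete_namespaced_job(name, NAMESPACE, propagation_policy="Background")


if __name__ == "__main__":
    cleanup_stale_run_jobs()
```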