The run monitor in dagster-daemon (dagster/_daemon/monitoring/run_monitoring.py) forcibly marks runs in STARTING status as failed if they remain in that status longer than the configured timeout. However, the monitor_starting_run function only reports a failure event without terminating the failed run.
This mechanism works well until it is combined with an external resource-management system such as Kubernetes. If I use K8sRunLauncher from dagster-k8s and there aren't enough resources (or there are affinity/toleration issues), the run will stay in STARTING status for a very long time, and the dangling Jobs/Pods remain in the cluster even after the run monitor marks the run as failed.
What did you expect to happen?
The provisioned resources should be recycled once STARTING runs are reported as failed by the run monitor, i.e. run_launcher.terminate should be invoked before or after instance.report_run_failed(run, msg) in dagster/_daemon/monitoring/run_monitoring.py#monitor_starting_run.
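The expected behavior can be sketched as follows. This is a stand-alone illustration, not Dagster's actual code: the classes are stand-ins, and only the names run_launcher.terminate and report_run_failed are taken from the issue above.

```python
# Sketch of the proposed monitor_starting_run behavior: terminate the
# launched resources first, then report the run as failed. FakeRunLauncher
# and FakeInstance are stand-ins for Dagster's real classes.
import time


class FakeRunLauncher:
    """Stand-in for K8sRunLauncher: records terminate() calls."""
    def __init__(self):
        self.terminated = []

    def terminate(self, run_id):
        # A real launcher would delete the K8s Job (and its Pods) here.
        self.terminated.append(run_id)
        return True


class FakeInstance:
    """Stand-in for DagsterInstance with just what the monitor needs."""
    def __init__(self, run_launcher):
        self.run_launcher = run_launcher
        self.failed = {}

    def report_run_failed(self, run, message):
        self.failed[run["run_id"]] = message


def monitor_starting_run(instance, run, start_timeout_seconds=300):
    """If a run has been STARTING longer than the timeout, terminate it
    (recycling the launcher's resources) and only then report failure."""
    elapsed = time.time() - run["start_time"]
    if run["status"] == "STARTING" and elapsed > start_timeout_seconds:
        # The missing step this issue asks for:
        instance.run_launcher.terminate(run["run_id"])
        instance.report_run_failed(
            run, f"Run timed out after {elapsed:.0f}s in STARTING status"
        )


launcher = FakeRunLauncher()
instance = FakeInstance(launcher)
stuck_run = {"run_id": "abc123", "status": "STARTING",
             "start_time": time.time() - 600}
monitor_starting_run(instance, stuck_run)
print(launcher.terminated)          # ['abc123']
print("abc123" in instance.failed)  # True
```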
How to reproduce?
Enable the run monitoring mechanism of the Dagster daemon;
Use K8sRunLauncher in dagster-k8s to launch jobs;
Assign a very large resource request (or an unsatisfiable affinity) to the K8s Job;
Then the new runs will stay in STARTING status, turn to FAILURE status after several minutes, and the Jobs/Pods will still be in the Kubernetes cluster, pending forever.
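Step 3 above can be made concrete with a Job spec whose resource request no node can satisfy. This is a sketch only: the image name is a placeholder, and the manifest Dagster actually generates has more fields.

```python
# Sketch of a Kubernetes Job manifest with an impossible resource request.
# Its Pod stays Pending forever, so the Dagster run stays in STARTING
# status until the run monitor times it out.
unschedulable_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "dagster-run-demo"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "dagster",
                    "image": "my-user-code-image",  # placeholder image name
                    "resources": {
                        # No node can satisfy these values, so the scheduler
                        # never places the Pod.
                        "requests": {"cpu": "10000", "memory": "100Ti"},
                    },
                }],
            }
        }
    },
}

requests = (unschedulable_job["spec"]["template"]["spec"]
            ["containers"][0]["resources"]["requests"])
print(requests)  # {'cpu': '10000', 'memory': '100Ti'}
```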
Deployment type
Dagster Helm chart
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
I'm facing the exact same issue.
A workaround would be to create a CronJob that periodically scans for and removes Jobs whose Pods have been pending longer than the start_timeout_seconds value.
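The cleanup logic such a CronJob would run could be sketched like this. It is a stand-alone sketch: the in-memory job list and the final print stand in for Kubernetes API calls (e.g. BatchV1Api.list_namespaced_job / delete_namespaced_job from the kubernetes Python client), and the timeout would come from the daemon's run_monitoring config.

```python
# Sketch of the workaround: find K8s Jobs whose Pods have been Pending
# longer than start_timeout_seconds and delete them. The job list below is
# fake data standing in for a BatchV1Api.list_namespaced_job() result.
import time

START_TIMEOUT_SECONDS = 300  # should match run_monitoring.start_timeout_seconds

jobs = [
    {"name": "dagster-run-1", "phase": "Pending",
     "created": time.time() - 900},   # stuck well past the timeout
    {"name": "dagster-run-2", "phase": "Running",
     "created": time.time() - 900},   # healthy, must be kept
    {"name": "dagster-run-3", "phase": "Pending",
     "created": time.time() - 60},    # still within the timeout
]


def find_stale_jobs(jobs, timeout, now=None):
    """Return names of Jobs that have been Pending longer than `timeout`."""
    now = now if now is not None else time.time()
    return [j["name"] for j in jobs
            if j["phase"] == "Pending" and now - j["created"] > timeout]


stale = find_stale_jobs(jobs, START_TIMEOUT_SECONDS)
for name in stale:
    # A real script would delete the Job here (with background propagation
    # so its Pods are removed too) instead of printing.
    print(f"deleting {name}")  # deleting dagster-run-1
```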
@ACMoretxj have you found a solution?
Dagster version
1.3.13