wonjae-2352 opened 12 months ago
It could be that the process was killed suddenly without a chance to write out the events that would move the run into a failed state. Run monitoring is used to detect this state: https://docs.dagster.io/deployment/run-monitoring#detecting-run-worker-crashes
If the k8s pod is still present, it's possible the process is stuck, which can be investigated further with https://github.com/dagster-io/dagster/discussions/14771
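For reference, run monitoring is enabled in the Dagster instance settings. Below is a minimal sketch; the field names follow the run-monitoring settings documented at the page linked above, and the timeout values are illustrative, not recommendations:

```yaml
# dagster.yaml (instance config). With the Helm chart, the equivalent
# settings live under dagsterDaemon.runMonitoring in values.yaml.
run_monitoring:
  enabled: true
  # Fail a run if its run worker has not started within this window
  start_timeout_seconds: 300
  # How often the daemon checks run workers for crashes
  poll_interval_seconds: 120
```

With this enabled, the daemon should detect a run worker whose pod died without emitting a failure event and move the run out of the `STARTED` state.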
Thanks! Let me take a look and see if we can figure out the issue using that method :)
Dagster version
1.4.16
What's the issue?
We have some Dagster runs that hang without terminating even though an internal step has failed (please see the attached image). The actual job failed after 4 hours, but the Dagster run was still in the 'started' state after 23 hours.
What did you expect to happen?
Dagster should have failed the run instead of leaving it in a running state.
How to reproduce?
Many of our new jobs have this issue.
Deployment type
Dagster Helm chart
Deployment details
We deploy Dagster on Azure Kubernetes Service (AKS) with Argo CD.
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.