Job is dangling without termination when step failed

wonjae-2352 commented 12 months ago

Dagster version

1.4.16

What's the issue?

We are having some dagster runs dangling without termination even if the internal step has failed (please see the attached image). The actual job had failed after 4 hours but the dagster run still in ‘started’ state after 23hrs

What did you expect to happen?

Dagster failed the job instead of staying being-running state

How to reproduce?

Many of our new jobs have this issue

Deployment type

Dagster Helm chart

Deployment details

We serve Dagster using azure k8s service & argo cd

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

alangenfeld commented 12 months ago

it could be that the process was killed suddenly without at a chance to write out the events to move the run in to a failed state. External run monitoring is used to detect this state: https://docs.dagster.io/deployment/run-monitoring#detecting-run-worker-crashes

If the k8s pod is still present, its possible the process is stuck which can be investigated further with https://github.com/dagster-io/dagster/discussions/14771

wonjae-2352 commented 11 months ago

Thanks! Let me take a look and see if we can figure out the issue through the method :)

dagster-io / dagster