We have an older Dagr pipeline that has been run many times (though it has since been updated to use 040d12e).
In very rare, non-reproducible cases we appear to hit a deadlock that causes the pipeline to halt or slow to a glacial pace.
Conditions that may relate to the issue, or could simply be coincidences:
- Some tasks failed, were rescheduled on a subsequent retry, and eventually succeeded
- Some tasks have been started, while others are unknown to the task manager
- In one unbounded case, a job that was estimated to take a few hours ran for days before we terminated it
Final logs (before prematurely cancelling the job) look like:
TaskManager | Warning] ********************************************************************************
TaskManager | Warning] A single step in execution was > 30s (31s).
TaskManager | Warning] Found 14 tasks with status: is unknown
TaskManager | Warning] Found 6 tasks with status: has been started
TaskManager | Warning] Found 49 tasks with status: has succeeded
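For anyone who wants to detect this state automatically, here is a rough Python sketch that pulls the per-status counts out of warnings like the ones above and flags a run whose counts have stopped changing. The function names `summarize_statuses` and `looks_stalled` are hypothetical and not part of Dagr. The regular expression is based only on the three status lines shown here, and treating "has succeeded" as the only finished state is an assumption.

```python
import re
from typing import Dict

# Matches lines such as "... Found 14 tasks with status: is unknown".
# The pattern is inferred from the log excerpt above, not from Dagr's source.
STATUS_LINE = re.compile(r"Found (\d+) tasks? with status: (.+?)\s*$")


def summarize_statuses(log_text: str) -> Dict[str, int]:
    """Return a mapping of status text to task count for one log snapshot."""
    counts: Dict[str, int] = {}
    for line in log_text.splitlines():
        match = STATUS_LINE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def looks_stalled(previous: Dict[str, int], current: Dict[str, int]) -> bool:
    """Heuristic: unfinished tasks remain and no count changed between snapshots.

    Assumption: "has succeeded" is the only finished state; other terminal
    statuses (for example failures) would need to be excluded here too.
    """
    unfinished = sum(n for status, n in current.items() if status != "has succeeded")
    return unfinished > 0 and previous == current
```

Comparing two snapshots taken some minutes apart would give a cheap signal that the run is stuck rather than just slow, although it cannot say why.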
Because this is rare, and we can enforce TTL policies on runs of this pipeline, it's not critical for us that the underlying issue gets fixed.
Simply posting the issue in case anyone else hits something similar and wants to feel less alone!
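As an illustration of the TTL workaround, the sketch below wraps a pipeline launch in a hard time limit using Python's standard `subprocess` module. This is not how our setup is actually implemented; the helper name `run_with_ttl` is hypothetical, and the exit code 124 is borrowed from the GNU `timeout` convention.

```python
import subprocess
from typing import List

# Exit code used to signal a timeout, following the GNU `timeout` convention.
TIMEOUT_EXIT_CODE = 124


def run_with_ttl(command: List[str], ttl_seconds: float) -> int:
    """Run `command`, killing it if it is still running after `ttl_seconds`.

    Returns the command's exit code, or TIMEOUT_EXIT_CODE if the TTL expired.
    """
    try:
        # subprocess.run kills and reaps the child when the timeout expires,
        # then raises TimeoutExpired.
        completed = subprocess.run(command, timeout=ttl_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE
```

The command list passed in would be whatever launches the pipeline in your environment. Note that only the direct child process is killed, so anything the pipeline itself spawned may need separate cleanup.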