There may be a deadlock in the task scheduler that freezes or slows pipeline execution

We have an older Dagr pipeline that has been run many times (although it has since been updated to use 040d12e).

In very rare, non-reproducible cases, we appear to hit a deadlock that causes the pipeline to halt or slow to a glacial pace.

Conditions that may be related to the issue, or may simply be coincidences:

- Some tasks were rescheduled on a retry after failing, and eventually succeeded
- Some tasks have been started, while others are unknown to the task manager
- In one unbounded case, a job that was estimated to take a few hours ran for days before we terminated it

The final logs (just before we cancelled the job prematurely) look like:

TaskManager | Warning] ********************************************************************************
TaskManager | Warning] A single step in execution was > 30s (31s).
TaskManager | Warning] Found 14 tasks with status: is unknown
TaskManager | Warning] Found 6 tasks with status: has been started
TaskManager | Warning] Found 49 tasks with status: has succeeded
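
For anyone who wants to watch for this condition automatically, here is a minimal sketch that parses the "Found N tasks with status" summary lines from the excerpt above and flags a possible stall when two consecutive summaries are identical and still contain unfinished tasks. The function names and the stall heuristic are my own assumptions, not part of Dagr, and the pattern is based only on the log lines shown here.

```python
import re

# Matches the summary lines seen in the excerpt, e.g.
# "... Found 14 tasks with status: is unknown"
STATUS_LINE = re.compile(r"Found (\d+) tasks? with status: (.+?)\s*$")


def status_counts(log_lines):
    """Return {status text: task count} for one block of summary lines."""
    counts = {}
    for line in log_lines:
        match = STATUS_LINE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def is_stalled(previous, current):
    """Heuristic (an assumption): nothing changed between two summaries
    and at least one task has not yet succeeded."""
    unfinished = sum(n for status, n in current.items() if status != "has succeeded")
    return unfinished > 0 and previous == current


excerpt = [
    "TaskManager | Warning] Found 14 tasks with status: is unknown",
    "TaskManager | Warning] Found 6 tasks with status: has been started",
    "TaskManager | Warning] Found 49 tasks with status: has succeeded",
]
counts = status_counts(excerpt)
```

Identical consecutive summaries do not prove a deadlock, since a long-running task produces the same picture, so this is only useful as a prompt to go and look.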

Because this is rare, and because we can enforce a TTL policy on runs of this pipeline, fixing the underlying issue is not critical for us.

I'm simply posting the issue in case anyone else hits something similar and wants to feel less alone!
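
As a rough illustration of the TTL approach, the sketch below wraps a pipeline launch in a hard time limit using Python's standard `subprocess` module. The helper name `run_with_ttl` and the command in the usage comment are hypothetical; this is not how Dagr itself is invoked or configured.

```python
import subprocess


def run_with_ttl(command, ttl_seconds):
    """Run `command` (a list of arguments) under a time limit.

    Returns the process exit code, or None if the TTL expired.
    subprocess.run kills the child process itself when the timeout
    is reached, then raises TimeoutExpired.
    """
    try:
        completed = subprocess.run(command, timeout=ttl_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# Hypothetical usage: allow the pipeline six hours before giving up.
# exit_code = run_with_ttl(["./run-pipeline.sh"], ttl_seconds=6 * 60 * 60)
# if exit_code is None:
#     print("Pipeline exceeded its TTL and was terminated")
```

Note that the timeout only kills the direct child process. If the pipeline spawns its own workers, or submits jobs to a cluster, those need their own cleanup.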

fulcrumgenomics / dagr

Issue #401