fulcrumgenomics / dagr

A Scala-based DSL and framework for writing and executing bioinformatics pipelines as Directed Acyclic GRaphs
MIT License

There may be a deadlock in the task scheduler that freezes or slows pipeline execution #401

Open clintval opened 2 years ago

clintval commented 2 years ago

We have an older Dagr pipeline that has been run many times, though it has since been updated to use 040d12e.

In very rare, non-reproducible cases, we appear to hit a deadlock that causes the pipeline to halt or slow to a glacial pace.

Conditions that may relate to the issue, or could simply be coincidences:

Final logs (before prematurely cancelling the job) look like:

TaskManager | Warning] ********************************************************************************
TaskManager | Warning] A single step in execution was > 30s (31s).
TaskManager | Warning] Found 14 tasks with status: is unknown
TaskManager | Warning] Found 6 tasks with status: has been started
TaskManager | Warning] Found 49 tasks with status: has succeeded

Because this is rare, and we can enforce TTL policies on runs of this pipeline, it's not critical that we fix any underlying issue.
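
For anyone who wants a similar safety net, a TTL can be enforced from outside the pipeline by wrapping its launch command in a timeout. The following is a minimal sketch in Python and is not part of Dagr: `run_with_ttl` is a hypothetical helper, and the command and TTL you pass it are assumptions to adapt to your own job runner.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_ttl(command: Sequence[str], ttl_seconds: float) -> Optional[int]:
    """Run `command` and return its exit code, or None if it exceeded the TTL.

    Hypothetical helper for illustration; not part of Dagr.
    """
    try:
        completed = subprocess.run(command, timeout=ttl_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        print(f"Command exceeded TTL of {ttl_seconds}s and was killed", file=sys.stderr)
        return None
    return completed.returncode
```

A caller can treat `None` as "hung or too slow" and resubmit or alert. Note that only the direct child process is killed, so tasks that spawn their own subprocesses may need process-group handling or a scheduler-level time limit instead.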

Simply posting the issue in case anyone else hits something similar and wants to feel less alone!