FredHutch / shiny-cromwell

Shiny app for interacting with the Fred Hutch instances of Cromwell. Contact Amy Paguirigan
https://proof.fredhutch.org
MIT License

No notification/error for stalled workflow (No jobs in running/queued up) #107

Open sitapriyamoorthi opened 1 month ago

sitapriyamoorthi commented 1 month ago

I had a workflow running whose last task output was written about 12 hours ago.

There don't appear to be any jobs queued up when I checked on rhino.

The app says that a task is running, and the job has a valid job ID, but nothing has been output since 11 pm (I'm writing this at 10:30 am the following day).

The app does not show that the job has failed: there is no job failure metadata and no output on stderr for the last job/task.

atombaby commented 3 weeks ago

Was there an rc file in the execution directory?

I believe this is similar to the other "zombie job" problems, where Cromwell is unable to determine the state of the job. This usually happens because the job has exited and been pruned from the Slurm job queue, often without the rc file having been written in the execution directory.

There are, and have been, a few issues in Cromwell about this. This issue is particularly relevant, but it has been marked closed without much indication of what was fixed.

atombaby commented 3 weeks ago

Adding this for future reference: it's a script someone wrote to track down zombie jobs in an LSF environment. This could be adapted for our environment.
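
For the Slurm case, a check along these lines might be a starting point. This is only a minimal sketch and not the LSF script mentioned above. It assumes that the `rc` file sits directly in the call's execution directory and that `squeue` is on the PATH of the machine where it runs (e.g. rhino). The function names `job_in_queue`, `classify`, and `check` are hypothetical, not part of Cromwell or this app.

```python
import subprocess
from pathlib import Path


def job_in_queue(job_id, runner=subprocess.run):
    """Return True if squeue still lists the job.

    `runner` is injectable so the logic can be exercised without Slurm.
    A job that has been pruned from the queue typically makes squeue
    exit non-zero or print nothing; both cases are treated as "not queued".
    """
    result = runner(
        ["squeue", "--noheader", "--jobs", str(job_id)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() != ""


def classify(rc_exists, in_queue):
    """Map the two observations onto a coarse job state."""
    if rc_exists:
        return "finished"  # the job wrote its return code
    if in_queue:
        return "active"    # Slurm still knows about the job
    return "zombie"        # no rc file and no longer in the queue


def check(execution_dir, job_id, runner=subprocess.run):
    """Classify one job from its execution directory and Slurm job ID."""
    rc_exists = (Path(execution_dir) / "rc").is_file()
    return classify(rc_exists, job_in_queue(job_id, runner))
```

Calling `check(execution_dir, job_id)` with the execution directory of the stalled task and its Slurm job ID would return `"zombie"` in the situation described in this issue: no rc file and nothing in the queue. Any such result would still need to be confirmed by hand before aborting the workflow.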