DataBiosphere / analysis_pipeline_WDL

Collection of WDL workflows based on the University of Washington TOPMed DCC Best Practices for GWAS. The WDL structure was based upon CWLs written by the Seven Bridges development team.

Cromwell memory hogging may result in sigkills + system wide Docker lockups #1

Closed by aofarrel 3 years ago

aofarrel commented 3 years ago

This issue only occurs on local runs. It does not affect Terra users, and is very unlikely to affect non-Terra HPC users. As it appears to stem from a limitation of Cromwell itself, it may be unfixable in this WDL and/or indicative of a Cromwell bug.

The Apparent Underlying Problem

When run locally, Cromwell completely ignores runtime arguments relating to resource management, including memory, and there is no way for the user to impose such limits as they could on the cloud. By default, it will also attempt to run scattered tasks simultaneously on a local machine. Together, these mean that Cromwell, especially but not only during scattered tasks, may hog too much memory on a local machine.

Effect 1: sigkills

This causes at least one issue: some instances of scattered tasks are quite likely to get sigkilled if more than about 6 chromosomes are run simultaneously (the number six being derived from a MacBook Pro, 16 GB, running Catalina). This can be detected by instances of a scattered task's rc file (as in return code) printing 137.
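As a rough way to check for this, the sketch below scans an executions directory for rc files containing 137 (128 + 9, the conventional exit status for a process ended by SIGKILL). It assumes Cromwell's default local output directory name, `cromwell-executions`, and that each call writes its return code to a file named `rc`; the function name `find_sigkilled_calls` is hypothetical and not part of this repository.

```python
from pathlib import Path

# 128 + 9: conventional exit status for a process ended by SIGKILL
SIGKILL_RC = 137


def find_sigkilled_calls(executions_root):
    """Return the directories of calls whose rc file contains 137.

    executions_root is assumed to be a Cromwell executions directory,
    for example "cromwell-executions" in the current working directory.
    """
    killed = []
    for rc_file in sorted(Path(executions_root).rglob("rc")):
        if not rc_file.is_file():
            continue
        try:
            code = int(rc_file.read_text().strip())
        except ValueError:
            continue  # empty or non-numeric rc file; skip it
        if code == SIGKILL_RC:
            killed.append(rc_file.parent)
    return killed
```

Calling `find_sigkilled_calls("cromwell-executions")` after a failed run should list the execution directories of any shards that were sigkilled, which makes it easier to see how many scattered instances were affected.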

What does this have to do with Docker?

Based on my testing, it appears that scattered tasks also cause issues with Docker, but I can't be sure. I have no idea how on Earth this is happening. All I know is that there seems to be a correlation with resource-heavy pipelines, as this is the heaviest one I've made and the only one to get this issue, and that the mitigation strategies that help resolve sigkills also seem to help prevent the Docker freezes. But unlike the sigkill effect, this isn't quite as simple a cause-and-effect, especially as a sigkill returns a pipeline failure rather than the freeze explained below.

Effect 2: System-wide Docker issues

This pipeline will occasionally freeze Docker across the system, and this freeze is persistent even after Cromwell has been stopped.

A full restart of Docker, which can be done within Docker Desktop, will resolve the freeze and allow for containers to be used again.

Mitigation Strategies

aofarrel commented 3 years ago

While testing on the ld_pruning branch, a Docker lockup happened again, this time in the ld_pruning task. This confirms that it isn't just one particular task that can cause this lockup. Interestingly, both are scattered tasks, but ld_pruning was only running on four input files this time around.

aednichols commented 3 years ago

When run locally, Cromwell completely ignores runtime arguments relating to resource management

Correct, the runtime attributes cpu, memory, disks are not supported on the Local backend: https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/

The Local backend is not recommended for anything besides small tests with 1-2 simultaneous tasks.

aednichols commented 3 years ago

And by

1-2 simultaneous tasks

I mean concurrent-job-limit = 1 as documented here: https://cromwell.readthedocs.io/en/stable/backends/Backends/
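For readers who want to try this setting, a minimal sketch of a Cromwell config file follows. It assumes the default backend named `Local` and relies on the settings merging over Cromwell's built-in defaults; the file name `local.conf` is hypothetical, and the linked Backends page is the authority on the exact layout.

```hocon
# local.conf (hypothetical file name)
# Pull in Cromwell's default configuration, then override one setting.
include required(classpath("application"))

backend {
  providers {
    Local {
      config {
        # Allow at most one job to run at a time on the Local backend
        concurrent-job-limit = 1
      }
    }
  }
}
```

Cromwell can then be pointed at the file with the `-Dconfig.file` Java system property, for example `java -Dconfig.file=local.conf -jar cromwell.jar run workflow.wdl`, where `cromwell.jar` and `workflow.wdl` stand in for the actual jar and workflow paths.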

aofarrel commented 3 years ago

Previously I was reluctant to close this as I had this issue occurring on non-scattered, non-simultaneous tasks and figured that the concurrent job limit may not be able to stop that. However, since changing that limit as @aednichols recommended I am delighted that I have not been able to replicate the issue for quite some time, so it's time to close this. 🥳

Thanks for the advice!

aednichols commented 3 years ago

Glad you got things working and thanks for the update!