Closed aofarrel closed 3 years ago
While testing on the ld_pruning branch, a Docker lockup happened again in the ld_pruning task. This confirms it isn't just one particular task that can cause this lockup. Interestingly both are scattered tasks, but ld_pruning was only running on four input files this time around.
When run locally, Cromwell completely ignores runtime arguments relating to resource management
Correct, the runtime attributes cpu, memory, and disks are not supported on the Local backend: https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/
The Local backend is not recommended for anything besides small tests with 1-2 simultaneous tasks.
And by "1-2 simultaneous tasks" I mean concurrent-job-limit = 1, as documented here: https://cromwell.readthedocs.io/en/stable/backends/Backends/
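For anyone landing here later, a minimal sketch of what that setting might look like in a Cromwell config override file. The file name local.conf is hypothetical, and this assumes the stock Local backend from Cromwell's default configuration is being used; check the Backends documentation linked above for your Cromwell version.

```hocon
# local.conf (hypothetical file name)
# Pull in Cromwell's default configuration, then override one value.
include required(classpath("application"))

backend {
  providers {
    Local {
      config {
        # Run at most one job at a time on the Local backend.
        concurrent-job-limit = 1
      }
    }
  }
}
```

The file would then be passed to Cromwell at launch with the -Dconfig.file Java system property, for example java -Dconfig.file=local.conf -jar cromwell.jar run workflow.wdl (the jar and workflow names here are placeholders).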
Previously I was reluctant to close this as I had this issue occurring on non-scattered, non-simultaneous tasks and figured that the concurrent job limit may not be able to stop that. However, since changing that limit as @aednichols recommended I am delighted that I have not been able to replicate the issue for quite some time, so it's time to close this. 🥳
Thanks for the advice!
Glad you got things working and thanks for the update!
This issue only occurs on local runs. It does not affect Terra users, and is very unlikely to affect non-Terra HPC users. As it appears to stem from a limitation of Cromwell itself, it may be unfixable in this WDL and/or indicative of a Cromwell bug.
The Apparent Underlying Problem
When run locally, Cromwell completely ignores runtime arguments relating to resource management, including memory, and there is no way for the user to force such limits as they could on the cloud. By default, it will also attempt to run scattered tasks simultaneously on a local machine. Together, these mean that Cromwell, especially but not only during scattered tasks, may hog too much memory on a local machine.
Effect 1: sigkills
This causes at least one issue: at least some instances of scattered tasks are quite likely to get sigkilled if more than about 6 chromosomes are run simultaneously (the number six being derived from a MacBook Pro, 16 GB, running Catalina). This can be detected when the rc file (as in return code) of a scattered task instance contains 137.
What does this have to do with Docker?
Based on my testing, it appears that scattered tasks also cause issues with Docker, but I can't be sure. I have no idea how on Earth this is happening; all I know is that there seems to be a correlation with resource-heavy pipelines, as this is the heaviest one I've made and the only one to hit this issue, and the mitigation strategies that help resolve sigkills also seem to help prevent the Docker freezes. But unlike the sigkill effect, this isn't quite as simple a cause-and-effect relationship, especially as a sigkill returns a pipeline failure rather than the freeze explained below.
Effect 2: System-wide Docker issues
This pipeline will occasionally freeze Docker across the system, and this freeze is persistent even after Cromwell has been stopped.
A full restart of Docker, which can be done within Docker Desktop, will resolve the freeze and allow for containers to be used again.
Mitigation Strategies
Check the rc files and see if they are getting sigkilled (137).
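Checking rc files by hand gets tedious with many shards, so here is a rough Python sketch that walks an execution directory and lists every rc file containing 137 (by shell convention 128 + 9, i.e. the process was ended by SIGKILL). The function name find_sigkilled is made up for illustration, and the sketch assumes each rc file holds nothing but the return code as text.

```python
from pathlib import Path

# 137 = 128 + 9: the conventional exit status for a process ended by SIGKILL.
SIGKILL_RC = 137


def find_sigkilled(root):
    """Return the sorted paths of all files named 'rc' under root that contain 137."""
    killed = []
    for rc_file in Path(root).rglob("rc"):
        if not rc_file.is_file():
            continue  # skip directories that happen to be named 'rc'
        # Assumption: the file contains only the return code, possibly with a newline.
        if rc_file.read_text().strip() == str(SIGKILL_RC):
            killed.append(rc_file)
    return sorted(killed)
```

Calling find_sigkilled("cromwell-executions") from the directory where Cromwell was launched should list the affected shards, assuming the default local execution root has not been changed.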