BD2KGenomics / toil-rnaseq

UC Santa Cruz Computational Genomics Lab's Toil-based RNA-seq pipeline
Apache License 2.0

Docker version locking up on stacker and openstack #115

Closed rcurrie closed 6 years ago

rcurrie commented 7 years ago

I'm having trouble with the Docker version of the pipeline failing to complete, in different ways, on samples that I ran a month ago with no problems. The only difference, I suspect, is that the version of Docker has been updated on the host machines. One was stacker, so plenty of storage, memory and cores. The other was an openstack machine, but still with plenty of storage (2TB), memory (120G) and cores (15). Log files from each are attached.

Openstack: Docker version 17.07.0-ce, build 8784753
Stacker: Docker version 17.03.0-ce, build 60ccb22
RNASeq Docker: quay.io/ucsc_cgl/rnaseq-cgl-pipeline:3.3.4-1.12.3

@jpfeil has also had some issues, including an error on initial startup involving JSON problems (see trace below), which I hit as well, but restarting cleared it. That also leads me to believe there is some sort of issue with the host Docker, and maybe with the format of the messages it sends back.

[10.50.101.175] out: 61799e64ef6c: Pull complete
[10.50.101.175] out: Digest: sha256:785eee9f750ab91078d84d1ee779b6f74717eafc09e49da817af6b87619b0756
[10.50.101.175] out: Status: Downloaded newer image for quay.io/ucsc_cgl/rnaseq-cgl-pipeline:3.3.4-1.12.3
[10.50.101.175] out: json: cannot unmarshal array into Go value of type types.ContainerJSON
[10.50.101.175] out: Traceback (most recent call last):
[10.50.101.175] out:   File "/opt/rnaseq-pipeline/wrapper.py", line 304, in <module>
[10.50.101.175] out:     main()
[10.50.101.175] out:   File "/opt/rnaseq-pipeline/wrapper.py", line 271, in main
[10.50.101.175] out:     blob = json.loads(subprocess.check_output(['docker', 'inspect', name]))
[10.50.101.175] out:   File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
[10.50.101.175] out:     raise CalledProcessError(retcode, cmd, output=output)
[10.50.101.175] out: subprocess.CalledProcessError: Command '['docker', 'inspect', '']' returned non-zero exit status 1
[10.50.101.175] out: Makefile:30: recipe for target 'expression' failed
[10.50.101.175] out: make: *** [expression] Error 1

Attachments: log.openstack.txt log.stacker.txt
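Editor's sketch: the trace shows `docker inspect` being invoked with an empty container name, so the `CalledProcessError` is a symptom of whatever left `name` empty rather than the root cause. The following is a minimal, hedged illustration of a more defensive version of that call. `inspect_container` and its `runner` parameter are hypothetical names chosen for this example; they are not taken from `wrapper.py`.

```python
import json
import subprocess


def inspect_container(name, runner=subprocess.check_output):
    """Return the parsed `docker inspect` record for a single container.

    `runner` is injectable (hypothetical parameter) so the function can be
    exercised without a Docker daemon.
    """
    if not name:
        # The failing call in the trace passed '' as the container name.
        raise ValueError("container name is empty; refusing to run 'docker inspect'")
    try:
        raw = runner(["docker", "inspect", name])
    except subprocess.CalledProcessError as err:
        raise RuntimeError(
            "docker inspect failed for %r (exit status %d)" % (name, err.returncode)
        ) from err
    records = json.loads(raw)
    # `docker inspect` prints a JSON array with one object per argument.
    if not isinstance(records, list) or len(records) != 1:
        raise RuntimeError("unexpected docker inspect output for %r" % name)
    return records[0]
```

Failing early on an empty name makes the log point at the step that produced no container ID instead of at the JSON parsing.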

rcurrie commented 7 years ago

I'm going to try another sample, as well as take bamqc out of the picture, and report back. If anyone sees something in the logs that rings a bell, let me know.

rcurrie commented 7 years ago

I think I've narrowed it down to some new interaction with bamqc - without it the samples complete, with it things lock up waiting on the cpu to be freed up. Will run another test and report back with more specifics:

6bba815f952e 2017-09-19 16:54:52,264 Thread-17 DEBUG toil.batchSystems.singleMachine: Acquiring 2147483648 bytes of memory from a pool of 127739600896.
6bba815f952e 2017-09-19 16:54:52,264 Thread-17 DEBUG toil.batchSystems.singleMachine: Acquiring 150 fractional cores from a pool of 110 to satisfy a request of 15.000000 cores
6bba815f952e 2017-09-19 16:54:52,730 Thread-11 DEBUG toil.batchSystems.singleMachine: Could not acquire enough (cores) to run job. Requested: (150), Avaliable: 110. Sleeping

rcurrie commented 7 years ago

Turning off bamqc allowed two samples to complete. Re-running all 4 on clean openstack machines to verify that they all complete. Maybe these new docker versions are having some strange interaction with toil's calling sub-dockers?

rcurrie commented 7 years ago

@walt @jvivian have you run with either of these docker versions lately? With bamqc?

jvivian commented 6 years ago

@rcurrie — I haven't had any issues, but I've only been running the Python version.

rcurrie commented 6 years ago

In that mode is the only ‘docker’ involved bamqc? i.e. you’re not running inside a container and then asking for a sister container?

On Sep 20, 2017, at 11:18 AM, John Vivian notifications@github.com wrote:

@rcurrie — I haven't had any issues, but I've only been running the Python version.


jvivian commented 6 years ago

@rcurrie — Correct, I'm invoking the pipeline via toil-rnaseq run ....

I have an openstack node up now, let me run a sample through with bamQC turned on. I'll let you know if I hit any problems.

jvivian commented 6 years ago

@rcurrie — Test sample went through just fine. I'll go ahead and run a large sample as well.

rcurrie commented 6 years ago

Check what version of docker you are running. I'm suspecting more and more that something changed, based on the JSON error (the daemon returns JSON to the caller), and that the call to the daemon to launch a parent, and the wait for it to complete, may be the culprit.

On Sep 20, 2017, at 11:53 AM, John Vivian notifications@github.com wrote:

@rcurrie — Test sample went through just fine. I'll go ahead and run a large sample as well.


jvivian commented 6 years ago

Docker version 1.9.1, build a34a1d5. The Docker version of the pipeline you're pulling down is for 1.12.3, which has to match the host version.

rcurrie commented 6 years ago

Hmmm….1.9.1 is ‘vintage’, 1.12.3 is vintage as well. I’ll try and downgrade to 1.12.3 and see if that fixes it.

That said, most external people will end up using the latest Docker, which now comes in this whole new community vs. enterprise split. This may really be a general issue in Toil land manifesting in your pipeline. If so, then I can move this all over to a Toil issue.

On Sep 20, 2017, at 12:07 PM, John Vivian notifications@github.com wrote:

Docker version 1.9.1, build a34a1d5. The Docker version of the pipeline you're pulling down is for 1.12.3, which has to match the host version.


jvivian commented 6 years ago

> Hmmm….1.9.1 is ‘vintage’, 1.12.3 is vintage as well. I’ll try and downgrade to 1.12.3 and see if that fixes it.

The Docker containers have always required that the host version of Docker matches the Docker version associated with the container (part of the tag), otherwise there's a client / server mismatch when mounting the docker socket. It's one of the many reasons I don't recommend the Dockerized version of the pipeline.

> That said, most external people will end up using the latest Docker, which now comes in this whole new community vs. enterprise split. This may really be a general issue in Toil land manifesting in your pipeline. If so, then I can move this all over to a Toil issue.

If you run the Python version of the pipeline, you should be able to use any Docker version unless it's ancient (like pre 1.3 or something).
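Editor's sketch: the image tag in this thread (`3.3.4-1.12.3`) carries the Docker version after the hyphen, so the match described above can be checked before launching anything. The helper names below are hypothetical and only illustrate the comparison. The `exact=False` mode compares major and minor only; it is included because a host at 1.12.6-cs8 is later reported in this thread to work with the 1.12.3 image, which is an observation from this thread and not a documented rule.

```python
IMAGE = "quay.io/ucsc_cgl/rnaseq-cgl-pipeline:3.3.4-1.12.3"


def docker_version_from_tag(image):
    """Extract the Docker version suffix from a tag like '3.3.4-1.12.3'."""
    _repo, colon, tag = image.rpartition(":")
    if not colon:
        raise ValueError("image %r has no tag" % image)
    _pipeline_version, sep, docker_version = tag.partition("-")
    if not sep or not docker_version:
        raise ValueError("tag %r has no '-<docker version>' suffix" % tag)
    return docker_version


def release_line(version):
    """'1.12.6-cs8' -> ('1', '12'): major and minor components only."""
    return tuple(version.strip().split(".")[:2])


def host_matches_image(host_version, image, exact=True):
    """Compare the host Docker version with the version named in the tag."""
    wanted = docker_version_from_tag(image)
    if exact:
        return host_version.strip() == wanted
    return release_line(host_version) == release_line(wanted)


print(host_matches_image("17.07.0-ce", IMAGE))               # False
print(host_matches_image("1.12.6-cs8", IMAGE, exact=False))  # True
```

On a real host, the version string could be obtained with `docker version --format '{{.Server.Version}}'` and passed in as `host_version`.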

jvivian commented 6 years ago

@rcurrie — Did downgrading solve it or is it still locking up?

It's definitely hanging on bamQC which defaults to a max of 4 cores — when it hangs are you able to see if any processes or Docker containers are running?

rcurrie commented 6 years ago

First I tried re-running with bamqc turned off, and unfortunately 2 of the 4 samples that previously ran fine now fail RNASeq. I’ll downgrade (I’m using docker-machine so it's a bit tricky), re-run over the weekend and report back.

On Sep 21, 2017, at 7:25 PM, John Vivian notifications@github.com wrote:

@rcurrie — Did downgrading solve it or is it still locking up?

It's definitely hanging on bamQC which defaults to a max of 4 cores — when it hangs are you able to see if any processes or Docker containers are running?

When it hangs, the bamqc docker as well as the rnaseq docker are both running, with the latter stuck trying to allocate additional cpu - which is odd, as in this case it's 10+ hours after bamqc started and presumably all the other rnaseq tasks (star, rsem, kallisto etc.) will have completed. But there is no expression output at all, so it could be that the other tasks are stuck (net of star, which is required before bamqc) and that is the source of the deadlock.


jvivian commented 6 years ago

> When it hangs, the bamqc docker as well as the rnaseq docker are both running, with the latter stuck trying to allocate additional cpu - which is odd, as in this case it's 10+ hours after bamqc started and presumably all the other rnaseq tasks (star, rsem, kallisto etc.) will have completed.

BamQC gets scheduled at the same time as RSEM since it depends on STAR. BamQC hangs using 4 cores, but Kallisto or RSEM won't run unless they get all the cores on the box so it just stalls waiting for the cores to free up.
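Editor's sketch: the numbers in the earlier log line ("Acquiring 150 fractional cores from a pool of 110") are consistent with this explanation. Assuming 10 fractional units per core (inferred from the log, where a 15-core request becomes 150), a 15-core node with BamQC holding 4 cores has 150 - 40 = 110 units free, which is less than the 150 a whole-node job requests. The toy pool below is a hypothetical stand-in written for illustration; it is not Toil's implementation.

```python
SCALE = 10  # fractional units per core, inferred from the log (15 cores -> 150)


class CorePool:
    """Toy strict core pool: a request is granted in full or not at all."""

    def __init__(self, total_cores):
        self.free = int(total_cores * SCALE)

    def try_acquire(self, cores):
        units = int(cores * SCALE)
        if units > self.free:
            return False  # the caller would sleep and retry
        self.free -= units
        return True

    def release(self, cores):
        self.free += int(cores * SCALE)


pool = CorePool(total_cores=15)
bamqc_started = pool.try_acquire(4)   # BamQC takes 4 cores and then hangs
rsem_started = pool.try_acquire(15)   # RSEM asks for the whole box
print(bamqc_started, rsem_started, pool.free)  # True False 110
```

As long as the hung job never calls `release`, the whole-node request can never be satisfied, which matches the "Sleeping" message repeating in the log.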

> I’ll downgrade (I’m using docker-machine so it's a bit tricky), re-run over the weekend and report back.

Alternatively, you could try pip install toil-rnaseq==3.3.4 which will give identical results and ideally won't stall.

rcurrie commented 6 years ago

Downgrading to Docker 1.12.6-cs8 (used a .deb) allowed the two that failed completely even with bamqc off to complete. I'm now going to run all 4 with bamqc, will report back. Net of that the allocation you mention seems un-workable: Kallisto and RSEM want all the cores, but BamQC takes 4 - doesn't that mean Total - 4 will be idling? Is there any way to convince Kallisto and RSEM to just take N-4? (There's 15 cores on this openstack instance apparently). Seems like this is an issue regardless of Docker, no?

jvivian commented 6 years ago

> Is there any way to convince Kallisto and RSEM to just take N-4?

If BamQC is running, the core requirements for Kallisto and RSEM could be adjusted to use N-4, but this is only optimal for the use case of running a single sample on a single node with BamQC on (it's off by default). Toil pipelines are designed to scale: e.g., if I have 10 machines and 100 samples running, there aren't going to be many idle nodes because Mesos will greedily fill nodes based on resource requirements.
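Editor's sketch: the N-4 adjustment discussed here amounts to a small calculation. The function below is hypothetical and is not part of toil-rnaseq; the default of 4 comes from the statement earlier in this thread that BamQC defaults to a max of 4 cores.

```python
def cores_for_full_node_job(node_cores, bamqc_enabled, bamqc_cores=4):
    """Cores to request for RSEM or Kallisto so they can run beside BamQC."""
    if not bamqc_enabled:
        return node_cores
    remaining = node_cores - bamqc_cores
    if remaining < 1:
        raise ValueError("node too small to run BamQC alongside another job")
    return remaining


print(cores_for_full_node_job(15, bamqc_enabled=True))   # 11
print(cores_for_full_node_job(15, bamqc_enabled=False))  # 15
```

This only helps in the single-sample, single-node case described above; on a shared cluster the scheduler fills leftover cores with other jobs anyway.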

> Net of that the allocation you mention seems un-workable

Unfortunately, this is an issue with any batch system — Hannes and I tried to make a "less strict" core requirement heuristic for single machine so, for example, a 15 core job would start if any cores on the machine were free, but this creates a system which intentionally disobeys how typical batch systems are run and ends up with the problems reported in that thread. If you think of any workaround let me know!

Closing thread. If the samples don't complete or you run into other issues please reopen.

rcurrie commented 6 years ago

Followup: All the samples that were failing ran to completion with bamqc under docker v1.12.6-cs8.