bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Bcbio jobs running with Cromwell are crashing due to high cluster load (memory? open files?) #2721

Closed amizeranschi closed 5 years ago

amizeranschi commented 5 years ago

Hello,

I'm getting intermittent errors when running tests with 100 small FASTQ files on an SGE cluster using Cromwell. The jobs are sometimes able to finish fine, but more often than not (4 out of the last 5 tries), they are crashing with errors such as the one below.

Several pages mention problems with "cluster load" and memory getting used up, but could this instead be a problem with hitting the limit on simultaneously open files, rather than memory?

[2019-03-11 09:58:22,49] [error] WorkflowManagerActor Workflow 0d35763d-1bd3-4d23-a7e8-59c099ca3d31 failed (during ExecutingWorkflowState): java.lang.RuntimeException: Unable to start job. Check the stderr file for possible errors: /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/0d35763d-1bd3-4d23-a7e8-59c099ca3d31/call-pipeline_summary/shard-430/execution/stderr.submit
    at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.$anonfun$execute$2(SharedFileSystemAsyncJobExecutionActor.scala:165)
    at scala.util.Either.fold(Either.scala:188)
    at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute(SharedFileSystemAsyncJobExecutionActor.scala:160)
    at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute$(SharedFileSystemAsyncJobExecutionActor.scala:155)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.execute(ConfigAsyncJobExecutionActor.scala:215)
    at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$executeAsync$1(StandardAsyncExecutionActor.scala:637)
    at scala.util.Try$.apply(Try.scala:209)
    at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync(StandardAsyncExecutionActor.scala:637)
    at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync$(StandardAsyncExecutionActor.scala:637)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeAsync(ConfigAsyncJobExecutionActor.scala:215)
    at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:952)
    at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:944)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeOrRecover(ConfigAsyncJobExecutionActor.scala:215)
    at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
    at cromwell.core.retry.Retry$.withRetry(Retry.scala:38)
    at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
    at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
    at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.aroundReceive(ConfigAsyncJobExecutionActor.scala:215)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
    at akka.actor.ActorCell.invoke(ActorCell.scala:557)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

The referenced stderr file contains the following:

$ cat /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/0d35763d-1bd3-4d23-a7e8-59c099ca3d31/call-pipeline_summary/shard-430/execution/stderr.submit
error: commlib error: got select error (Connection reset by peer)
Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error)..
Exiting.

I found a seemingly related error report (https://github.com/aws/aws-parallelcluster/issues/26) where someone mentioned that this could be a problem with the cluster load. A different page (https://www.t3.gsic.titech.ac.jp/en/node/176) confirms this:

    Unable to run job: failed receiving gdi request response for mid=1 (got syncron message receive timeout error). or Unable to run job: got no response from JSV script "/apps/t3/sles12sp2/uge/customize/jsv.pl".

Communication with the job scheduler may time out and the above error message may be displayed if the management node enters a state of high load due to a large number of jobs being submitted in a short time. The high load is temporary; please try again after waiting a while.

A bit more googling for this error turned up a report mentioning that the cause could be RAM getting used up on the master node: https://github.com/FCP-INDI/C-PAC/issues/13. The node that I'm using as the master node for these jobs has 128 GB of RAM, so it seems a bit surprising that scheduling 100 small samples could use up all of that memory.

My tests involve variant calling with 5 tools (GATK, FreeBayes, samtools, VarScan and Platypus). Given that the FASTQ files are small, perhaps the VC jobs are being scheduled faster than the master node can handle?

In case it's relevant, here is the output of ulimit -a from one of the compute nodes. Could the open files limit (1024) be the cause of these errors?

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 510853
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 16777216
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 510853
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
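
One quick way to check whether the 1024 limit is actually being hit would be to compare it against the number of file descriptors the Cromwell runner holds while it is submitting jobs. Below is a minimal, Linux-only Python sketch; the "self" placeholder is just for illustration and would be replaced with the actual Cromwell process ID (e.g. from pgrep -f cromwell).

import os
import resource

# Soft/hard limits on open file descriptors for this process; the soft limit
# corresponds to the "open files (-n)" value reported by `ulimit -a`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# On Linux, /proc/<pid>/fd lists the descriptors a process currently has open.
# "self" inspects this script's own process; substitute the Cromwell runner's
# PID to watch how close it gets to the limit during job submission.
pid = "self"
open_fds = len(os.listdir("/proc/%s/fd" % pid))

print("open file descriptors: %d (soft limit %d, hard limit %d)" % (open_fds, soft, hard))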
matthdsm commented 5 years ago

Hi, I've had a similar experience with Cromwell and some bigger jobs. In my case, I was trying to joint call about 3K exome gVCFs. Due to the splitting strategy I used, I hit the ulimit very quickly. I was able to resolve this with the HPC admin, though.

My suggestion would be to limit the number of jobs Cromwell can have running at a time. You can do this using the --joblimit param in bcbio_vm.py, as stated here.

Also keep in mind that Cromwell can be very memory hungry, especially when using an in-memory database for the jobs.

Hope this helps.
Cheers,
M

amizeranschi commented 5 years ago

@matthdsm thanks a lot for your reply.

Which particular ulimit were you hitting during your tests? I'm guessing it was the open files one; is that correct?

The reason I ask is that I expect this will become a problem for me as well when testing with regular-sized data (exome and whole-genome sequencing) instead of the small files I'm using now. I'll have to get in touch with the cluster admin, but I first need to figure out exactly what to ask for.

Thanks for mentioning the --joblimit option as well; I never noticed it when first going through that bit of the documentation.

And regarding the memory usage of Cromwell, are you by any chance referring to the cromwell_work/persist/metadata.lobs file? It takes up 36 GB of space for the 100 small FASTQ files that I was using. Does all of this need to stay in memory for each of the Cromwell jobs, or only for the controller? I have a second issue open about this (https://github.com/bcbio/bcbio-nextgen/issues/2722) and I was wondering if these problems were somehow related.

matthdsm commented 5 years ago

The ulimit for my account had to be raised to about 1M. This is quite high, but then again so was the amount of data I was working with. I don't think you'll need to go that extreme.

I also don't think the size of your data (big FASTQ vs small FASTQ) is going to impact the memory usage. Cromwell keeps references to your files in memory, not the files themselves. If I remember correctly, I also played around a bit with some of Cromwell's config options (symlinks vs hard links), but I'm not really sure; it's been a while.

In your case, I'd start by playing around with the --joblimit param and see how far that gets you. You could also consider splitting your data in half and analyzing your FASTQs 50 at a time, which should also help with the memory requirements.

Are you the only one using the cluster? Perhaps someone else was using it for a high-memory job and you just got in each other's way?

Cheers,
M

amizeranschi commented 5 years ago

The data I'm using consists of 50-100 reads per FASTQ file, so it's very unlikely that this is the cause of the memory errors. Also, as far as I've noticed, nobody else was using the cluster at the time I ran those tests.

I'll try the --joblimit option, but my guess is that the "open files" ulimit is the main problem here; I remember running into something similar in the past as well. Besides, this could also become problematic with other tools (e.g. Toil), which might not have an option like --joblimit. I'll ask the admin to increase the limit on open files.

Thanks again for your input.

chapmanb commented 5 years ago

Thanks for this discussion. Matthias is spot on with all his debugging, and I just wanted to mention that the FASTQ file size has nothing to do with resource usage here. The memory and submission requirements come from managing your jobs. Parallelization during variant calling can produce ~2500 individual submissions to keep track of: 100 samples x 5 variant callers x 5 regions. My guess is that this overwhelms the scheduler here, but it's hard to diagnose remotely as I'm not an expert on SGE scaling.
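
To make that arithmetic concrete (the figures below are just the example numbers from this run; the 5 regions is an approximate per-sample split, not a fixed value):

samples = 100   # FASTQ files in the test run
callers = 5     # GATK, FreeBayes, samtools, VarScan, Platypus
regions = 5     # approximate parallel region blocks per sample

submissions = samples * callers * regions
print(submissions)  # 2500 individual cluster jobs for the scheduler to track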

My practical suggestions would be:

Hope this helps.

amizeranschi commented 5 years ago

OK, thanks for clarifying the situation with the memory and submission requirements.

I reverted the setting back to "-Xmx3g" in

https://github.com/bcbio/bcbio-nextgen/blob/62b1b79bb0145c44b8aca6fbe242cd0882685efb/bcbio/cwl/tool.py#L205

because I assumed you weren't going to take this into account anymore (see: https://github.com/bcbio/bcbio-nextgen/issues/2697#issuecomment-471885048).

I then added the option --joblimit 20 to the bcbio_vm.py cwlrun command and ran two successive tests. One managed to finish, but the second failed due to the same errors (java.lang.OutOfMemoryError: Java heap space and java.lang.OutOfMemoryError: GC overhead limit exceeded) as the ones mentioned in https://github.com/bcbio/bcbio-nextgen/issues/2697#issuecomment-471885048.

Would it still be possible to change "-Xmx3g" to something like num_cores * memory (according to the values from bcbio_system.yaml) in the file bcbio/cwl/tool.py, as mentioned above?
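
For illustration, this is roughly the calculation I have in mind, assuming a bcbio_system.yaml with a resources: default: section containing cores and memory values; the helper below is a hypothetical sketch, not existing bcbio code.

import yaml

def runner_heap_from_system_config(path="bcbio_system.yaml"):
    """Hypothetical helper: derive a -Xmx value as num_cores * per-core memory."""
    with open(path) as in_handle:
        config = yaml.safe_load(in_handle)
    default = config["resources"]["default"]
    cores = int(default["cores"])
    # per-core memory is typically given as a string like "3G"
    per_core_gb = float(str(default["memory"]).rstrip("Gg"))
    return "-Xmx%dg" % int(cores * per_core_gb)

# e.g. cores: 16, memory: 3G  ->  "-Xmx48g"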

Sorry for the mess with the multiple issue reports.

chapmanb commented 5 years ago

Thanks again for all this discussion and debugging. I agree that a memory adjustment based on the number of running samples will help. Apologies that it took a while to find time to push this change.

I've added an update that tries to pick Cromwell's memory usage based on the expected parallelization of samples and variant calling methods. The specifications in bcbio_system.yaml are really meant for running the samples themselves, so they aren't directly relevant to the runner, but I tried to make a reasonable estimate by default so we don't need more configuration parameters.

Hope this automatically resolves the memory problems and lets you push forward with your analyses. Thanks again.

amizeranschi commented 5 years ago

Thanks a lot for adding the _estimate_runner_memory() method. I ran a bcbio upgrade with -u development --tools --cwl in order to get that modification, and this somehow broke bcbio_vm.py cwl such that the CWL workflow isn't being generated:

[2019-03-14T09:23Z] INFO: Using input YAML configuration: ../config/testingVC-merged.yaml
[2019-03-14T09:23Z] INFO: Checking sample YAML configuration: ../config/testingVC-merged.yaml
Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_vm.py", line 354, in <module>
    args.func(args)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/main.py", line 12, in run
    create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 40, in from_world
    prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 422, in prep_cwl
    variables, keyvals = _flatten_samples(samples, out_file, get_retriever)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 493, in _flatten_samples
    for key in sorted(list(set(reduce(operator.add, [d.keys() for d in flat_data])))):
chapmanb commented 5 years ago

Sorry about the additional problems on upgrade. Would you be able to post the full traceback? It looks like it got cut off here and I'm not entirely sure about the root cause of the issue. Thanks much.

amizeranschi commented 5 years ago

Yes, sorry about that, I missed the last and most important line:

Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_vm.py", line 354, in <module>
    args.func(args)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/main.py", line 12, in run
    create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 40, in from_world
    prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 422, in prep_cwl
    variables, keyvals = _flatten_samples(samples, out_file, get_retriever)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 493, in _flatten_samples
    for key in sorted(list(set(reduce(operator.add, [d.keys() for d in flat_data])))):
TypeError: unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'
amizeranschi commented 5 years ago

Note: I'm getting the same error with the minimal CWL tests as well.

$ cd /export/home/ncit/external/a.mizeranschi/test_bcbio_cwl/somatic
$ bash run_generate_cwl.sh
[2019-03-15T07:41Z] INFO: Using input YAML configuration: somatic.yaml
[2019-03-15T07:41Z] INFO: Checking sample YAML configuration: somatic.yaml
Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_vm.py", line 354, in <module>
    args.func(args)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/main.py", line 12, in run
    create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 40, in from_world
    prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 422, in prep_cwl
    variables, keyvals = _flatten_samples(samples, out_file, get_retriever)
  File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 493, in _flatten_samples
    for key in sorted(list(set(reduce(operator.add, [d.keys() for d in flat_data])))):
TypeError: unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'
chapmanb commented 5 years ago

Thanks for the full report. This was a Python 3 issue that should be fixed:

https://github.com/bcbio/bcbio-nextgen/commit/1cbde3785202d1a6ec245c007b6393fd5cc443a5

so I'm not sure why you're getting an older version of the code without it. The automatic update should handle this, and I pushed a new bcbio-nextgen conda package just in case, but if it doesn't, you can do:

bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen

to get the latest code with this fix in place.
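
For context, the root cause is that in Python 3 dict.keys() returns a view object that doesn't support +, so reduce(operator.add, [d.keys() for d in flat_data]) raises exactly this TypeError. A minimal sketch of the failure and the kind of change needed (not necessarily the exact code in that commit):

from functools import reduce
import operator

flat_data = [{"files": 1, "description": "s1"}, {"reference": 2}]

# Python 2: dict.keys() returned lists, so operator.add concatenated them.
# Python 3: dict.keys() returns view objects, which do not support "+".
try:
    reduce(operator.add, [d.keys() for d in flat_data])
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'

# Fix: materialize the views as lists before concatenating.
keys = sorted(set(reduce(operator.add, [list(d.keys()) for d in flat_data])))
print(keys)  # ['description', 'files', 'reference']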

Thank you again for all your patience while debugging. I hope this will get things going cleanly for you.

amizeranschi commented 5 years ago

I'm afraid I still get those errors (TypeError: unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'), even after upgrading and manually running bcbio_conda install:

$ bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen
Collecting package metadata: done
Solving environment: done

# All requested packages already installed.
$ bcbio_conda list | grep bcbio-nextgen
bcbio-nextgen             1.1.4a0                  pypi_0    pypi
$ bcbio_conda list -n bcbiovm | grep bcbio-nextgen
bcbio-nextgen             1.1.4a                   py36_0    bioconda
bcbio-nextgen-vm          0.1.5                    py36_0    bioconda
chapmanb commented 5 years ago

Sorry about this. I'm not sure why you're not getting the bcbio-nextgen upgrade, but you can force it with:

bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen --no-deps --force-reinstall

which should get you the py36_3 build, which has the fix. Hope this gets it working for you.

amizeranschi commented 5 years ago

I have no idea why, but that didn't do it either:

$ bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen --no-deps --force-reinstall
Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm

  added / updated specs:
    - bcbio-nextgen

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
$ bcbio_conda list -n bcbiovm | grep bcbio-nextgen
bcbio-nextgen             1.1.4a                   py36_0    bioconda
bcbio-nextgen-vm          0.1.5                    py36_0    bioconda
$ which bcbio_conda
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_conda
amizeranschi commented 5 years ago

Could this be due to pinned versions of some dependencies? I was having a look at the following page, but I'm a bit wary of running conda with the --no-pin option: https://stackoverflow.com/questions/50381335/how-to-force-conda-to-upgrade-all-packages-regardless-of-dependency-graph?answertab=active#tab-top.

chapmanb commented 5 years ago

Thanks for testing, and sorry this is still stuck. It's definitely some kind of dependency issue, likely due to the transition to Python 3. Apologies; I'm hoping to have the next release out soon and get through all this dependency churn. There are a lot of parts being upgraded right now, and things should be more stable after this.

A better way to go is to specify the exact version and build we want:

bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen=1.1.4a0=py36_3

which will probably identify which packages are blocking it in your environment. I tested in a couple of places and dxpy, which isn't Python 3 compatible, was doing the blocking, but removing it is okay right now.

Hope this gets you a clean upgrade and things working.

amizeranschi commented 5 years ago

Good catch, dxpy was the culprit on my system as well. Thanks again for your help; this upgrade got bcbio_vm.py cwl working again.

$ bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen=1.1.4a0=py36_3
Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm

  added / updated specs:
    - bcbio-nextgen==1.1.4a0=py36_3

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    bcbio-nextgen-1.1.4a       |           py36_3         2.1 MB  bioconda
    cryptography-2.6.1         |   py36h9d9f1b6_0         607 KB  conda-forge
    pycurl-7.43.0.2            |   py36h1ba5d50_0         185 KB
    ------------------------------------------------------------
                                           Total:         2.9 MB

The following packages will be REMOVED:

  dxpy-0.273.0-py36_0

The following packages will be UPDATED:

  bcbio-nextgen                               1.1.4a-py36_0 --> 1.1.4a-py36_3
  cryptography                         2.2.1-py36hdffb7b8_1 --> 2.6.1-py36h9d9f1b6_0
  curl                                    7.64.0-h646f8bb_0 --> 7.64.0-h646f8bb_2
  krb5                                 1.16.3-hc83ff2d_1000 --> 1.16.3-h05b26f9_1001
  libcurl                                 7.64.0-h01ee5af_0 --> 7.64.0-h541490c_2
  libssh2                               1.8.0-h1ad7b7a_1003 --> 1.8.0-h90d6eec_1004
  openssl                                 1.0.2r-h14c3975_0 --> 1.1.1b-h14c3975_1
  python                                3.6.7-hd21baee_1002 --> 3.6.7-h381d211_1004

The following packages will be DOWNGRADED:

  pycurl                            7.43.0.2-py36hb7f436b_0 --> 7.43.0.2-py36h1ba5d50_0

Proceed ([y]/n)? 

Downloading and Extracting Packages
pycurl-7.43.0.2      | 185 KB    | ################################################################################################################################################### | 100% 
bcbio-nextgen-1.1.4a | 2.1 MB    | ################################################################################################################################################### | 100% 
cryptography-2.6.1   | 607 KB    | ################################################################################################################################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done