amizeranschi closed this issue 5 years ago
Hi,
I've had a similar experience with cromwell and some bigger jobs. In my case, I was trying to joint call about 3K exome gvcf's. Due to the splitting strategy I used I hit the ulimit
very fast. I was able to resolve this with the HPC admin though.
My suggestion for you would be to limit the number of jobs cromwell can have running at a time. You can do this using the --joblimit param in bcbio_vm.py, as stated here.
Also keep in mind cromwell can be very memory hungry, especially when using an in-memory db for the jobs.
Hope this helps.
Cheers,
M
@matthdsm thanks a lot for your reply.
Which particular ulimit were you triggering during your tests? I'm guessing the open files one, is this correct?
The reason I ask is that I'm guessing this is going to become a problem for me as well when testing with regular-sized data (exome- and whole-genome-seq) instead of the small files that I'm now using for testing. I'll have to get in touch with the cluster admin, but I first need to figure out what to ask for, specifically.
Thanks for mentioning the --joblimit option as well; I never noticed it when first going through that bit of the documentation.
And regarding the memory usage of Cromwell, by any chance are you referring to the cromwell_work/persist/metadata.lobs file? It takes up 36GB of space for the small 100 FASTQ files that I was using. Does all of this need to stay in memory for each of the Cromwell jobs, or only for the controller? I have a second issue open about this (https://github.com/bcbio/bcbio-nextgen/issues/2722) and I was wondering if these problems were somehow related.
The ulimit for my account had to be raised to about 1M. This is quite high, but then again so was the amount of data I was working with. I don't think you're going to have to go so extreme.
I also don't think the size of your data (big fastq vs small fastq) is going to impact the memory usage. Cromwell keeps references to your files in memory, not the files themselves. If I remember correctly, I also played around a bit with some of the config options for cromwell (symlinks vs hard links), but I'm not really sure, it's been a while.
In your case, I'd start by playing around with the joblimit param and see how far that gets you. You can also consider splitting your data in half and analyzing your fq's 50 at a time or something. That should also help with the memory requirements.
Are you the only one using the cluster? Perhaps someone else was using it for a high mem job and you just got in each other's way?
Cheers,
M
The data I'm using consists of 50-100 reads per FASTQ file, so it's very unlikely that this is the cause for the memory errors. Also, as far as I've noticed, nobody else was using the cluster at the time when I ran those tests.
I'll try the --joblimit option, but my guess is that the "open files" ulimit configuration is the main problem here, and I remember I've run into something similar in the past as well. Besides, this could also become problematic with other tools (e.g. Toil), which might not have options such as --joblimit. I'll ask the admin to increase the limit for open files.
Thanks again for your input.
Thanks for this discussion. Matthias is spot on with all his debugging, and I just wanted to mention that the fastq file size has nothing to do with resource usage here. The memory and submission requirements are due to managing your jobs. Parallelization during variant calling can have ~2500 individual submissions to keep track of: 100 samples x 5 variant callers x 5 regions. My guess is that this overwhelms the scheduler here, but it's hard to diagnose remotely as I'm not an expert on SGE scaling.
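To make that back-of-the-envelope estimate concrete, here it is written out with the example figures from this thread (the per-sample region count is the illustrative number used above, not a fixed bcbio setting):

```python
# Rough count of how many individual job submissions the Cromwell runner
# has to keep track of during parallelized variant calling.
samples = 100
variant_callers = 5   # e.g. GATK, FreeBayes, samtools, VarScan, Platypus
regions = 5           # parallelization blocks per sample (example figure)

submissions = samples * variant_callers * regions
print(submissions)  # 2500 jobs tracked simultaneously by the runner
```

Each of those submissions costs the runner memory and scheduler bookkeeping even when the underlying FASTQ files are tiny, which is why the input size itself is not the driver here.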
My practical suggestions would be:
Hope this helps.
OK, thanks for clarifying the situation with the memory and submission requirements.
I changed back the setting "-Xmx3g" because I assumed that you're not going to take this into account anymore (see: https://github.com/bcbio/bcbio-nextgen/issues/2697#issuecomment-471885048).
I then added the option --joblimit 20 to the bcbio_vm.py cwlrun command and ran two successive tests. One managed to finish, but the second failed due to the same errors (java.lang.OutOfMemoryError: Java heap space and java.lang.OutOfMemoryError: GC overhead limit exceeded) as the ones mentioned in https://github.com/bcbio/bcbio-nextgen/issues/2697#issuecomment-471885048.
Would it still be possible to change "-Xmx3g" to something like num_cores * memory (according to the values from bcbio_system.yaml) in the file bcbio/cwl/tool.py, as mentioned above?
Sorry for the mess with the multiple issue reports.
Thanks again for all this discussion and debugging. I agree that the memory adjustment based on the number of running samples will help. Apologies that it took a while to have time to push this change.
I've added an update that tries to pick Cromwell memory usage based on the expected parallelization of samples and variant calling methods. The specifications in bcbio_system.yaml are really meant for running the samples themselves, so they aren't directly relevant to the runner, but I tried to make a reasonable estimate by default so we don't need more configuration parameters.
Hope this helps automatically resolve the memory problems and pushes forward with your analyses. Thanks again.
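For readers curious what such an estimate might look like, here is a hypothetical sketch. The function name, constants, and formula below are illustrative assumptions only, not bcbio's actual implementation:

```python
def estimate_runner_memory(num_samples, num_callers,
                           regions_per_sample=5, base_gb=3, per_job_mb=2):
    """Scale the Cromwell runner's Java heap with the expected number of
    concurrent job submissions it must track.

    All constants here (5 region blocks per sample, 3 GB base heap,
    2 MB of tracking overhead per job) are illustrative assumptions.
    """
    expected_jobs = num_samples * num_callers * regions_per_sample
    extra_gb = (expected_jobs * per_job_mb) // 1024
    return "-Xmx%dg" % (base_gb + extra_gb)

# With the example workload from this thread (100 samples, 5 callers),
# the sketch bumps the default 3 GB heap up to account for ~2500 jobs.
print(estimate_runner_memory(100, 5))
```

The point of tying the heap to sample and caller counts, rather than to bcbio_system.yaml, is that the yaml values describe per-sample tool resources, while the runner's footprint grows with how many jobs it juggles at once.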
Thanks a lot for adding the _estimate_runner_memory() method. I ran a bcbio upgrade with -u development --tools --cwl in order to get that modification, and this somehow broke bcbio_vm.py cwl such that the CWL workflow isn't being generated:
[2019-03-14T09:23Z] INFO: Using input YAML configuration: ../config/testingVC-merged.yaml
[2019-03-14T09:23Z] INFO: Checking sample YAML configuration: ../config/testingVC-merged.yaml
Traceback (most recent call last):
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_vm.py", line 354, in <module>
args.func(args)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/main.py", line 12, in run
create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 40, in from_world
prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 422, in prep_cwl
variables, keyvals = _flatten_samples(samples, out_file, get_retriever)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 493, in _flatten_samples
for key in sorted(list(set(reduce(operator.add, [d.keys() for d in flat_data])))):
Sorry about the additional problems on upgrade. Would you be able to post the full traceback? It looks like it got cut off here and I'm not entirely sure about the root cause of the issue. Thanks much.
Yes, sorry about that, I missed the last and most important line:
Traceback (most recent call last):
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_vm.py", line 354, in <module>
args.func(args)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/main.py", line 12, in run
create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 40, in from_world
prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 422, in prep_cwl
variables, keyvals = _flatten_samples(samples, out_file, get_retriever)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 493, in _flatten_samples
for key in sorted(list(set(reduce(operator.add, [d.keys() for d in flat_data])))):
TypeError: unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'
Note: I'm getting the same error with the minimal CWL tests as well.
$ cd /export/home/ncit/external/a.mizeranschi/test_bcbio_cwl/somatic
$ bash run_generate_cwl.sh
[2019-03-15T07:41Z] INFO: Using input YAML configuration: somatic.yaml
[2019-03-15T07:41Z] INFO: Checking sample YAML configuration: somatic.yaml
Traceback (most recent call last):
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_vm.py", line 354, in <module>
args.func(args)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/main.py", line 12, in run
create.from_world(world, run_info_yaml, integrations=integrations, add_container_tag=args.add_container_tag)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 40, in from_world
prep_cwl(samples, workflow_fn, out_dir, out_file, integrations, add_container_tag=add_container_tag)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 422, in prep_cwl
variables, keyvals = _flatten_samples(samples, out_file, get_retriever)
File "/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm/lib/python3.6/site-packages/bcbio/cwl/create.py", line 493, in _flatten_samples
for key in sorted(list(set(reduce(operator.add, [d.keys() for d in flat_data])))):
TypeError: unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'
Thanks for the full report. This was a python 3 issue that should be fixed:
https://github.com/bcbio/bcbio-nextgen/commit/1cbde3785202d1a6ec245c007b6393fd5cc443a5
so I'm not sure why you're getting an older version of the code without this. The automatic update should handle this and I pushed a new bcbio-nextgen conda package just in case, but if it doesn't you can do:
bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen
to get the latest code with this fix in place.
Thank you again for all your patience debugging. I hope this will get things going cleanly for you.
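For reference, the root of that traceback is that dict.keys() returns a view object in Python 3, and views don't support the + operator that Python 2's key lists did. A minimal reproduction, with the fix presumably being along the lines of wrapping each view in list():

```python
import operator
from functools import reduce

flat_data = [{"a": 1}, {"b": 2}]

# Python 3: dict.keys() returns a dict_keys view, and views can't be
# concatenated with "+", so this reproduces the reported TypeError.
try:
    reduce(operator.add, [d.keys() for d in flat_data])
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'

# Converting the views to lists first restores the Python 2 behaviour.
keys = sorted(set(reduce(operator.add, [list(d.keys()) for d in flat_data])))
print(keys)  # ['a', 'b']
```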
I'm afraid I still get those errors (TypeError: unsupported operand type(s) for +: 'dict_keys' and 'dict_keys'), even after upgrading and manually running bcbio_conda install:
$ bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen
Collecting package metadata: done
Solving environment: done
# All requested packages already installed.
$ bcbio_conda list | grep bcbio-nextgen
bcbio-nextgen 1.1.4a0 pypi_0 pypi
$ bcbio_conda list -n bcbiovm | grep bcbio-nextgen
bcbio-nextgen 1.1.4a py36_0 bioconda
bcbio-nextgen-vm 0.1.5 py36_0 bioconda
Sorry about this. I'm not sure why you're not getting the bcbio-nextgen upgrade but you can force with:
bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen --no-deps --force-reinstall
which should get you py36_3 that will have the fix. Hope this gets it working for you.
I have no idea why, but that didn't do it either:
$ bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen --no-deps --force-reinstall
Collecting package metadata: done
Solving environment: done
## Package Plan ##
environment location: /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm
added / updated specs:
- bcbio-nextgen
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
$ bcbio_conda list -n bcbiovm | grep bcbio-nextgen
bcbio-nextgen 1.1.4a py36_0 bioconda
bcbio-nextgen-vm 0.1.5 py36_0 bioconda
$ which bcbio_conda
/export/home/ncit/external/a.mizeranschi/bcbio_nextgen/tools/bin/bcbio_conda
Could this be due to pinned versions of some dependencies? I was having a look at the following page, but I'm a bit wary of running conda with the --no-pin option: https://stackoverflow.com/questions/50381335/how-to-force-conda-to-upgrade-all-packages-regardless-of-dependency-graph?answertab=active#tab-top
Thanks for testing and sorry this is still stuck. It's definitely due to some kind of dependency issue, likely from transitioning this to python 3. Apologies; I'm hoping to have the next release soon and get through all this dependency churn. There are a lot of parts getting upgraded right now, and things should be more stable after this.
A better way to go is to specify the subversion we want:
bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen=1.1.4a0=py36_3
which will probably identify which packages are blocking it in your environment. I tested in a couple of places and dxpy not being python 3 was doing the blocking, but removing that is okay right now.
Hope this gets you a clean upgrade and things working.
Good catch, dxpy was the culprit on my system as well. Thanks again for your help; this upgrade got bcbio_vm.py cwl working again.
$ bcbio_conda install -n bcbiovm -c conda-forge -c bioconda bcbio-nextgen=1.1.4a0=py36_3
Collecting package metadata: done
Solving environment: done
## Package Plan ##
environment location: /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/anaconda/envs/bcbiovm
added / updated specs:
- bcbio-nextgen==1.1.4a0=py36_3
The following packages will be downloaded:
package | build
---------------------------|-----------------
bcbio-nextgen-1.1.4a | py36_3 2.1 MB bioconda
cryptography-2.6.1 | py36h9d9f1b6_0 607 KB conda-forge
pycurl-7.43.0.2 | py36h1ba5d50_0 185 KB
------------------------------------------------------------
Total: 2.9 MB
The following packages will be REMOVED:
dxpy-0.273.0-py36_0
The following packages will be UPDATED:
bcbio-nextgen 1.1.4a-py36_0 --> 1.1.4a-py36_3
cryptography 2.2.1-py36hdffb7b8_1 --> 2.6.1-py36h9d9f1b6_0
curl 7.64.0-h646f8bb_0 --> 7.64.0-h646f8bb_2
krb5 1.16.3-hc83ff2d_1000 --> 1.16.3-h05b26f9_1001
libcurl 7.64.0-h01ee5af_0 --> 7.64.0-h541490c_2
libssh2 1.8.0-h1ad7b7a_1003 --> 1.8.0-h90d6eec_1004
openssl 1.0.2r-h14c3975_0 --> 1.1.1b-h14c3975_1
python 3.6.7-hd21baee_1002 --> 3.6.7-h381d211_1004
The following packages will be DOWNGRADED:
pycurl 7.43.0.2-py36hb7f436b_0 --> 7.43.0.2-py36h1ba5d50_0
Proceed ([y]/n)?
Downloading and Extracting Packages
pycurl-7.43.0.2      | 185 KB | ########## | 100%
bcbio-nextgen-1.1.4a | 2.1 MB | ########## | 100%
cryptography-2.6.1   | 607 KB | ########## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Hello,
I'm getting intermittent errors when running tests with 100 small FASTQ files on an SGE cluster using Cromwell. The jobs are sometimes able to finish fine, but more often than not (4 out of the last 5 tries), they are crashing with errors such as the one below.
Several pages mention a problem with "cluster load" and memory getting used up, but could this also be a problem with hitting the limit on simultaneously open files, rather than memory?
The contents of the referenced stderr file are the following:
I found a seemingly related error report (https://github.com/aws/aws-parallelcluster/issues/26) where someone mentioned that this could be a problem with the cluster load. A different page (https://www.t3.gsic.titech.ac.jp/en/node/176) confirms this:
A bit more googling for this error came up with a report mentioning that a cause could be that RAM is getting used up on the master node: https://github.com/FCP-INDI/C-PAC/issues/13. The node that I'm using as master node for these jobs has 128 GB of RAM, so it seems a bit surprising that scheduling 100 small samples could use up all that amount of memory.
My tests involve variant calling with 5 tools (GATK, freebayes, samtools, varscan and platypus). Given that the FASTQ files are small, perhaps the VC jobs are being scheduled too fast for the master node to handle?
In case it's relevant, here is the output of ulimit -a from one of the compute nodes. Could the open files limit (1024) be the cause of these errors?
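One quick way to check the limit a job actually sees (independent of what a login shell reports) is Python's standard resource module, which queries the same per-process limits that ulimit does:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on simultaneously open file
# descriptors; the soft limit is what "ulimit -n" reports and what
# running jobs actually hit when they exhaust file handles.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))

# A process may raise its own soft limit up to the hard limit without
# admin intervention; going beyond the hard limit needs the cluster admin:
# resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Running a snippet like this inside a submitted job can confirm whether the scheduler propagates the raised limit to compute nodes, since limits set on the login node don't always carry over.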