elixir-no-nels / rbFlow-Germline

A workflow engine with a germline calling pipeline running in a container
MIT License

Decide on choice of GATK version for the production release, 3.8, 4.0.2.0 or 4.0.8.1 #11

Closed: oskarvid closed this issue 6 years ago

oskarvid commented 6 years ago

Reason to go with 3.8: old and trusted.

Reason to go with 4.0.2.0: it has been thoroughly tested, and the pipeline runs fine with 16 cores and 16GB RAM.

Reason to go with 4.0.8.1: bug fixes and added features. Testing is in progress. Current results with the test_R*.fastq.gz files indicate issues with RAM usage: 8 threads and 40GB RAM is so far the smallest configuration that has been verified to work for HaplotypeCaller. These results are ridiculous, and a full-sized NA12878 test is in progress to see whether this is reproduced with simulated real data.

oskarvid commented 6 years ago

The 4.0.8.1 version has regressed in terms of RAM requirements for HaplotypeCaller. The 4.0.2.0 version can run the entire pipeline with 16 threads and 16GB RAM, but the 4.0.8.1 cannot.

If it is important to use 4.0.8.1, I see no other option than to increase the minimum required RAM for at least that tool. Because we run the pipeline on a single node on Colossus, that means we need a VM, preferably with 64GB RAM. If we take that path we can use e.g. 60GB for all tools and possibly shave a couple of hours off the total runtime. As a side note, 40GB is the only amount of RAM I have managed to run HaplotypeCaller with on this version, so we might as well go big or go home if this will be the GATK version we use going forward.

Otherwise we can go back to the latest known stable version. Or, if it is really important that we have the most recent version that can handle 16GB RAM and 16 threads, I will spend more time testing versions more recent than 4.0.2.0.

I suggest we focus on delivering the pipeline and try to upgrade things for the next release. This testing can easily take more than one work day, which inevitably delays things further.

oskarvid commented 6 years ago

I have now tested the following bioconda GATK 4 versions: 4.0.2.0, 4.0.2.1, 4.0.3.0, 4.0.4.0, 4.0.5.1, 4.0.5.2, 4.0.6.0, 4.0.7.0, 4.0.8.1, and the only ones that work with 16 cores and 16GB RAM are 4.0.2.0 and 4.0.2.1.

Either we increase the minimum required amount of RAM to 64GB or go with 4.0.2.1 and stay with 16GB required RAM. I suggest we stop exploring new features, new versions and new things and focus on delivering the first milestone now.

oskarvid commented 6 years ago

A full single sample test run with all NA12878 files, GATK version 4.0.8.1, 16 threads and 55GB RAM is currently ongoing on UH-Sky. Once it runs flawlessly I will commit the code to this repository and then move it over to TSD and run a test there too.

GhisF commented 6 years ago

The release notes from 4.0.2.0 to 4.0.6.0 mention many bug fixes for HaplotypeCaller. I think we should consider using a GATK version >= 4.0.6.0.

HaplotypeCaller is also significantly faster when running with the --new-qual argument.
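For reference, here is a minimal sketch of how such a call could be assembled, assuming the pipeline invokes the `gatk` wrapper script directly. The function name, file names and the 40GB heap are placeholders for illustration and are not taken from the rbFlow code; only `--java-options`, `-R`, `-I`, `-O` and `--new-qual` are GATK 4.0.x arguments.

```python
def haplotypecaller_cmd(reference, bam, out_vcf, heap_gb, new_qual=True):
    """Build the argument list for a GATK4 HaplotypeCaller run.

    heap_gb is passed to the JVM as -Xmx, i.e. the maximum Java heap size.
    All paths are supplied by the caller; nothing here touches the filesystem.
    """
    cmd = [
        "gatk",
        "--java-options", f"-Xmx{heap_gb}g",  # cap the Java heap
        "HaplotypeCaller",
        "-R", reference,   # reference FASTA
        "-I", bam,         # input alignments
        "-O", out_vcf,     # output variant calls
    ]
    if new_qual:
        # Opt in to the newer QUAL score model
        cmd.append("--new-qual")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    print(" ".join(haplotypecaller_cmd("ref.fasta", "sample.bam", "sample.vcf.gz", 40)))
```

Building the command as a list keeps the heap size and the optional flag in one place, so the same function can be reused when comparing RAM settings between GATK versions.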

oskarvid commented 6 years ago

4.0.8.1 has been verified to work on both UH-Sky and Colossus on TSD, using 16 threads and 55GB on UH-Sky and 16 threads and 50GB on Colossus. I tried using 55GB on Colossus too, but that pushed the total RAM usage over the requested node limit of 60GB, which causes Slurm to kill the job.
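The arithmetic behind the Colossus result can be sketched as follows. This assumes the 50GB and 55GB figures are the memory handed to the tool (for example the JVM heap) and that everything else on the node counts against the same 60GB limit. The 10GB overhead is an assumed value, chosen only because it is consistent with 55GB failing and 50GB working; it has not been measured.

```python
def fits_node_limit(tool_gb, overhead_gb, node_limit_gb):
    """Return True if tool memory plus overhead stays within the node limit."""
    return tool_gb + overhead_gb <= node_limit_gb


def max_tool_gb(node_limit_gb, overhead_gb):
    """Largest tool memory setting that still leaves room for the overhead."""
    budget = node_limit_gb - overhead_gb
    if budget <= 0:
        raise ValueError("overhead leaves no memory for the tool")
    return budget


if __name__ == "__main__":
    NODE_LIMIT_GB = 60   # requested node limit on Colossus (from this thread)
    OVERHEAD_GB = 10     # ASSUMPTION: non-heap JVM memory, OS and other processes

    for tool_gb in (50, 55):
        ok = fits_node_limit(tool_gb, OVERHEAD_GB, NODE_LIMIT_GB)
        print(f"{tool_gb}GB: {'fits' if ok else 'exceeds limit'}")
    print("max tool memory:", max_tool_gb(NODE_LIMIT_GB, OVERHEAD_GB), "GB")
```

The point is only that the tool's memory setting has to be picked with headroom below the Slurm allocation, not equal to it.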

And with that I will close this issue.