broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

profile and optimize simple variant walkers: CountVariants #1036

Closed akiezun closed 9 years ago

akiezun commented 9 years ago

The goal is to be at least as fast as GATK3.4 on a single thread. This is for the walker version of the tool. The ticket can be split into a) profile and b) optimize, if needed.

Note: the GATK3.4 version is called CountRODs

The reason to do this is to see if the engine itself adds any overhead.

akiezun commented 9 years ago

Case 1: CountVariants vs CountRODs
file /humgen/gsa-hpprojects/GATK/bundle/current/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.b37.vcf (size 1.9Gb)

GATK4 run using build/install/gatk/bin/gatk (i.e., not from a big jar)

Running on Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17.
56-69 seconds (6 runs) on GATK 3.4-46-gbc02625
34-39 seconds (6 runs) on GATK 4.pre-alpha-41-ge1cafbb-SNAPSHOT

GATK3.4 has an additional ~3-6s startup/winddown time, GATK4 has an additional ~2s startup/winddown time
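To put the Case 1 ranges in perspective, here is a minimal sketch of the speedup arithmetic. It uses only the run-time ranges reported above; the helper name `speedup_bounds` is made up for illustration, and the startup/winddown times are not factored in.

```python
def speedup_bounds(baseline, candidate):
    """Return (worst, best) wall-clock speedup from two (min, max) ranges in seconds."""
    b_lo, b_hi = baseline
    c_lo, c_hi = candidate
    # Worst case: fastest baseline run vs slowest candidate run.
    # Best case: slowest baseline run vs fastest candidate run.
    return b_lo / c_hi, b_hi / c_lo

gatk3 = (56, 69)  # seconds, 6 runs, GATK 3.4 (CountRODs)
gatk4 = (34, 39)  # seconds, 6 runs, GATK 4 (CountVariants)

worst, best = speedup_bounds(gatk3, gatk4)
print(f"GATK4 speedup over GATK3.4: {worst:.2f}x to {best:.2f}x")
# prints: GATK4 speedup over GATK3.4: 1.44x to 2.03x
```

Even the pessimistic pairing leaves GATK4 ahead, which is consistent with the ranges not overlapping.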

akiezun commented 9 years ago

On profiling, it's clear that the engine adds almost no overhead on top of the htsjdk iterators - see screenshot from JProfiler [image]

akiezun commented 9 years ago

closing this as resolved. We win and there's no obvious badness in the profile.

akiezun commented 9 years ago

reopen - will include NFS too

akiezun commented 9 years ago

Case 2: running on NFS.
vcfFile /humgen/gsa-hpprojects/GATK/bundle/current/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.b37.vcf (size 1.9Gb)
reference /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta

Running on the dataflow01 host: Linux 2.6.32-573.3.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132.

67-69 seconds (3 runs) GATK v3.4-46-gbc02625
40-41 seconds (3 runs) GATK 4.pre-alpha-45-g168cd60

example time GATK3:

real    1m13.062s
user    1m22.039s
sys     0m13.290s

example time GATK4

real    0m40.728s
user    1m38.028s
sys     0m4.842s
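
The two `time` samples above can be compared more directly. This is a rough sketch, assuming the output is in the shell's `NmS.SSSs` format as shown; `parse_time` and `cpu_to_wall` are hypothetical helper names, not part of GATK.

```python
import re

def parse_time(text):
    """Parse shell `time` lines such as 'real 1m13.062s' into seconds."""
    out = {}
    for name, minutes, seconds in re.findall(r"(real|user|sys)\s+(\d+)m([\d.]+)s", text):
        out[name] = int(minutes) * 60 + float(seconds)
    return out

def cpu_to_wall(t):
    """CPU seconds (user + sys) consumed per wall-clock second."""
    return (t["user"] + t["sys"]) / t["real"]

# Example timings copied from the Case 2 runs on NFS.
gatk3 = parse_time("real 1m13.062s\nuser 1m22.039s\nsys 0m13.290s")
gatk4 = parse_time("real 0m40.728s\nuser 1m38.028s\nsys 0m4.842s")

print(f"wall-clock speedup: {gatk3['real'] / gatk4['real']:.2f}x")  # 1.79x
print(f"GATK3 CPU/wall: {cpu_to_wall(gatk3):.2f}")                  # 1.30
print(f"GATK4 CPU/wall: {cpu_to_wall(gatk4):.2f}")                  # 2.53
```

A CPU/wall ratio above 1 means more than one core was busy on average. In these samples GATK4 finishes sooner in wall-clock terms while using slightly more total CPU time; which threads account for that extra CPU is not shown by these numbers.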

akiezun commented 9 years ago

Case 3: bigger file on NFS
file /humgen/gsa-hpprojects/GATK/bundle/current/b37/dbsnp_138.b37.vcf (10Gb)
ref /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta

Running on the dataflow01 host: Linux 2.6.32-573.3.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132.
6 minutes (2 runs) GATK v3.4-46-gbc02625
4.1 minutes (2 runs) GATK 4.pre-alpha-45-g168cd60

example time GATK3

real    6m7.731s
user    6m39.912s

example time GATK4

real    4m6.494s
user    6m8.810s

akiezun commented 9 years ago

resolved