broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Profile and optimize VariantFiltration #1035

Closed akiezun closed 9 years ago

akiezun commented 9 years ago

The goal is for GATK4 to be at least as fast as GATK 3.4 on a single thread. This applies to the walker version of the tool. The ticket can be split into (a) profile and (b) optimize, if needed.

akiezun commented 9 years ago

Case 1: VariantFiltration (-filter 'DP > 100')
file: /humgen/gsa-hpprojects/GATK/bundle/current/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.b37.vcf (1.9 GB)

Running on a Mac laptop with SSD. Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17.
92-115 seconds (3 runs) on GATK 4.pre-alpha-41-ge1cafbb-SNAPSHOT
103-128 seconds (3 runs) on GATK 3.4-46-gbc02625
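For anyone repeating this kind of measurement, here is a minimal sketch of a timing harness that runs a command several times and reports the wall-clock range in the same "min-max seconds (N runs)" form used above. The helper names `time_runs` and `summarize` are made up for illustration, and the command is a stand-in: the exact GATK command lines are not given in this thread, so substitute your own.

```python
import subprocess
import sys
import time


def time_runs(cmd, runs=3):
    """Run `cmd` (a list of arguments) `runs` times; return wall-clock seconds per run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard tool output so terminal I/O does not skew the timing.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Format durations as 'min-max seconds (N runs)'."""
    return f"{min(durations):.0f}-{max(durations):.0f} seconds ({len(durations)} runs)"


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the actual tool invocation.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    print(summarize(time_runs(placeholder_cmd, runs=3)))
```

Reporting the range rather than a single number matters here because the run-to-run spread (roughly 20 seconds in Case 1) is about as large as the difference between the two versions.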

akiezun commented 9 years ago

Here's a profile. It's clear that all the time goes into reading and writing, and almost no overhead comes from the engine. Closing this: we win, and there are no obvious problems in the profile.

[image: profile]

akiezun commented 9 years ago

Reopening: will look at NFS too.

akiezun commented 9 years ago

Case 2: file on NFS
file: /humgen/gsa-hpprojects/GATK/bundle/current/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.b37.vcf (1.9 GB)
ref: /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta

Linux 2.6.32-573.3.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132.
2.9-3 minutes (2 runs) on GATK 3.4-46-gbc02625
2.2 minutes (2 runs) on GATK 4.pre-alpha-45-g168cd60

time GATK3

real    3m3.714s
user    3m52.474s

time GATK4

real    2m14.264s
user    3m17.439s
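
As a quick check on the `time` output above, the sketch below converts the reported durations to seconds and computes the relative wall-clock speedup. The `parse_time` helper is hypothetical and only handles the `XmY.YYYs` form shown here; the timing values are copied from the output above.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '3m3.714s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the `time` output in this comment.
timings = {
    "GATK3": {"real": "3m3.714s", "user": "3m52.474s"},
    "GATK4": {"real": "2m14.264s", "user": "3m17.439s"},
}
seconds = {
    tool: {kind: parse_time(value) for kind, value in times.items()}
    for tool, times in timings.items()
}

# Wall-clock speedup of GATK4 over GATK3, and the fraction of time saved.
speedup = seconds["GATK3"]["real"] / seconds["GATK4"]["real"]
reduction = 1 - seconds["GATK4"]["real"] / seconds["GATK3"]["real"]

print(f"speedup: {speedup:.2f}x, wall-clock reduction: {reduction:.1%}")
for tool, times in seconds.items():
    print(f"{tool}: user/real = {times['user'] / times['real']:.2f}")
```

On these numbers GATK4 comes out roughly 1.37x faster in wall-clock time on NFS. In both runs `user` exceeds `real`, which suggests some work ran on more than one core (JVM background threads are one possible cause); that is an inference from the numbers, not something measured in this thread.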

akiezun commented 9 years ago

resolving

akiezun commented 9 years ago

but see #1129