broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Profile VariantFiltration with --clusterSize and --clusterWindowSize #1129

Closed by droazen 9 years ago

droazen commented 9 years ago

VariantFiltration in GATK4 is likely to underperform GATK3 when clustered SNP filtering is turned on. Clustered SNP filtering queries the driving source of variants both before and after the current variant, and caching is turned off for the copy of the driving datasource that is added to the FeatureManager for querying, because our caching strategy for features can currently only look ahead.
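To illustrate why clustered SNP filtering needs to look both before and after the current variant, here is a minimal Python sketch. It is not GATK's implementation: the function name `clustered_snps`, the single-contig assumption, and the inclusive `<=` window comparison are all assumptions made for illustration. It flags every position that belongs to a run of `cluster_size` SNPs spanning at most `window_size` bases.

```python
def clustered_snps(positions, cluster_size, window_size):
    """Return the set of positions that belong to at least one SNP cluster.

    Hypothetical helper for illustration only (not GATK code). Assumes all
    positions are on one contig and that a cluster is `cluster_size`
    consecutive SNPs whose first and last positions differ by at most
    `window_size` bases.
    """
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - cluster_size + 1):
        j = i + cluster_size - 1
        if pos[j] - pos[i] <= window_size:
            # Every SNP in this run is part of a cluster.
            flagged.update(pos[i:j + 1])
    return flagged


positions = [100, 150, 190, 1000, 1500, 1550, 1590, 5000]
print(sorted(clustered_snps(positions, cluster_size=3, window_size=100)))
# prints [100, 150, 190, 1500, 1550, 1590]
```

In this toy input, position 100 is flagged only because of the variants that follow it, and position 190 only because of the variants that precede it, so deciding the filter status of a single variant requires access to its neighbours on both sides.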

Task is to run VariantFiltration on both GATK3 and 4 with --clusterSize and --clusterWindowSize, record how much worse GATK4 performs for this use case, and (assuming it does lose to GATK3) create a beta ticket to fix it (not urgent enough for alpha).

droazen commented 9 years ago

A quick one for @akiezun

akiezun commented 9 years ago

It's bad, as expected.

Test case is a file on NFS: /humgen/gsa-hpprojects/GATK/bundle/current/b37/CEUTrio.HiSeq.WGS.b37.bestPractices.b37.vcf (1.9 GB)

Reference: /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta

This was run on dataflow01.broadinstitute.org (Linux 2.6.32-573.3.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132).

Using parameters --clusterSize 3 --clusterWindowSize 100

gatk3 (GATK 3.4-46-gbc02625): real 5m9.698s, user 4m34.835s

gatk4: real 8m45.901s, user 10m18.663s
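For a sense of scale, the timings above can be converted to seconds and compared. The short Python sketch below does only that arithmetic; the helper names `to_seconds` and `slowdown` are hypothetical, and the input strings are the `time` values reported in this comment.

```python
def to_seconds(t):
    """Convert a `time`-style string such as '5m9.698s' to seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def slowdown(gatk3, gatk4):
    """Ratio of GATK4 time to GATK3 time (values above 1 mean GATK4 is slower)."""
    return to_seconds(gatk4) / to_seconds(gatk3)


print(f"real: {slowdown('5m9.698s', '8m45.901s'):.2f}x")
print(f"user: {slowdown('4m34.835s', '10m18.663s'):.2f}x")
# prints:
# real: 1.70x
# user: 2.25x
```

On this single run, GATK4 took about 1.7 times the wall-clock time and about 2.25 times the user CPU time of GATK3.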

droazen commented 9 years ago

Ok, then let's create a beta (not alpha) ticket to address this.

akiezun commented 9 years ago

Done: #1151