broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 589 forks source link

LocusWalker very slow when using a large interval file #2356

Open LeeTL1220 opened 7 years ago

LeeTL1220 commented 7 years ago

Included is a JProfiler screenshot of where it is spending all of its time.

As you can see below, it has not even run any of the LocusWalker.apply(...) code.

lichtens@OncobuntuMk3:~/IdeaProjects/hellbender-protected$ java -jar build/libs/gatk-protected.jar Pileup -I ~/broad_oncotator_configs/hcc_purity/SM-74NEG.bam -O ~/broad_oncotator_configs/pileup.out -L /home/lichtens/broad_oncotator_configs/allchr.1kg.phase3.v5a.snp.maf10.biallelic.recode.fixed.prune5.trim1M.test.interval_list
15:04:36.262 INFO  IntelGKLUtils - Trying to load Intel GKL library from:
    jar:file:/home/lichtens/IdeaProjects/hellbender-protected/build/libs/gatk-protected-all-0556d5b-SNAPSHOT-spark_standalone.jar!/com/intel/gkl/native/libIntelGKL.so
15:04:36.344 INFO  IntelGKLUtils - Intel GKL library loaded from classpath.
[January 25, 2017 3:04:36 PM EST] org.broadinstitute.hellbender.tools.walkers.qc.Pileup  --output /home/lichtens/broad_oncotator_configs/pileup.out --intervals /home/lichtens/broad_oncotator_configs/allchr.1kg.phase3.v5a.snp.maf10.biallelic.recode.fixed.prune5.trim1M.test.interval_list --input /home/lichtens/broad_oncotator_configs/hcc_purity/SM-74NEG.bam  --showVerbose false --outputInsertLength false --maxDepthPerSample 0 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false --disableAllReadFilters false
[January 25, 2017 3:04:36 PM EST] Executing as lichtens@OncobuntuMk3 on Linux 3.19.0-30-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_51-b16; Version: Version:0556d5b-SNAPSHOT
15:04:36.348 INFO  Pileup - Defaults.BUFFER_SIZE : 131072
15:04:36.348 INFO  Pileup - Defaults.COMPRESSION_LEVEL : 5
15:04:36.348 INFO  Pileup - Defaults.CREATE_INDEX : false
15:04:36.349 INFO  Pileup - Defaults.CREATE_MD5 : false
15:04:36.349 INFO  Pileup - Defaults.CUSTOM_READER_FACTORY : 
15:04:36.349 INFO  Pileup - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
15:04:36.349 INFO  Pileup - Defaults.NON_ZERO_BUFFER_SIZE : 131072
15:04:36.349 INFO  Pileup - Defaults.REFERENCE_FASTA : null
15:04:36.349 INFO  Pileup - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
15:04:36.349 INFO  Pileup - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:04:36.349 INFO  Pileup - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
15:04:36.350 INFO  Pileup - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:04:36.350 INFO  Pileup - Defaults.USE_CRAM_REF_DOWNLOAD : false
15:04:36.350 INFO  Pileup - Deflater IntelDeflater
15:04:36.350 INFO  Pileup - Initializing engine
WARNING: BAM index file /home/lichtens/broad_oncotator_configs/hcc_purity/SM-74NEG.bai is older than BAM /home/lichtens/broad_oncotator_configs/hcc_purity/SM-74NEG.bam
15:04:38.560 INFO  IntervalArgumentCollection - Processing 999914 bp from intervals
15:04:38.630 INFO  Pileup - Done initializing engine
15:04:38.635 INFO  ProgressMeter - Starting traversal
15:04:38.636 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Records Processed   Records/Minute
JProfiler> Protocol version 49
JProfiler> Using JVMTI
JProfiler> JVMTI version 1.1 detected.
JProfiler> 64-bit library
JProfiler> Listening on port: 31757.
JProfiler> Attach mode initialized
JProfiler> Instrumenting native methods.
JProfiler> Can retransform classes.
JProfiler> Can retransform any class.
JProfiler> Retransforming 8 base class files.
JProfiler> Base classes instrumented.
JProfiler> Native library initialized
JProfiler> Using dynamic instrumentation
JProfiler> Time measurement: elapsed time
JProfiler> CPU profiling enabled
JProfiler> Initializing configuration.
JProfiler> Retransforming 3697 class files.
JProfiler> Configuration updated.

oncobuntu_mk3

LeeTL1220 commented 7 years ago

A little more info about the interval file:

lichtens@OncobuntuMk3:~$ head /home/lichtens/broad_oncotator_configs/allchr.1kg.phase3.v5a.snp.maf10.biallelic.recode.fixed.prune5.trim1M.test.interval_list
1:14604-14604
1:14930-14930
1:15211-15211
1:15820-15820
1:30923-30923
1:49298-49298
1:51479-51479
1:54716-54716
1:55545-55545
1:58814-58814
lichtens@OncobuntuMk3:~$ wc /home/lichtens/broad_oncotator_configs/allchr.1kg.phase3.v5a.snp.maf10.biallelic.recode.fixed.prune5.trim1M.test.interval_list
  999914   999914 21793054 /home/lichtens/broad_oncotator_configs/allchr.1kg.phase3.v5a.snp.maf10.biallelic.recode.fixed.prune5.trim1M.test.interval_list
lichtens@OncobuntuMk3:~$ 
droazen commented 7 years ago

The issue is definitely in htsjdk -- GATK is just calling queryOverlapping() a single time per file with all of the intervals, having first called QueryInterval.optimizeIntervals() on the interval list as required by the API.