broadinstitute/gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Try to get HaplotypeCaller to run in <= 3 GB memory (or, failing that, <= 6 GB) #2591

Open · droazen opened this issue 7 years ago

droazen commented 7 years ago

A request from @eitanbanks and @yfarjoun :

"Yossi and I are just looking at our production processing costs and the HaplotypeCaller is the biggest culprit right now. That's because it currently requires these high memory machines. If we could somehow get it to use a max of 3 GB RAM then we'd cut 10% off of the entire pipeline. Even 6GB would be okay, but 3 would be huge. What do you think -- will it be possible?"

sooheelee commented 7 years ago

Does every ROI require this much memory or does the memory requirement fluctuate? If we are parallelizing runs, then at any given moment do all the sites being processed require this much memory? Can the memory across the threads be shared?

droazen commented 7 years ago

@sooheelee We're talking about peak memory usage here. If we can get the peak memory usage below certain thresholds, we can provision cheaper machines on the cloud for this part of the pipeline.

Memory across threads can be shared, yes, but not across separate processes.

sooheelee commented 7 years ago

I'd be interested in knowing the composition/characteristics of the sites that drive peak memory use.

droazen commented 7 years ago

I think typically they are bad/repetitive regions of the genome (near the centromeres, for example) to which large numbers of reads get erroneously mapped. For HaplotypeCaller specifically, sites with large numbers of alleles / a complicated haplotype graph might also cause memory use and/or runtime to explode.

sooheelee commented 7 years ago

My understanding is that production excludes calling on the majority of such sites via their intervals list. So I'd be interested in knowing what fraction of all the sites that go through graph assembly are these high-memory sites, and what fraction of those may be due to alternate haplotypes as represented by the ALT contigs in GRCh38.

droazen commented 7 years ago

@sooheelee Might be a question for @yfarjoun

eitanbanks commented 7 years ago

I'm pretty sure that we don't exclude any regions in the hg38 pipeline right now.

vdauwera commented 7 years ago

Yes we do, we run on a list of calling intervals that avoids empty/blackhole/timesuck regions.

eitanbanks commented 7 years ago

Are you talking about b37 or hg38? I thought the only things missing in hg38 are where the reference is all Ns.

vdauwera commented 7 years ago

Hg38. Maybe you're right that it's only N regions -- I haven't actually looked.

yfarjoun commented 7 years ago

That is correct. N's only (on the main contigs, not including Y and MT)

We looked into the slow regions and didn't find anything worth doing.

sooheelee commented 7 years ago

Thanks @eitanbanks for the clarification that our calling intervals cover all regions in GRCh38 except the Ns. So the peak-memory regions may or may not correspond to the regions we previously excluded for b37 using intervals, though according to @yfarjoun the slow regions were not slow enough to be worth excluding for GRCh38. This reminds me that GRCh38 was designed in part to even out high-coverage pileups, with decoy sequences soaking up reads that would otherwise cause issues. So I would hypothesize that the profile of the regions where memory use peaks will differ between GRCh38 and previous assemblies.

I'd be interested in confirming (i) whether the peak-memory regions are the same or different across samples, and (ii) the distribution of the peak-memory regions, e.g. 50% of regions requiring 50% more memory than the mean versus 10% of regions requiring 200% more.

droazen commented 5 years ago

@jamesemery While you're in the HaplotypeCallerEngine doing optimizations, you should profile peak memory usage as well and see if we can get it down to < 3 GB. This would reduce costs by allowing us to use cheaper instances on the cloud.
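
(For reference, a minimal sketch of one way to measure peak process memory for a single HaplotypeCaller run; the file paths and the 3 GB heap cap are placeholders, not values from this thread. Note that total process memory will exceed the Java heap because of off-heap and native allocations, so the `-Xmx` value alone does not bound peak RSS.)

```python
#!/usr/bin/env python3
"""Sketch: run HaplotypeCaller with a capped JVM heap and report the peak
resident set size of the child process (Linux; ru_maxrss is in KiB)."""
import resource
import subprocess

# Placeholder inputs -- substitute real reference/BAM/output paths.
cmd = [
    "gatk", "--java-options", "-Xmx3g",   # cap the Java heap at 3 GB
    "HaplotypeCaller",
    "-R", "reference.fasta",
    "-I", "sample.bam",
    "-O", "sample.g.vcf.gz",
    "-ERC", "GVCF",
]
subprocess.run(cmd, check=True)

peak_gb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / (1024 ** 2)
print(f"peak RSS of child processes: {peak_gb:.2f} GB")
```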

jamesemery commented 5 years ago

@droazen It's worth noting that the numbers/goals for this runtime are different now that PAPI v2 is being used more frequently. Since the maximum memory per CPU for a custom machine on Google Cloud is 6.5 GB, that is the absolute maximum memory a task can take without having to eat the cost of adding a second core. Any memory savings we can get below 6.5 GB (not just getting under 3 GB) will still result in savings on PAPI v2 and are thus worthwhile.

Relates to #4272
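
(To illustrate the arithmetic behind that 6.5 GB-per-vCPU ceiling, a small sketch; the threshold is the one quoted above, and the memory values are arbitrary examples.)

```python
import math

MAX_GB_PER_VCPU = 6.5  # custom-machine memory-per-vCPU ceiling quoted above

def min_vcpus(mem_gb: float, vcpus_needed_for_work: int = 1) -> int:
    """Minimum vCPUs a custom machine must have to satisfy a memory request."""
    return max(vcpus_needed_for_work, math.ceil(mem_gb / MAX_GB_PER_VCPU))

for mem_gb in (3, 6, 6.5, 7, 13, 14):
    print(f"{mem_gb:>4} GB -> {min_vcpus(mem_gb)} vCPU(s)")
# Anything over 6.5 GB forces a second (mostly idle) core onto the bill.
```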

droazen commented 5 years ago

@jamesemery As part of this, you should check whether Cromwell has implemented auto-retry with automatic memory doubling yet. It would be much easier to show that we need ~3 GB or less in the typical case (and rely on automatic retry for pathological cases) than to prove that that amount is sufficient even in the worst-case scenario.

ldgauthier commented 5 years ago

Ruchi has a branch for memory retries that I haven't tried yet, but it's definitely not standard in Cromwell.
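
(The retry-with-memory-doubling policy being discussed amounts to something like the following conceptual sketch; it is not Cromwell's implementation, and the starting size, cap, and `run_shard` hook are hypothetical.)

```python
def run_with_memory_doubling(run_shard, start_gb: float = 3, cap_gb: float = 24):
    """Conceptual sketch of retry-on-OOM with doubled memory (not Cromwell code).

    `run_shard` is assumed to run one shard with the given memory budget and
    raise MemoryError if the task is killed for exceeding it.
    """
    mem_gb = start_gb
    while True:
        try:
            return run_shard(mem_gb)
        except MemoryError:
            if mem_gb * 2 > cap_gb:
                raise  # give up rather than request an unreasonable machine
            # Typical shards succeed at the small size; only pathological
            # regions pay for a retry at double the memory.
            mem_gb *= 2
```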