broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

ValidateVariants memory usage is high when using a gvcf as the interval list #8608

Open meganshand opened 11 months ago

meganshand commented 11 months ago

ValidateVariants requires a large amount of memory (>16 GB) to validate a GVCF when another GVCF is used as the interval list. This is not the case when a regular interval list is used instead. It comes up in the production ReblockGVCFs pipeline, where we validate the reblocked GVCF using the input (unreblocked) GVCF as the interval list to validate over (with -L). For now we can just run this tool on larger-memory machines, but it is not clear to me why using a ~4 GB GVCF as an interval list would increase the memory requirement so much.

meganshand commented 8 months ago

Upon further investigation (in discussion with @droazen and @lbergelson):

IntervalUtils.featureFileToIntervals holds the full interval list in memory before merging abutting intervals. In the GVCF-as-interval-list case that list becomes quite large, because it contains a great many very small intervals that could be merged into a few very large intervals (or entire contigs).
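To make the size difference concrete, here is a minimal Python sketch of the "load everything, then sort and merge" pattern. It is not GATK's code: `merge_abutting` and the toy `blocks` list are hypothetical names made up for illustration, and the 10 bp blocks only stand in for GVCF reference blocks.

```python
from typing import List, Tuple

# (contig, start, end), 1-based inclusive. Hypothetical type for illustration only.
Interval = Tuple[str, int, int]


def merge_abutting(intervals: List[Interval]) -> List[Interval]:
    """Sort all intervals, then merge any that overlap or abut."""
    merged: List[Interval] = []
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            # Extend the previous interval instead of storing a new one.
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return merged


# Toy stand-in for GVCF blocks: 100,000 abutting 10 bp intervals on one contig.
# The whole list exists in memory before any merging happens.
blocks = [("chr1", i * 10 + 1, (i + 1) * 10) for i in range(100_000)]
merged = merge_abutting(blocks)
print(len(blocks), "->", len(merged))
```

In this toy case the peak memory is driven by the 100,000 unmerged entries, even though the merged result is a single interval.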

We can't use IntervalMergerIterator because in GATK we can't assume the input intervals are sorted, so the full interval list has to live in memory.

Perhaps we could use an on-disk sorting collection? Or we could do an optimistic merge even if the intervals aren't sorted, and then sort and merge them again afterwards. This would help in the GVCF-as-interval-list case, but would not provide any benefit if the input isn't sorted.
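The optimistic-merge idea could look roughly like the Python sketch below. This is an assumption about how such a two-pass approach might work, not a proposal for the actual GATK implementation; `optimistic_merge`, `sort_and_merge`, and `merge_intervals` are hypothetical names.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# (contig, start, end), 1-based inclusive. Hypothetical type for illustration only.
Interval = Tuple[str, int, int]


def optimistic_merge(intervals: Iterable[Interval]) -> Iterator[Interval]:
    """First pass: merge neighbours in input order, without assuming sortedness."""
    current: Optional[Interval] = None
    for contig, start, end in intervals:
        if (current is not None and current[0] == contig
                and current[1] <= start <= current[2] + 1):
            # Next interval overlaps or abuts the running one: extend it.
            current = (contig, current[1], max(current[2], end))
        else:
            if current is not None:
                yield current
            current = (contig, start, end)
    if current is not None:
        yield current


def sort_and_merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Second pass: full sort and merge, for anything the first pass missed."""
    merged: List[Interval] = []
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            merged[-1] = (contig, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((contig, start, end))
    return merged


def merge_intervals(intervals: Iterable[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """Return (what had to be held in memory after pass one, final merged list)."""
    first_pass = list(optimistic_merge(intervals))
    return first_pass, sort_and_merge(first_pass)


# Sorted, abutting input streamed from a generator: pass one keeps one interval.
sorted_blocks = (("chr1", i * 10 + 1, (i + 1) * 10) for i in range(100_000))
held_sorted, final_sorted = merge_intervals(sorted_blocks)

# Unsorted input: pass one merges nothing, pass two still gives the right answer.
unsorted_blocks = [("chr1", 11, 20), ("chr1", 1, 10), ("chr1", 21, 30)]
held_unsorted, final_unsorted = merge_intervals(unsorted_blocks)
print(len(held_sorted), len(held_unsorted), final_unsorted)
```

The first pass only ever joins intervals that overlap or abut, so the covered territory is unchanged and the second pass produces the same result as sorting and merging the raw input. The memory saving depends entirely on how sorted the input happens to be.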

As a workaround for now, we'll add an argument to Picard's VcfToIntervalList to merge abutting intervals and add that to the command line in the ReblockGVCFs WDL.

@droazen @lbergelson Please add/clarify anything here I missed.