Open meganshand opened 11 months ago
Upon further investigation (in discussion with @droazen and @lbergelson):
IntervalUtils.featureFileToIntervals holds the full interval list in memory before merging abutting intervals. That list becomes quite large in the GVCF-as-interval-list case, because so many very small intervals could be merged into very large intervals (or entire contigs).
We can't use IntervalMergerIterator because in GATK we can't assume the input intervals are sorted, so the full interval list has to live in memory.
Perhaps we could use an on-disk sorting collection? Or we could do an optimistic merge even if the intervals aren't sorted, and then sort and merge them again afterwards. This would help in the GVCF-as-interval-list case, but would not provide any benefit if the input isn't sorted.
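To make the optimistic-merge idea concrete, here is a minimal sketch in Python. It is not GATK code: the `Interval` tuple layout and the function names `optimistic_merge` and `sort_and_merge` are made up for illustration, and contigs are sorted as plain strings rather than by sequence dictionary order. The first pass merges each interval into the previous one only when they overlap or abut, so a sorted GVCF-like input collapses while streaming; the second pass sorts whatever is left and merges again, which is what keeps the result correct for unsorted input.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical representation: (contig, start, end), 1-based inclusive.
Interval = Tuple[str, int, int]


def optimistic_merge(intervals: Iterable[Interval]) -> Iterator[Interval]:
    """Streaming pass: merge an interval into the previous one when it
    starts inside, or directly after, the previous interval on the same
    contig. Makes no assumption that the input is sorted."""
    current: Optional[Interval] = None
    for contig, start, end in intervals:
        if (current is not None
                and contig == current[0]
                and current[1] <= start <= current[2] + 1):
            # Overlapping or abutting: extend the interval being built.
            current = (contig, current[1], max(current[2], end))
        else:
            if current is not None:
                yield current
            current = (contig, start, end)
    if current is not None:
        yield current


def sort_and_merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Final pass: sort (lexicographically by contig, then start) and merge.
    On sorted input the streaming merge is a complete merge."""
    return list(optimistic_merge(sorted(intervals)))


# Sorted, GVCF-like input: 1000 single-base abutting blocks collapse to one
# interval during the streaming pass, so only one interval is ever held.
gvcf_like = [("chr1", pos, pos) for pos in range(1, 1001)]
reduced = list(optimistic_merge(gvcf_like))

# Unsorted input: the streaming pass merges nothing here, and the final
# sort-and-merge pass does all the work.
unsorted = [("chr1", 10, 20), ("chr1", 1, 9), ("chr1", 21, 30)]
partially_merged = list(optimistic_merge(unsorted))
final = sort_and_merge(partially_merged)
```

In this sketch the memory saving comes entirely from the first pass, so, as noted above, it only helps when the input happens to be mostly sorted.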
As a workaround for now, we'll add an argument to Picard's VcfToIntervalList to merge abutting intervals and add that to the command line in the ReblockGVCFs WDL.
@droazen @lbergelson Please add/clarify anything here I missed.
ValidateVariants requires a large amount of memory (>16 GB) to validate a GVCF when another GVCF is used as the interval list. This is not the case if a regular interval list is used instead. This comes up in the production ReblockGVCFs pipeline, since we validate the reblocked GVCF using the input (unreblocked) GVCF as the interval list to validate over (with -L). For now we can just use larger-memory machines to run this tool, but it is confusing to me why using a ~4 GB GVCF as an interval list would cause such a large increase in memory requirements.