chapmanb closed this issue 7 years ago.
@chapmanb OK, we will try to find a solution ASAP.
Might be worth doing a quick check with realignment switched off, to see if that is the reason.
Miika -- thanks for the suggestion. It doesn't look like realignment is the culprit here, as VarDict still uses similar amounts of memory with -k 0 set. Happy to try other command line tweaks if it would help.
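For reference, the test looked roughly like this (a sketch based on VarDict's documented usage, with placeholder paths and sample names; -k 0 is the switch that disables local realignment):

vardict -G /path/to/hg19.fa -f 0.01 -N sample -b sample.bam -k 0 -c 1 -S 2 -E 3 -g 4 regions.bed | teststrandbias.R | var2vcf_valid.pl -N sample -f 0.01 > sample.vcf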
Hello Brad! VarDict memory usage depends mainly on the size and BAM coverage of each region, not on the number of regions. VarDict keeps in memory all variations contributed by reads in a particular region; once a region has been processed, VarDict outputs all of its variations and the memory can be freed. The given test data contains regions of 1 million bp, which demands quite a lot of memory, since VarDict stores all variations in memory until the region's processing is complete. As an experiment, we split the 1 million bp region into 1 thousand chunks; VarDict then consumes a moderate amount of memory and outputs the same results on the given data (except for the region information).
Memory usage with the huge regions: [memory-usage plot]
After splitting into small chunks: [memory-usage plot]
The following small awk program can help split huge regions into chunks of at most MAX_REG_SIZE.
#splitter.awk
# Split every BED region longer than MAX_REG_SIZE into consecutive
# chunks of at most MAX_REG_SIZE bases; smaller regions pass through.
BEGIN {
    MAX_REG_SIZE = 10000
}
{
    if ($3 - $2 < MAX_REG_SIZE) {
        # Region is already small enough; print it unchanged.
        printf("%s\t%d\t%d\n", $1, $2, $3)
    } else {
        # Emit the first chunk, then slide the window to the end.
        a = $2
        b = $2 + MAX_REG_SIZE
        printf("%s\t%d\t%d\n", $1, a, b)
        while (b < $3) {
            a = a + MAX_REG_SIZE
            b = b + MAX_REG_SIZE
            if (b > $3) {
                b = $3
            }
            printf("%s\t%d\t%d\n", $1, a, b)
        }
    }
}
Usage:
awk -f splitter.awk with_huge_regions.bed > split_regions.bed
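For example, with the default MAX_REG_SIZE of 10000, a hypothetical 25 kb input region is split into three chunks:

$ printf "chr1\t0\t25000\n" | awk -f splitter.awk
chr1	0	10000
chr1	10000	20000
chr1	20000	25000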
Nikolai; Brilliant analysis, thanks for looking at this in depth. That gives us a pretty clear path to better memory usage. Is it possible to have VarDict recognize when it can free memory in larger regions and do this automatically? The only issue with chunking up the BED files further is that we risk splitting through an indel, which I'm worried could result in either losing the indel or reporting it twice. If VarDict could recognize when it reaches a position without any variants, and could report and clear memory there, that would avoid this potential edge case. I'm not sure how doable that is code-wise, so feel free to tell me it won't work; I'm just brainstorming more general solutions to improve VarDict usability.
@chapmanb I spoke to @zhongwulai, and in this case it seems we could use overlapping regions (to avoid missing indels) in the initial call to VarDict(Java). For the regions of a single call to VarDict, var2vcf_valid.pl will emit only one copy of each variant. That said, running with the -A option will print out the duplicates; that is probably unwanted behaviour, but a separate issue. EDIT: Zhongwu will push a fix for this shortly.
The short-term fix could be just using short overlapping regions in the BED file, combined with the fix Zhongwu just pushed: https://github.com/AstraZeneca-NGS/VarDict/commit/71223e2e7808d11020facf2cc4deb9e346442593
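As a minimal sketch of that short-term fix (not part of VarDict itself; the OVERLAP value is an assumption that should exceed the longest indel of interest while staying well below MAX_REG_SIZE), the splitter above can be extended so that consecutive chunks share a margin:

#overlap_splitter.awk
# Like splitter.awk, but consecutive chunks overlap by OVERLAP bases,
# so an indel near a chunk boundary is fully contained in at least
# one chunk instead of being split across two.
BEGIN {
    MAX_REG_SIZE = 10000
    OVERLAP = 150    # assumption: longer than any expected indel
}
{
    if ($3 - $2 < MAX_REG_SIZE) {
        printf("%s\t%d\t%d\n", $1, $2, $3)
    } else {
        a = $2
        b = $2 + MAX_REG_SIZE
        printf("%s\t%d\t%d\n", $1, a, b)
        while (b < $3) {
            # Step the window forward, backing up by OVERLAP bases.
            a = b - OVERLAP
            b = a + MAX_REG_SIZE
            if (b > $3) {
                b = $3
            }
            printf("%s\t%d\t%d\n", $1, a, b)
        }
    }
}

Within a single VarDict call, var2vcf_valid.pl should then collapse the duplicate calls from the overlapping margins, per the discussion above.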
Nikolai, Miika and Zhongwu; Thanks again for all this analysis and the VarDict fixes. I pushed a version of bcbio that uses smaller chunk sizes, reducing maximum memory usage from ~5Gb to ~1Gb based on Nikolai's investigation above. Happy to have better overall memory usage in place, thank you again.
Zaal, Zhongwu and all; We're seeing some cases where VarDict has higher than expected memory usage on relatively small regions. Here is a self-contained example over a 1.5Mb region where VarDict memory usage spikes at ~6Gb:
We've been seeing issues where these memory spikes cause schedulers to kill jobs, forcing us to allocate large amounts of memory (~10Gb+/core) to VarDict jobs. Allocating that much memory works around the issue, but it is sub-optimal since it means running fewer VarDict jobs per machine, reducing the total number of jobs run concurrently.
Do you have any insight into why memory usage gets so high over this region? Are there any tweaks we can make to VarDict, or to region selection, to avoid this? Thanks for any thoughts or directions to pursue.