AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License

Improving memory usage over small regions #64

Closed chapmanb closed 7 years ago

chapmanb commented 7 years ago

Zaal, Zhongwu and all; We're seeing some cases where VarDict has higher than expected memory usage on relatively small regions. Here is a self contained example over a 1.5Mb region where VarDict memory usage spikes at ~6Gb:

wget https://s3.amazonaws.com/chapmanb/testcases/vardict_debug_memory.tar.gz

We've been seeing issues where these types of memory spikes cause schedulers to kill jobs, forcing us to allocate large amounts of memory (~10Gb+/core) to VarDict jobs. Allocating this much memory works around the issue, but is sub-optimal since fewer VarDict jobs fit per machine, reducing the total number of jobs run concurrently.

Do you have any insight into why memory usage gets so high over this region? Are there any tweaks we can do to VarDict, or region selection, to avoid this? Thanks for any thoughts or directions to pursue.

ghost commented 7 years ago

@chapmanb ok, we will try to find a solution asap.

mjafin commented 7 years ago

Might be worth doing a quick check with realignment switched off if that is the reason

chapmanb commented 7 years ago

Miika -- thanks for the suggestion. It doesn't look like realignment is the culprit here as it still uses similar amounts of memory with -k 0 set. Happy to try other command line tweaks if it would help.

nkarulin commented 7 years ago

Hello Brad! VarDict memory usage depends mainly on the region size and the BAM coverage of the region, not on the number of regions. VarDict keeps in memory all variations contributed by reads in a particular region; once a region is processed, VarDict outputs all variations and the memory can be freed. The given test data contains regions of 1 million bp, which demands quite a lot of memory, since VarDict stores all variations in memory until region processing is complete. As an experiment, we split the 1 million bp region into 1 thousand chunks; VarDict then consumes a moderate amount of memory and outputs the same results on the given data (except for the region information).

Memory usage charts (attached): one huge region (was) vs. split into small ones (now).

The following small awk program splits huge regions into chunks of at most MAX_REG_SIZE bp.

# splitter.awk -- split BED regions longer than MAX_REG_SIZE
# into consecutive chunks of at most MAX_REG_SIZE bp
BEGIN {
    MAX_REG_SIZE = 10000
}
{
    if ($3 - $2 < MAX_REG_SIZE) {
        # %s keeps chromosome names such as "chr1" intact
        printf("%s\t%d\t%d\n", $1, $2, $3)
    } else {
        a = $2
        b = $2 + MAX_REG_SIZE
        printf("%s\t%d\t%d\n", $1, a, b)
        while (b < $3) {
            a = a + MAX_REG_SIZE
            b = b + MAX_REG_SIZE
            if (b > $3) {
                b = $3
            }
            printf("%s\t%d\t%d\n", $1, a, b)
        }
    }
}

Usage: awk -f splitter.awk with_huge_regions.bed > split_regions.bed
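For readers who prefer it, the same chunking logic can be sketched in Python. This is a hypothetical re-implementation; the function name and interface are mine, not part of VarDict.

```python
def split_region(chrom, start, end, max_size=10000):
    """Split one BED region into consecutive chunks of at most max_size bp.

    Mirrors the awk splitter above: chunks are adjacent and the last
    chunk is capped at the region end.
    """
    chunks = []
    a = start
    while a < end:
        b = min(a + max_size, end)
        chunks.append((chrom, a, b))
        a = b
    return chunks

# A 25 kb region becomes three chunks, the last one shorter.
for chrom, a, b in split_region("chr1", 0, 25000):
    print("%s\t%d\t%d" % (chrom, a, b))
```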

chapmanb commented 7 years ago

Nikolai; Brilliant analysis, thanks for looking at this in depth. That gives us a pretty clear path to better memory usage. Is it possible to have VarDict recognize when it can free memory in larger regions and do this automatically? The only issue with chunking up the BED files further is that we risk splitting through an indel, which I'm worried could result in either losing the indel or reporting it twice. If VarDict could recognize when it reaches a region without any variants and report and clear memory there, that would avoid this potential edge case. I'm not sure how doable that is code-wise, so feel free to tell me it won't work; I'm just brainstorming more general solutions to improve VarDict usability.

mjafin commented 7 years ago

@chapmanb I spoke to @zhongwulai, and in this case it seems we could have overlapping regions (to avoid missing indels) in the initial call to VarDict(Java). For the regions of a single call to VarDict, var2vcf_valid.pl will emit only one copy of the variant.
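The overlapping-region idea can be sketched like this (a minimal Python sketch; the function name and the overlap size are placeholders, and in practice the overlap should exceed the read length so an indel near a chunk boundary lands fully inside at least one chunk):

```python
def overlapping_chunks(chrom, start, end, max_size=10000, overlap=150):
    """Yield chunks of at most max_size bp, each extended by `overlap` bp
    into the next chunk so variants near a boundary appear in both."""
    a = start
    while a < end:
        b = min(a + max_size, end)
        # Extend the chunk end by the overlap, capped at the region end.
        yield (chrom, a, min(b + overlap, end))
        a = b

for chrom, a, b in overlapping_chunks("chr1", 0, 25000, max_size=10000, overlap=100):
    print("%s\t%d\t%d" % (chrom, a, b))
```

Since var2vcf_valid.pl deduplicates within a single VarDict call, a variant caught by two overlapping chunks of the same call should be emitted once.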

mjafin commented 7 years ago

That said, using the -A option will print out the duplicates. Probably unwanted behaviour, but that's a separate issue. EDIT: Zhongwu will push a fix for this shortly

mjafin commented 7 years ago

The short-term fix could be to use short overlapping regions in the BED file, combined with the fix Zhongwu just pushed: https://github.com/AstraZeneca-NGS/VarDict/commit/71223e2e7808d11020facf2cc4deb9e346442593

chapmanb commented 7 years ago

Nikolai, Miika and Zhongwu; Thanks again for all this analysis and the VarDict fixes. Based on Nikolai's investigation above, I pushed a version of bcbio that uses smaller chunk sizes, reducing peak memory from ~5Gb to ~1Gb. Happy to have better overall memory usage in place, thank you again.