Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than previous tools.
MIT License

java.lang.StackOverflowError #2

Closed · cagaser closed this 4 years ago

cagaser commented 4 years ago

Hello,

I'm running the tool as follows:

JAR_FILE="UMICollapse/umicollapse.jar"
inBam="sample-AF357425-High.bam"
outBam="/sample-AF357425-High.dedup.bam"

source miniconda3/bin/activate openjdk
java -Xmx115G \
    -jar $JAR_FILE bam \
    -i ${inBam} \
    -o ${outBam}

But I'm getting the following error:

Exception in thread "main" java.lang.StackOverflowError
        at java.base/java.util.HashMap.hash(HashMap.java:339)
        at java.base/java.util.HashMap.put(HashMap.java:607)
        at java.base/java.util.HashSet.add(HashSet.java:220)
        at umicollapse.data.NgramBKTree.recursiveRemoveNearBKTree(NgramBKTree.java:70)
        at umicollapse.data.NgramBKTree.recursiveRemoveNearBKTree(NgramBKTree.java:85)
        at umicollapse.data.NgramBKTree.recursiveRemoveNearBKTree(NgramBKTree.java:85)
        at umicollapse.data.NgramBKTree.recursiveRemoveNearBKTree(NgramBKTree.java:85)
Daniel-Liu-c0deb0t commented 4 years ago

The BK-trees are becoming extremely deep, so the stack overflows from the recursive calls used to traverse them. This should be solvable by increasing the stack size (I see you have already increased the heap size) with -Xss1G, or something larger/smaller depending on your task and available memory. The default stack size is quite small and is not suitable when you are handling extremely large files. Of course, there could be an infinite-loop bug, but we will only know that if you try the program with a larger stack size.
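For example, keeping the same variables and paths from your command above, retrying with a larger stack might look like this (the exact -Xss value is just a guess; tune it to your machine):

java -Xmx115G -Xss1G \
    -jar $JAR_FILE bam \
    -i ${inBam} \
    -o ${outBam}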

If memory size becomes an issue, but a (probably much) slower speed is fine, try --data ngram (no tree pointer overhead) or --data bktree (UMIs are not duplicated) instead.
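As a sketch, a lower-memory run with the flat n-gram index could be invoked like this (same paths assumed; --data ngram can be swapped for --data bktree):

java -Xmx115G -Xss1G \
    -jar $JAR_FILE bam \
    --data ngram \
    -i ${inBam} \
    -o ${outBam}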

I am curious though, how many reads/unique UMIs do you have? It seems like an extremely large amount. Were other tools able to handle that data?

cagaser commented 4 years ago

Hi, thank you for your reply.

I just checked the number of unique UMIs and I have around 4.3 million. I was using JE-suite and UMI-tools as well, but I was having the same issue with a high memory footprint, so I was hoping I could try this tool. When I was using UMI-tools, it stopped at one location with roughly 4600 reads, and I was using 180G of memory back then.

cagaser commented 4 years ago

The tool is running perfectly after using --data ngram and -Xss1G. Thank you very much for your help!

Daniel-Liu-c0deb0t commented 4 years ago

Hey, can you please try it with just -Xss1G and no --data ngram, so it uses the default n-gram BK-tree data structure? I want to see if there is still a stack overflow error, because that may indicate a bug in the code. In my experiments, I have deduplicated 1 million+ UMIs with just 16 GB of memory on my laptop.

cagaser commented 4 years ago

Hey Daniel, just did, and it's working as well without --data ngram.

Daniel-Liu-c0deb0t commented 4 years ago

Alright. Glad it works.