fo40225 opened this issue 3 months ago
Do you really need almost 2 terabytes of heap space?
-Xmx1794G
This is probably what is killing your process. Besides, we don't recommend using experimental Spark tools for production or research purposes unless we say that it is fine to do so.
You've misunderstood the issue. My computer has 2 TB of memory, so -Xmx1794G is not the cause of the problem.
When the original BAM file (447 GB) is used as input, SortSamSpark runs successfully. However, when the input is the BAM file produced by filtering with samtools view -e 'length(seq)>=10000' (434 GB), SortSamSpark crashes (a sketch of the filtering command is shown below).

The file used is a test file, not for production or research purposes. I'm reporting this issue in the hope of improving GATK.
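For reference, the filtering step described in the comment above would look roughly like the following. This is a sketch, not the reporter's exact command: the input and output file names are hypothetical, and the -e expression filter requires a reasonably recent samtools.

```bash
# Hedged sketch of the filtering step (file names are hypothetical).
# -e applies an expression filter; length(seq) is the read's sequence length.
# -b writes BAM output, -@ 8 adds compression threads, -o names the output.
samtools view -b -@ 8 \
    -e 'length(seq)>=10000' \
    -o long_reads_only.bam \
    original.bam
```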
@fo40225 that's interesting. The original file contains the reads that cause the filtered file to fail? I would have expected it to fail in both cases if it was an issue with the read lengths.

@gokalpcelik is correct about the Spark tools - they're Beta / Experimental tools, so we don't expect them to be stable on all inputs. You're probably running into an edge case we haven't seen before.
This is definitely a bug in the way serialization is handled, but it's hard to tell where the issue is exactly. Spark is trying to serialize something into a byte buffer, but it's trying to put more bytes in than can fit in a Java array.

If you could produce a very small BAM file that reliably reproduces this problem, we might be able to investigate it, but I don't have the bandwidth to really look into this right now; Spark tools are a low priority at the moment. I would recommend sorting the file with the non-Spark SortSam for now. I'm sorry I don't have a better answer, but dealing with serialization issues is very often a huge can of worms.
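One way to attempt the small reproducer requested above would be to subsample the failing long-read BAM and check whether the crash survives. This is only a sketch under assumptions: the file names are hypothetical, and a random subsample is not guaranteed to still trigger the bug.

```bash
# Hedged sketch: keep ~0.1% of the reads from the failing long-read BAM
# (seed 42, fraction 0.001 via -s 42.001), then re-run the sort on the
# small file to see whether the crash still reproduces.
# File names are hypothetical placeholders.
samtools view -b -s 42.001 -o tiny_long_reads.bam long_reads_only.bam

gatk SortSamSpark \
    -I tiny_long_reads.bam \
    -O tiny_sorted.bam \
    --sort-order coordinate
```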
@jonn-smith The original BAM (which still contains short reads) runs normally. The filtered BAM (long reads only) crashes.
@lbergelson Is there a way to keep the file in --conf spark.local.dir=./tmp? Perhaps I can extract a minimal BAM file that reliably reproduces this problem from it.
Bug Report
Affected tool(s) or class(es)
Tool/class name(s), special parameters?
SortSamSpark --sort-order coordinate
Affected version(s)
4.4.0.0
Description
Describe the problem below. Provide screenshots, stacktrace, and logs where appropriate.
An error occurs when using SortSamSpark to sort a large BAM file that contains only long reads (90x human WGS, minimum read length > 10 kbp). However, if the large BAM file contains short reads, it runs normally.
Steps to reproduce
Tell us how to reproduce this issue. If possible, include command lines that reproduce the problem. (The support team may follow up to ask you to upload data to reproduce the issue.)
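The exact command line is not included in the report. A plausible invocation, given the tool, sort order, heap size, and Spark property mentioned elsewhere in this thread, might look like the following; the file names are hypothetical placeholders.

```bash
# Hypothetical reconstruction of the failing run; file names are placeholders.
# The heap size (-Xmx1794G), sort order, and spark.local.dir setting are taken
# from this report and the comments above.
gatk --java-options "-Xmx1794G" SortSamSpark \
    -I long_reads_only.bam \
    -O long_reads_only.sorted.bam \
    --sort-order coordinate \
    --conf spark.local.dir=./tmp
```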
Expected behavior
Tell us what should happen
Output a sorted BAM file.
Actual behavior
Tell us what happens instead
java.lang.OutOfMemoryError: Required array length ? is too large
The last lines of the log file.
The first lines of the log file: