Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License

Memory issues #4

Closed ju-mu closed 3 years ago

ju-mu commented 3 years ago

Hi,

Thanks a lot for this valuable tool.

I have successfully applied umicollapse bam to some smaller datasets and I am now trying to deduplicate a series of WES datasets without success. The datasets have about 200 million 150 bp paired-end reads and 1 million unique 10 bp UMIs.

My machine has 512 GB of memory and 176 cores, and no matter which combination of Java configurations I try (I also tried -XX:+UseSerialGC), I run out of memory after a few hours of runtime:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at htsjdk.samtools.BAMRecord.decodeBaseQualities(BAMRecord.java:426)
        at htsjdk.samtools.BAMRecord.getBaseQualities(BAMRecord.java:382)
        at umicollapse.util.SAMRead.<init>(SAMRead.java:18)
        at umicollapse.main.DeduplicateSAM.deduplicateAndMerge(DeduplicateSAM.java:57)
        at umicollapse.main.Main.main(Main.java:169)

I am running umicollapse as follows:

/usr/lib/jvm/java-11-openjdk-11.0.8.10-1.el7.x86_64/bin/java -server -Xmx100G -Xms100G -Xss1G -jar umicollapse.jar $@ bam -i x_sorted.bam -o x_ucoll.bam -k 1 --algo dir --merge avgqual

I have also tried other data structures such as ngram and different deduplication algorithms without luck; the number of allowed substitution edits was always kept at 1.

Of note, the datasets have some UMIs containing degenerate bases ("N"). Could this be a problem, and how are they handled?

Many thanks

Daniel-Liu-c0deb0t commented 3 years ago

Hey, I don't see anything immediately wrong with how the tool is run. However, 1 million UMIs should really be quite easy for this tool to handle, so this is a strange case I'll have to look into. The real problem here is that the deduplication code isn't even reached: the error occurs while reading in the file and putting the reads into a hash table. The only issue I can think of for running out of memory is if htsjdk BAMRecords take up too much space, but not all 200 million reads are kept in memory (only the "best" read per unique UMI).
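
Roughly, the read-in step works like the following sketch (illustrative only, not the actual code; umiFromReadName is a hypothetical helper that assumes the UMI is appended after the last underscore of the read name, and average quality stands in for your --merge avgqual setting):

    import htsjdk.samtools.SAMRecord;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    public class ReadInSketch {
        // Hypothetical helper: assumes the UMI is appended after the last "_" in
        // the read name (as produced by umi_tools extract, for example).
        static String umiFromReadName(String name) {
            return name.substring(name.lastIndexOf('_') + 1);
        }

        static double avgQual(SAMRecord rec) {
            byte[] quals = rec.getBaseQualities(); // the call that threw OOM in the trace above
            long sum = 0;
            for (byte q : quals) sum += q;
            return quals.length == 0 ? 0.0 : (double) sum / quals.length;
        }

        public static void main(String[] args) throws Exception {
            // Key: alignment coordinate + UMI; value: the single "best" read seen so far.
            Map<String, SAMRecord> best = new HashMap<>();
            try (SamReader reader = SamReaderFactory.makeDefault().open(new File("in.bam"))) {
                for (SAMRecord rec : reader) {
                    String key = rec.getReferenceIndex() + ":" + rec.getAlignmentStart()
                            + ":" + umiFromReadName(rec.getReadName());
                    SAMRecord cur = best.get(key);
                    // Keep only the highest average-quality read per key, so memory
                    // scales with unique (coordinate, UMI) pairs, not total reads.
                    if (cur == null || avgQual(rec) > avgQual(cur)) best.put(key, rec);
                }
            }
            System.out.println("unique (coordinate, UMI) pairs kept: " + best.size());
        }
    }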

It is fine to have degenerate bases as long as the bases are ATCGN, as other bases are not supported right now. The behavior for N is to just treat it as being different from all the other bases. If there are any problems with that, there would be a different error.
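
To illustrate what that means for UMI comparison, here is a minimal sketch, not the tool's internal encoding:

    // Minimal illustration, not UMICollapse's internal encoding: a character-wise
    // Hamming distance over the ACGTN alphabet. An N simply mismatches every other
    // base here; whether two Ns count as a match is an assumption of this sketch.
    final class UmiDistanceSketch {
        static int hamming(String a, String b) {
            int dist = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) dist++;
            }
            return dist;
        }
    }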

Can you set -Xms100G to -Xms1G and see how much the memory allocation grows as it runs?
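
For example, keeping everything else in your command the same:

    java -server -Xmx100G -Xms1G -Xss1G -jar umicollapse.jar bam -i x_sorted.bam -o x_ucoll.bam -k 1 --algo dir --merge avgqual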

ju-mu commented 3 years ago

Thanks for your quick feedback!

I set -Xms to 1G and started another attempt on a slightly smaller sample. Here is what happened:

Maybe it is related to this specific set of samples, as there does not seem to be a lot of duplication in this sample. Here is the umi_tools group output:

Input Reads: 78580950, Read pairs: 78580950, Chimeric read pair: 473219, Read 2 unmapped: 228112, Read 1 unmapped: 13375
Number of reads out: 156952943, Number of groups: 68289445
Total number of positions deduplicated: 67679688
Mean number of unique UMIs per position: 1.01
Max. number of unique UMIs per position: 16

...and here is a histogram of the final position unique UMI family sizes:

1: 59687689
2: 15097086
3: 2767464
4: 456504
5: 71675
6: 11778
7: 1799
8: 288
9: 63
10: 10

So if the best read per unique UMI and position is kept as an htsjdk BAMRecord, most reads would probably still be kept in memory?

Browsing through the log files on the cluster, I noticed that the errors can vary slightly, probably depending on the Java memory configuration, e.g.:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
        at htsjdk.samtools.BAMRecord.decodeReadName(BAMRecord.java:440)
        at htsjdk.samtools.BAMRecord.getReadName(BAMRecord.java:252)
        at umicollapse.util.SAMRead.getUMI(SAMRead.java:34)
        at umicollapse.main.DeduplicateSAM.deduplicateAndMerge(DeduplicateSAM.java:58)
        at umicollapse.main.Main.main(Main.java:169)

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at umicollapse.main.DeduplicateSAM.deduplicateAndMerge(DeduplicateSAM.java:57)
        at umicollapse.main.Main.main(Main.java:169)

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at htsjdk.samtools.BAMRecord.decodeBaseQualities(BAMRecord.java:426)
        at htsjdk.samtools.BAMRecord.getBaseQualities(BAMRecord.java:382)
        at umicollapse.util.SAMRead.<init>(SAMRead.java:18)
        at umicollapse.main.DeduplicateSAM.deduplicateAndMerge(DeduplicateSAM.java:57)
        at umicollapse.main.Main.main(Main.java:169)

Daniel-Liu-c0deb0t commented 3 years ago

Hey, thank you for giving me so much information! Sorry for the wait, I've been working on this problem.

I am not sure why it is taking up so many threads, as by default it should be single-threaded. Maybe the JVM GC is using extra threads.

I did some profiling on smaller examples, and I believe that storing all the reads in memory is just too inefficient. Therefore, I created a new option called --two-pass. If you specify it, then instead of reading all of the reads into memory, the tool deduplicates the UMIs for a single alignment coordinate as soon as all of the reads for that coordinate have been read. This should save a lot of memory, at the cost of a little speed due to the two passes needed. The downside of this approach is that the input SAM/BAM file should be approximately sorted by alignment position (not a strict requirement), so that reads with the same alignment coordinate are located at around the same spot in the file. That way, we don't have to keep those reads around for long and can deduplicate and free them as soon as we know there are no more reads for that alignment coordinate.
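
To make the idea concrete, here is a rough sketch of the two-pass scheme using htsjdk (illustrative only, not the actual DeduplicateSAM code; dedupAndWrite is just a placeholder for the UMI clustering and output step):

    import htsjdk.samtools.SAMRecord;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;

    import java.io.File;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TwoPassSketch {
        // Pack reference index and alignment start into one long key.
        static long coord(SAMRecord r) {
            return ((long) r.getReferenceIndex() << 32) | (r.getAlignmentStart() & 0xFFFFFFFFL);
        }

        public static void main(String[] args) throws Exception {
            File bam = new File("in.bam");

            // Pass 1: count how many reads map to each alignment coordinate.
            // Note that this counting map itself stays in memory for the whole run.
            Map<Long, Integer> remaining = new HashMap<>();
            try (SamReader reader = SamReaderFactory.makeDefault().open(bam)) {
                for (SAMRecord rec : reader) {
                    remaining.merge(coord(rec), 1, Integer::sum);
                }
            }

            // Pass 2: buffer reads per coordinate; as soon as the last read for a
            // coordinate arrives, deduplicate that group and free its memory.
            Map<Long, List<SAMRecord>> buffer = new HashMap<>();
            try (SamReader reader = SamReaderFactory.makeDefault().open(bam)) {
                for (SAMRecord rec : reader) {
                    long c = coord(rec);
                    buffer.computeIfAbsent(c, k -> new ArrayList<>()).add(rec);
                    int left = remaining.merge(c, -1, Integer::sum);
                    if (left == 0) {
                        dedupAndWrite(buffer.remove(c)); // placeholder for UMI clustering + output
                        remaining.remove(c);
                    }
                }
            }
        }

        static void dedupAndWrite(List<SAMRecord> group) {
            // Placeholder: cluster the UMIs within this group and write the merged reads.
        }
    }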

I am kind of disappointed that UMI-tools (written in Python) was somehow able to handle this better than Java, which should be more memory efficient.

Anyways, please report back whether it worked for you and if you experience other bugs.

ju-mu commented 3 years ago

Fantastic, thanks a lot for the patch!

Assuming coordinate-sorted BAM input is very reasonable imho, and it is required by many BAM-consuming bioinformatics tools. I guess this also makes it faster to write out results without having to re-sort?

I have now successfully run a few smaller samples and also two of the larger samples which crashed before with your new version and --two-pass enabled.

Initially I used [bam, -i, in.bam, -o, out.bam, -k, 1, --algo, dir, --merge, avgqual, --two-pass] together with -Xms8G -Xmx20G -Xss20M, and the smaller of the two samples completed successfully:

Number of input reads   198463091
Number of unique alignment positions    87265297
Average number of UMIs per alignment position   1.9714385433192303
Max number of UMIs over all alignment positions 26288
Number of reads after deduplicating     171238198
UMI collapsing finished in 4145.775 seconds!

Looking at htop, most of the ~170 threads were busy during the run, most probably due to the GC as you mentioned. At least I was able to run umicollapse single-threaded with -XX:+UseSerialGC.

However, the ~20% larger sample, which ran in parallel (without SGE), crashed again:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at htsjdk.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:548)
        at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:532)
        at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
        at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
        at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
        at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:331)
        at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
        at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:421)
        at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:394)
        at htsjdk.samtools.util.BinaryCodec.readByteBuffer(BinaryCodec.java:507)
        at htsjdk.samtools.util.BinaryCodec.readInt(BinaryCodec.java:518)
        at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:261)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:866)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:840)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:834)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:802)
        at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:574)
        at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:553)
        at umicollapse.main.DeduplicateSAM.deduplicateAndMergeTwoPass(DeduplicateSAM.java:139)
        at umicollapse.main.Main.main(Main.java:178)
Number of input reads   242393723
Number of unique alignment positions    184028534
Average number of UMIs per alignment position   1.175781120986379
Max number of UMIs over all alignment positions 30056
Number of reads after deduplicating     215583429
UMI collapsing finished in 4761.479 seconds!

I am not a Java expert, but maybe there is still room for improving the memory consumption? If only the reads at a single position are kept in memory concurrently, the total memory consumption should be only a few MB.

ju-mu commented 3 years ago

At least for me the issue is fixed; it is now mostly a matter of finding the right Java configuration for larger samples.

Daniel-Liu-c0deb0t commented 3 years ago

Thank you for testing the two pass version out! I am glad it works now.

I am not sure why Java GC wants to hog so many threads. At least serial GC seems to be able to force it to be single-threaded.

I've added a note to the readme describing how it might be a good idea to use a small initial heap size and an extremely large max heap size, so Java can dynamically grow the heap if needed. This seems to work for your case. Originally I didn't want this because it is slower than using a fixed heap size. With all of these Java GC settings, I kind of wish there were a way to do manual memory management.

In terms of total memory usage, there is a large hash map that needs to be kept which essentially records the number of reads per alignment coordinate, so we know we can free up memory as soon as all of the reads for an alignment coordinate have been read in. I just updated the code to slim down this hash map a little more, so the most recent version should be more memory efficient. Please try it out if you can.

Daniel-Liu-c0deb0t commented 3 years ago

By the way, I realized that you are deduplicating PE reads. The current implementation treats forward and reverse reads the same way and basically just deduplicates them separately. I am not sure if this is the behavior you want, as I know UMI-tools has some more options for this.

I also looked at how UMI-tools keeps reads in memory, and it seems that it expects reads sorted by alignment coordinate and flushes its output whenever there is a large enough gap in alignment coordinates.

ju-mu commented 3 years ago

Hi, thanks a lot for the heads up!

I am currently comparing different workflows for a simple UMI deduplication pipeline (rather than consensus read extraction). umi_tools dedup has the disadvantage that it cannot deduplicate based on average quality, which is preferred over MAPQ for downstream variant calling. So the only option I see would be umi_tools group followed by Picard, which would be a rather slow implementation. Of course, the simple task of sequentially selecting the read with the highest average quality per UMI group could alternatively be achieved by a small custom script.
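
For illustration, such a script could look roughly like this with htsjdk (a sketch only; it assumes the group id ends up in a BAM tag such as UG, as umi_tools group can write, and it ignores mate handling for brevity):

    import htsjdk.samtools.SAMFileWriter;
    import htsjdk.samtools.SAMFileWriterFactory;
    import htsjdk.samtools.SAMRecord;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    public class BestPerGroup {
        static double avgQual(SAMRecord rec) {
            byte[] quals = rec.getBaseQualities();
            long sum = 0;
            for (byte q : quals) sum += q;
            return quals.length == 0 ? 0.0 : (double) sum / quals.length;
        }

        public static void main(String[] args) throws Exception {
            File in = new File(args[0]);
            File out = new File(args[1]);
            // One "best" record per UMI group id; note this keeps one record per
            // group in memory for the whole run.
            Map<Object, SAMRecord> best = new HashMap<>();
            try (SamReader reader = SamReaderFactory.makeDefault().open(in)) {
                for (SAMRecord rec : reader) {
                    Object group = rec.getAttribute("UG"); // group-id tag; adjust to your pipeline
                    if (group == null) continue;
                    SAMRecord cur = best.get(group);
                    if (cur == null || avgQual(rec) > avgQual(cur)) best.put(group, rec);
                }
                try (SAMFileWriter writer = new SAMFileWriterFactory()
                        .makeBAMWriter(reader.getFileHeader(), false, out)) {
                    for (SAMRecord rec : best.values()) writer.addAlignment(rec);
                }
            }
        }
    }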

In fact, it would be a fantastic improvement if umicollapse could optionally use TLEN to group read pairs like umi_tools group does, which would make it a fast replacement for the umi_tools -> Picard pipeline mentioned above.
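
Just to illustrate the idea, the grouping key could hypothetically also include TLEN (this is not existing UMICollapse code, just a sketch of the requested behavior):

    import htsjdk.samtools.SAMRecord;

    // Hypothetical sketch of the requested behavior, not existing UMICollapse code:
    // include the template length (TLEN) in the grouping key so both ends of a read
    // pair end up in the same group, similar to how umi_tools group handles pairs.
    final class PairAwareKey {
        static String of(SAMRecord rec, String umi) {
            return rec.getReferenceIndex() + ":"
                    + rec.getAlignmentStart() + ":"
                    + Math.abs(rec.getInferredInsertSize()) // TLEN
                    + ":" + umi;
        }
    }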

Daniel-Liu-c0deb0t commented 3 years ago

Do you want to open a new issue for the TLEN feature (and possibly describe the exact functionality you want)? I am interested in implementing it if it does not require changing the current code too much. Another thing I am considering implementing is only marking duplicates instead of deleting them. I don't know if this will be very useful, though.