GATB / bcalm

compacted de Bruijn graph construction in low memory
MIT License
99 stars 20 forks source link

Memory error with many reference genomes as input #58

Open dpellow opened 4 years ago

dpellow commented 4 years ago

I'm trying to run BCALM on a very large database and I keep getting errors that look like:

mem: [ 100, 91622, 92822] MB EXCEPTION: Pool reserve() failed for 576716800000 (+ 1600 extra for 36 cores) bytes

Is there some way to get around this? More/fewer cores? Lower max-memory (I don't have much more RAM available)? Large value for -max-disk parameter?

Thanks!

rchikhi commented 4 years ago

how many distinct kmers reported at the beginning of the output?

dpellow commented 4 years ago

I don't see k-mers reported in the std output. In the .hf5 file there's a field that says estimated k-mer number: 69880669047 - just under 70 billion. (k=31, would trying a lower number help?)

rchikhi commented 4 years ago

can you please paste the full bcalm output? also what's the genome size / fastq file size? Bcalm on 70 billion distinct kmers is no small feat (what organism is that?) and 92 GB ram would be understandably low..

dpellow commented 4 years ago

I've pasted it below. It's for a metagenomic simulation - the concatenated reference genomes are ~70Gbp, so this matches the k-mer count assuming almost all are unique. From what I understood the error occurred with ~550GB RAM which is what I passed to the -max-memory flag

[Approximating frequencies of minimizers ]  100  %   elapsed:   5 min 1  sec   remaining:   0 min 0  sec   cpu: 100.0 %   mem: [ 321,  321,  516] MB 
[DSK: Collecting stats on all_download...]  100  %   elapsed:   0 min 2  sec   remaining:   0 min 0  sec   cpu: 100.0 %   mem: [  92,   92,  516] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  20   %   elapsed:  10 min 38 sec   remaining:  42 min 30 sec   cpu: 1589.0 %   mem: [100293, 100293, 100293] 
[DSK: Pass 1/1, Step 1: partitioning     ]  21   %   elapsed:  11 min 4  sec   remaining:  41 min 37 sec   cpu: 1601.2 %   mem: [101318, 101318, 101318] 
[DSK: Pass 1/1, Step 1: partitioning     ]  22   %   elapsed:  11 min 30 sec   remaining:  40 min 46 sec   cpu: 1612.5 %   mem: [102247, 102247, 102247] 
[DSK: Pass 1/1, Step 1: partitioning     ]  23   %   elapsed:  11 min 56 sec   remaining:  39 min 57 sec   cpu: 1623.0 %   mem: [103117, 103117, 103117] 
[DSK: Pass 1/1, Step 1: partitioning     ]  24   %   elapsed:  12 min 22 sec   remaining:  39 min 9  sec   cpu: 1632.6 %   mem: [103934, 103934, 103934] 
[DSK: Pass 1/1, Step 1: partitioning     ]  25   %   elapsed:  12 min 48 sec   remaining:  38 min 24 sec   cpu: 1641.8 %   mem: [104709, 104709, 104709] 
[DSK: Pass 1/1, Step 1: partitioning     ]  26   %   elapsed:  13 min 14 sec   remaining:  37 min 41 sec   cpu: 1650.3 %   mem: [105418, 105418, 105418] 
[DSK: Pass 1/1, Step 1: partitioning     ]  27   %   elapsed:  13 min 41 sec   remaining:  36 min 58 sec   cpu: 1658.2 %   mem: [106074, 106074, 106074] 
[DSK: Pass 1/1, Step 1: partitioning     ]  28   %   elapsed:  14 min 7  sec   remaining:  36 min 17 sec   cpu: 1665.7 %   mem: [106658, 106658, 106658] 
[DSK: Pass 1/1, Step 1: partitioning     ]  29   %   elapsed:  14 min 33 sec   remaining:  35 min 37 sec   cpu: 1672.7 %   mem: [107197, 107197, 107197] 
[DSK: Pass 1/1, Step 1: partitioning     ]  30   %   elapsed:  14 min 59 sec   remaining:  34 min 58 sec   cpu: 1679.3 %   mem: [107646, 107646, 107646] 
[DSK: Pass 1/1, Step 1: partitioning     ]  31   %   elapsed:  15 min 25 sec   remaining:  34 min 19 sec   cpu: 1685.5 %   mem: [108080, 108080, 108080] 
[DSK: Pass 1/1, Step 1: partitioning     ]  32   %   elapsed:  15 min 51 sec   remaining:  33 min 41 sec   cpu: 1691.4 %   mem: [108456, 108456, 108456] 
[DSK: Pass 1/1, Step 1: partitioning     ]  33   %   elapsed:  16 min 18 sec   remaining:  33 min 5  sec   cpu: 1697.0 %   mem: [108779, 108779, 108779] 
[DSK: Pass 1/1, Step 1: partitioning     ]  34   %   elapsed:  16 min 43 sec   remaining:  32 min 28 sec   cpu: 1702.2 %   mem: [109079, 109079, 109079] 
[DSK: Pass 1/1, Step 1: partitioning     ]  35   %   elapsed:  17 min 10 sec   remaining:  31 min 52 sec   cpu: 1707.2 %   mem: [109336, 109336, 109336] 
[DSK: Pass 1/1, Step 1: partitioning     ]  36   %   elapsed:  17 min 36 sec   remaining:  31 min 17 sec   cpu: 1712.0 %   mem: [109566, 109566, 109566] 
[DSK: Pass 1/1, Step 1: partitioning     ]  37   %   elapsed:  18 min 2  sec   remaining:  30 min 42 sec   cpu: 1716.5 %   mem: [109763, 109763, 109763] 
[DSK: Pass 1/1, Step 1: partitioning     ]  38   %   elapsed:  18 min 28 sec   remaining:  30 min 8  sec   cpu: 1720.8 %   mem: [109920, 109920, 109920] 
[DSK: Pass 1/1, Step 1: partitioning     ]  39   %   elapsed:  18 min 54 sec   remaining:  29 min 34 sec   cpu: 1724.9 %   mem: [110064, 110064, 110064] 
[DSK: Pass 1/1, Step 1: partitioning     ]  40   %   elapsed:  19 min 20 sec   remaining:  29 min 0  sec   cpu: 1728.8 %   mem: [110191, 110191, 110191] 
[DSK: Pass 1/1, Step 1: partitioning     ]  41   %   elapsed:  19 min 48 sec   remaining:  28 min 30 sec   cpu: 1729.1 %   mem: [102528, 110191, 110194] 
[DSK: Pass 1/1, Step 1: partitioning     ]  42   %   elapsed:  20 min 19 sec   remaining:  28 min 3  sec   cpu: 1726.8 %   mem: [97109, 110191, 110194] $
[DSK: Pass 1/1, Step 1: partitioning     ]  43   %   elapsed:  20 min 50 sec   remaining:  27 min 37 sec   cpu: 1723.6 %   mem: [97199, 110191, 110194] $
[DSK: Pass 1/1, Step 1: partitioning     ]  44   %   elapsed:  21 min 21 sec   remaining:  27 min 10 sec   cpu: 1720.6 %   mem: [97268, 110191, 110194] M
[DSK: Pass 1/1, Step 1: partitioning     ]  45   %   elapsed:  21 min 54 sec   remaining:  26 min 46 sec   cpu: 1715.4 %   mem: [91628, 110191, 110194] M
[DSK: Pass 1/1, Step 1: partitioning     ]  46   %   elapsed:  22 min 30 sec   remaining:  26 min 25 sec   cpu: 1706.1 %   mem: [80606, 110191, 110194] M
[DSK: Pass 1/1, Step 1: partitioning     ]  47   %   elapsed:  23 min 8  sec   remaining:  26 min 5  sec   cpu: 1695.0 %   mem: [80651, 110191, 110194] M
[DSK: Pass 1/1, Step 1: partitioning     ]  48   %   elapsed:  23 min 53 sec   remaining:  25 min 52 sec   cpu: 1677.5 %   mem: [57260, 110191, 110194] M
[DSK: Pass 1/1, Step 1: partitioning     ]  49   %   elapsed:  24 min 56 sec   remaining:  25 min 57 sec   cpu: 1639.9 %   mem: [45398, 110191, 110194] M
[DSK: Pass 1/1, Step 2: counting kmers   ]  50   %   elapsed:  27 min 38 sec   remaining:  27 min 38 sec   cpu: 1509.7 %   mem: [  99, 110191, 110194] MB

EXCEPTION: Pool reserve() failed for 629145600000 (+ 1600 extra for 36 cores) bytes
rchikhi commented 4 years ago

ahh thanks, there is something really odd. What does the input look like? A multifasta file where each entry is a reference genome? What's the longest sequence in there?

dpellow commented 4 years ago

Yes exactly - it's a multifasta of reference genomes. The longest is 16Mbp

rchikhi commented 4 years ago

that sounds quite reasonable but I never tried that sort of counting. The k-mer counter is really optimized for counting reads, or single reference genomes, but not really collections of reference genomes of various lengths, and probably its guess for internal memory allocation parameters is thrown off by the various genome sizes. What -max-memory value did you pass, 500000?

dpellow commented 4 years ago

Yes, I've tried 500GB-650GB, and also various values for max-disk. Are there parameters that you think would get it to work? Or another assembler that you would recommend?

rchikhi commented 4 years ago

Yeah, I'm sorry it didn't work, I could look into it sometimes. Meanwhile you could give https://github.com/pmelsted/bifrost a try for outputting unitigs!

dpellow commented 4 years ago

Thanks

rchikhi commented 4 years ago

I'm leaving this open, would like to make bcalm able to process collections of genomes

dpellow commented 4 years ago

OK - It could be a good idea as a number of recent k-mer indexing/querying tools recommend using bcalm, not necessarily in the usual context of assembly of sequencing reads. https://github.com/kamimrcht/REINDEER https://github.com/Malfoy/Blight

rchikhi commented 4 years ago

REINDEER is not limited by the bug you point out, as it makes BCALM process one genome (= one color) after the other, i.e. smaller chunks at a time.

pashadag commented 4 years ago

David, This may be a use-case for TwoPaCo:

https://github.com/medvedevgroup/TwoPaCo

dpellow commented 4 years ago

David, This may be a use-case for TwoPaCo:

Thanks! I'll look into it.

rchikhi commented 4 years ago

I may have identified the issue with BCALM taking too much memory with many ref genomes. It can be temp-fixed by doing this change.

diff --git a/gatb-core/src/gatb/kmer/impl/SortingCountAlgorithm.cpp b/gatb-core/src/gatb/kmer/impl/SortingCountAlgorithm.cpp
index c87c6f8..3e05d36 100644
--- a/gatb-core/src/gatb/kmer/impl/SortingCountAlgorithm.cpp
+++ b/gatb-core/src/gatb/kmer/impl/SortingCountAlgorithm.cpp
@@ -1231,7 +1231,7 @@ void SortingCountAlgorithm<span>::fillPartitions (size_t pass, Iterator<Sequence
                        /** We launch the iteration of the sequences iterator with the created
                         * functors. */

-                       size_t groupSize = 1000;
+                       size_t groupSize = 10;
                        bool deleteSynchro = true;

(reduces mem by 5x on my end) Will further study a more robust change, but meanwhile @dpellow you could try it if you'd like.