bioinfologics / w2rap-contigger

An Illumina PE genome contig assembler that can handle large (17Gbp), complex (hexaploid) genomes.
http://bioinfologics.github.io/the-w2rap-contigger/
MIT License

Segmentation fault in Step 5 #46

Open theaidenlab opened 4 years ago

theaidenlab commented 4 years ago

Hi,

Using a relatively large data set (~3B reads), we are experiencing a problem in Step 5 on the "step7_fix" branch. (Usually this branch is reliable and still gives the best results compared to the latest master branch.) The machine in use had 2TB of RAM, but w2rap-contigger peaked at only around 1.2TB before it crashed with "Segmentation fault". These are the last few lines before the crash:

...
Fri Aug 07 09:05:43 2020: 2132052 blobs processed, paths found for 2127214
Fri Aug 07 09:05:43 2020: 2.74 hours spent in local assemblies.
Fri Aug 07 09:05:43 2020: patching
Fri Aug 07 09:05:52 2020: 8.5 seconds used patching
Fri Aug 07 09:09:20 2020: building hb2
1.58 minutes used in new stuff 1 test
memory in use now = 554977333248
Fri Aug 07 09:42:06 2020: back from buildBigKHBVFromReads
32.8 minutes used in new stuff 2 test
peak mem usage = 626.02 GB
2.17 minutes used in new stuff 5
Fri Aug 07 09:55:13 2020: finding interesting reads
Fri Aug 07 09:56:50 2020: building dictionary
Fri Aug 07 10:02:39 2020: reducing
We need 1 passes.
Expect 2114087 keys per batch.
Provide 5285216 keys per batch.
There were 173 buffer overflows.
Fri Aug 07 12:42:39 2020: kmerizing
We need 1 passes.
Expect 3328427 keys per batch.
Provide 8321066 keys per batch.
Fri Aug 07 14:30:53 2020: cleaning
Fri Aug 07 14:34:53 2020: finding uniquely aligning edges
/mnt/ssd1/w2rap.sh: line 15:  1487 Segmentation fault      (core dumped)

We tried lowering the read count to 2.3B reads but got a similar result; only when we went down to around 1.6B reads did we manage to get past Step 5. In case it helps, I've saved the core dumps. We would really appreciate some help finding a solution to this problem and/or guidance on how to debug it further. Thanks,
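
In case it's useful to anyone looking at this, a backtrace can be pulled from one of the saved cores roughly like this (a minimal sketch; the paths and the core file name are placeholders, and the binary must be the exact one that produced the core):

# print the crashing thread's backtrace plus all per-thread backtraces from a saved core
gdb -batch -ex bt -ex 'thread apply all bt' /path/to/w2rap-contigger /path/to/core.1487 > backtrace.txt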

bjclavijo commented 3 years ago

Sorry about the delay. Not sure if you're still interested in this, but if 1.6B reads worked, I would say maybe just leave it at that. Some of the heuristics were never updated for large coverages on larger genomes, so anything above ~100x coverage pushes things too far in these scenarios, as there are a few hardcoded thresholds.
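As a quick sanity check against that ~100x ballpark, coverage can be estimated as total reads * read length / genome size; the numbers below are purely illustrative placeholders, not this dataset's values:

# rough coverage estimate: total_reads * read_length / genome_size
# all values here are illustrative placeholders
awk 'BEGIN { reads=1.0e9; read_len=150; genome=1.2e9; printf "%.0fx\n", reads*read_len/genome }'   # prints 125x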

I can't really think off the top of my head what could be going wrong, but I would bet some threshold is letting far too many elements into a collection that then overflows. If you really need this looked at, I would suggest running it under GDB and letting us know where the overflow is happening. Otherwise, I will just close this in a few days.
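
For reference, running it under GDB could look roughly like the sketch below; this assumes the standard cmake build, and <usual arguments from w2rap.sh> is a placeholder for the exact command line that crashed:

# a build with debug symbols gives far more readable backtraces (assumes the standard cmake build)
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo . && make -j 16

# launch the same command line under gdb; type "run", and after the SIGSEGV use "bt"
# (or "thread apply all bt") to see where it crashes
gdb --args ./w2rap-contigger <usual arguments from w2rap.sh>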