Christina-hshi / SH-assembly

K-mer counting with low memory consumption enables de novo assembly of 106x human sequence data in 2.7 hours

Segmentation fault running CQF-deNoise #2

Closed schmeing closed 4 years ago

schmeing commented 4 years ago

Hi,

I tried to run SH-assembly on the public dataset SRR3191692 and got Segmentation fault (core dumped), with no further output, when calling CQF-deNoise. I called it with the following parameters:

CQF-deNoise -i input_list.txt -o cqf_count -k 47 -N 400000000 -f g -n 5000000

and input_list.txt looks like this:

SRR3191692_1.fastq.gz
SRR3191692_2.fastq.gz

Christina-hshi commented 4 years ago

Hi Schmeing, you need to specify either <alpha> or <errorProfile>. A detailed manual for CQF-deNoise can be found at https://github.com/Christina-hshi/CQF-deNoise.git. I have also updated the source code slightly so that it reports an error if neither of them is specified. Thanks!

schmeing commented 4 years ago

Thank you for the very fast response. I updated to the new version and, with the help of the new README section, changed my program call to:

CQF-deNoise -i input_list.txt -o k28.cqf -k 28 -N 3808479766 -n 4577116 -e 0.002477091 -f g

Unfortunately, I still get a segmentation fault after a short while:

CQF-deNoise settings:
qb: 24
hb: 32
thread_num: 16
K: 28
number of true k-mers: 4577116
desired overall false removal probability: 2.18478e-07
number of times deNoise being called: 374
deNoise after processing all k-mers: false
number of distinct k-mers triggering deNoise: 5258459
#deNoise rounds leading to the same size of CQF: [374, 0]
Wrong removal rate leading to same #deNoise rounds: [3.68928e-58, +oo]
2020-09-09.08:36:52
Start to build K-mer spectrum...
2020-09-09.08:36:56
Ready for DeNoise: ndistinct_elts/total_elts.7573925/71135289 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4609160/68170524 ndistinct_true_elts.4577116
2020-09-09.08:36:56
2020-09-09.08:36:59
Ready for DeNoise: ndistinct_elts/total_elts.6553274/106145794 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4627828/104220348 ndistinct_true_elts.4577116
2020-09-09.08:36:59
2020-09-09.08:37:02
Ready for DeNoise: ndistinct_elts/total_elts.6910576/144728144 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4646357/142463925 ndistinct_true_elts.4577116
2020-09-09.08:37:02
2020-09-09.08:37:04
Ready for DeNoise: ndistinct_elts/total_elts.6292392/170284176 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4656192/168647976 ndistinct_true_elts.4577116
2020-09-09.08:37:04
2020-09-09.08:37:07
Ready for DeNoise: ndistinct_elts/total_elts.6038743/196284118 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4665271/194910646 ndistinct_true_elts.4577116
2020-09-09.08:37:07
2020-09-09.08:37:09
Ready for DeNoise: ndistinct_elts/total_elts.6378773/220077381 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4675308/218373916 ndistinct_true_elts.4577116
2020-09-09.08:37:09
2020-09-09.08:37:11
Ready for DeNoise: ndistinct_elts/total_elts.7095994/258395102 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4696749/255995857 ndistinct_true_elts.4577116
2020-09-09.08:37:11
2020-09-09.08:37:14
Ready for DeNoise: ndistinct_elts/total_elts.6207176/288511978 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4709913/287014715 ndistinct_true_elts.4577116
2020-09-09.08:37:14
2020-09-09.08:37:16
Ready for DeNoise: ndistinct_elts/total_elts.6916633/322030779 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4725647/319839793 ndistinct_true_elts.4577116
2020-09-09.08:37:16
2020-09-09.08:37:19
Ready for DeNoise: ndistinct_elts/total_elts.6961311/354845802 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4741492/352625983 ndistinct_true_elts.4577116
2020-09-09.08:37:19
2020-09-09.08:37:21
Ready for DeNoise: ndistinct_elts/total_elts.6461932/387630739 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4755004/385923811 ndistinct_true_elts.4577116
2020-09-09.08:37:21
2020-09-09.08:37:23
Ready for DeNoise: ndistinct_elts/total_elts.6334529/418425714 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4768284/416859469 ndistinct_true_elts.4577116
2020-09-09.08:37:23
2020-09-09.08:37:26
Ready for DeNoise: ndistinct_elts/total_elts.6793990/446853143 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4780128/444839281 ndistinct_true_elts.4577116
2020-09-09.08:37:26
2020-09-09.08:37:28
Ready for DeNoise: ndistinct_elts/total_elts.6561868/472339491 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4789162/470566785 ndistinct_true_elts.4577116
2020-09-09.08:37:28
2020-09-09.08:37:30
Ready for DeNoise: ndistinct_elts/total_elts.6035997/495570064 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4796807/494330874 ndistinct_true_elts.4577116
2020-09-09.08:37:31
2020-09-09.08:37:33
Ready for DeNoise: ndistinct_elts/total_elts.6602024/521825324 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4805353/520028653 ndistinct_true_elts.4577116
2020-09-09.08:37:33
2020-09-09.08:37:35
Ready for DeNoise: ndistinct_elts/total_elts.6490087/547519613 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4815725/545845251 ndistinct_true_elts.4577116
2020-09-09.08:37:35
2020-09-09.08:37:37
Ready for DeNoise: ndistinct_elts/total_elts.5980808/573346330 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4823757/572189279 ndistinct_true_elts.4577116
2020-09-09.08:37:37
Segmentation fault (core dumped)

Christina-hshi commented 4 years ago

The segmentation fault is most likely caused by exhausting the space allocated to the CQF. The size of the CQF is estimated from the number of unique true k-mers, the number of unique false k-mers, and the mean coverage of true k-mers, all of which are derived from the parameters given. If these estimates are far from the actual values, the CQF will be sized either too large or too small. When it is too small, its space is used up and further insertions cause a segmentation fault.
May I ask how you determined the parameter values? Did you use ntCard? If so, would you mind showing the first 5 lines of the output file it produced?
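
A rough way to look for this kind of mismatch in the log format shown above is to compare the number of distinct k-mers that survive each deNoise round with the assumed number of true k-mers. The sketch below is only an illustration: the function name is hypothetical, and reading a steadily growing surplus as a sign that the estimate is too small is a heuristic assumption, not a documented CQF-deNoise feature.

```python
import re

# Matches the "Finished DeNoise" lines exactly as they appear in the log above.
FINISHED = re.compile(
    r"Finished DeNoise: ndistinct_elts/total_elts\.(\d+)/(\d+) "
    r"ndistinct_true_elts\.(\d+)"
)

def surplus_after_denoise(log_text):
    """Return (distinct_after, assumed_true, surplus) for every deNoise round."""
    rows = []
    for match in FINISHED.finditer(log_text):
        distinct, _total, assumed_true = map(int, match.groups())
        rows.append((distinct, assumed_true, distinct - assumed_true))
    return rows

# First and last "Finished DeNoise" lines copied from the log in this thread.
sample = """\
Finished DeNoise: ndistinct_elts/total_elts.4609160/68170524 ndistinct_true_elts.4577116
Finished DeNoise: ndistinct_elts/total_elts.4823757/572189279 ndistinct_true_elts.4577116
"""

for distinct, assumed_true, surplus in surplus_after_denoise(sample):
    print(f"distinct after deNoise: {distinct}, assumed true: {assumed_true}, surplus: {surplus}")
```

In the log above, the count of surviving distinct k-mers is already larger than the assumed number of true k-mers after the first round and keeps growing from round to round.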

schmeing commented 4 years ago

Since the coverage is super high (1000x), I treated everything up to f19 as erroneous k-mers. This brings the number of true k-mers into the range of the E. coli genome size.

F1      3808479766
F0      197633152
f1      165363660
f2      18092384
f3      3486316
f4      1149692
f5      859132
f6      868277
f7      835569
f8      729207
f9      564317
f10     423188
f11     279194
f12     180215
f13     107030
f14     55179
f15     29764
f16     16500
f17     7618
f18     5539
f19     3255
f20     1869
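
For readers following along, the sketch below shows one way to turn this histogram into the -n and -e values. It assumes that -n is F0 minus the distinct k-mers at or below the cutoff, and that -e follows from treating base errors as independent, so a k-mer is error-free with probability (1 - e)^k. These formulas are assumptions that happen to be consistent with the numbers in this thread; the CQF-deNoise README remains the authoritative source, and the function name is hypothetical.

```python
# ntCard output from the thread: F1 = total k-mer occurrences,
# F0 = number of distinct k-mers, f[i] = distinct k-mers seen exactly i times.
F1 = 3808479766
F0 = 197633152
f = {
    1: 165363660, 2: 18092384, 3: 3486316, 4: 1149692, 5: 859132,
    6: 868277, 7: 835569, 8: 729207, 9: 564317, 10: 423188,
    11: 279194, 12: 180215, 13: 107030, 14: 55179, 15: 29764,
    16: 16500, 17: 7618, 18: 5539, 19: 3255, 20: 1869,
}

def derive_parameters(k, cutoff):
    """Assumed derivation: k-mers with count <= cutoff are treated as erroneous."""
    erroneous_distinct = sum(f[i] for i in range(1, cutoff + 1))
    erroneous_occurrences = sum(i * f[i] for i in range(1, cutoff + 1))
    n_true = F0 - erroneous_distinct
    # Fraction of error-free k-mer occurrences = (1 - e)^k, solved for e.
    error_rate = 1 - (1 - erroneous_occurrences / F1) ** (1 / k)
    return n_true, error_rate

n_true, error_rate = derive_parameters(k=28, cutoff=19)
print(f"-N {F1} -n {n_true} -e {error_rate:.9f}")
```

With cutoff=19 this gives 4577116 true k-mers and an error rate of roughly 0.002477, in line with the call above. A lower cutoff raises the true k-mer estimate and lowers the error rate.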

Christina-hshi commented 4 years ago

I would suggest treating only k-mers with occurrence counts <= 2 as potentially erroneous, even though the sequencing depth is extremely high in your case. When the number of deNoise rounds is m, CQF-deNoise cannot guarantee that all k-mers with occurrence counts <= m will be removed; whether a given k-mer is removed depends on whether it occurs more than once between two consecutive deNoise rounds. If you instead count k-mers with occurrence counts > 2 as the potential "true k-mers", the program will allocate enough space for the CQF in your case.
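
The toy simulation below illustrates why a low total count does not guarantee removal. It assumes that a deNoise round drops every k-mer whose current count is exactly 1 and leaves the others untouched, which is what the log above suggests (distinct and total counts fall by the same amount in each round). A plain Counter stands in for the CQF, and the trigger rule is a simplification, so this is a sketch of the idea rather than the actual implementation.

```python
from collections import Counter

def count_with_denoise(stream, trigger):
    """Count items; when `trigger` distinct items are stored, drop all singletons."""
    counts = Counter()
    rounds = 0
    for kmer in stream:
        counts[kmer] += 1
        if len(counts) >= trigger:
            # Assumed deNoise behaviour: remove items whose count is exactly 1.
            for key in [k for k, c in counts.items() if c == 1]:
                del counts[key]
            rounds += 1
    return counts, rounds

# "E1" and "E2" both occur twice in total.
# E1's two occurrences fall into different windows, E2's into the same window.
stream = ["T", "T", "E1", "x1", "x2",   # window 1 ends here
          "E1", "E2", "E2", "x3"]       # window 2 ends here
counts, rounds = count_with_denoise(stream, trigger=4)
print(dict(counts), rounds)
```

E1 is removed in both rounds because it is a singleton each time, whereas E2 reaches a count of 2 before the second round and survives, even though both occur only twice overall.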

schmeing commented 4 years ago

Thank you, with the updated parameters it ran through. However, the next step (Contiger) scrolled through pages and pages of [Error] kmer not found! and finished by reporting 4410524 unitigs with a total length of 125519402 bp, which is far too much sequence for an E. coli genome. Finally, minia crashed as well:

Minia 3, git commit de0334e
iterating on 4485721 nodes on disk
[removing tips,    pass  1               ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [  88,   88,  100] MB unexpected error: li.unitig=4532740, unitig_deleted.size()=4410524
simplePathLongest_avance stopped at #id:155 GCGATTTTCCTGCTGCGCGCCTACCGCA because of surprising lack of in-neighbor given that we had to come from somewhere
simplePathLongest_avance stopped at #id:1962 TAGTCTTTAAGCTGAAAGATTGCAAAGA because of surprising lack of in-neighbor given that we had to come from somewhere
minia: Minia/thirdparty/gatb-core/gatb-core/src/gatb/debruijn/impl/GraphUnitigs.cpp:1793: void gatb::core::debruijn::impl::GraphUnitigsTemplate<span>::simplePathLongest_avance(const gatb::core::debruijn::impl::NodeGU&, gatb::core::debruijn::impl::Direction, int&, int&, bool, float&, std::__cxx11::string*, std::vector<gatb::core::debruijn::impl::NodeGU>*) [with long unsigned int span = 32ul; std::__cxx11::string = std::__cxx11::basic_string<char>]: Assertion `in_neighbors >= 1' failed.

I called all the steps with the following commands:

ntcard -t2 -k28 SRR3191692_1.fastq.gz SRR3191692_2.fastq.gz
CQF-deNoise -i input_list.txt -o k28.cqf -k 28 -N 3808479766 -n 14177108 -e 0.00194 -f g
Contiger -k 28 -i input_list.txt -c k28.cqf -o unitigs.fa -f g
minia -kmer-size 28 -unitig -in unitigs.fa

Christina-hshi commented 4 years ago

I followed your steps to assemble the E. coli genome. The problem was caused by a function that failed to handle the case where k is a multiple of 4 (e.g. 4*7 = 28). Because we only used odd values of k (e.g. 47) in our tests, this bug had gone unnoticed. After fixing it, the program now works with k=28, and I have updated the source code. Thanks! In addition, I would suggest trying a larger k (e.g. 47 or 55) and using an odd number, so that a k-mer and its reverse complement can never be the same sequence.
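
The point about odd k can be checked by brute force for small k. The sketch below is independent of SH-assembly and its helper names are illustrative: it enumerates all k-mers over ACGT and lists those equal to their own reverse complement. For odd k the middle base would have to be its own complement, which no base is, so the list is empty.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Complement every base, then reverse the sequence."""
    return kmer.translate(COMPLEMENT)[::-1]

def self_complementary_kmers(k):
    """All k-mers over ACGT that equal their own reverse complement."""
    kmers = ("".join(bases) for bases in product("ACGT", repeat=k))
    return [kmer for kmer in kmers if kmer == reverse_complement(kmer)]

for k in (2, 3, 4, 5):
    print(k, len(self_complementary_kmers(k)))
```

For even k the first half of the sequence determines the second half, so there are 4^(k/2) such k-mers (for example ACGT when k=4), while for odd k there are none.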

schmeing commented 4 years ago

Thank you! If a larger k such as 47 is better, I suggest changing the example workflow to use it: I had taken 47 from the paper and only changed it once I saw that the example workflow used 28.

Christina-hshi commented 4 years ago

Hi Schmeing! Thank you for your suggestion! I have updated the example to use k=47.