Christina-hshi / SH-assembly

K-mer counting with low memory consumption enables de novo assembly of 106x human sequence data in 2.7 hours

Contiger runs over a week without finishing #4

Closed · schmeing closed this issue 3 years ago

schmeing commented 4 years ago

Hi,

now that the E. coli assembly is working, I am trying a human assembly with a public dataset that is split into the following runs: ERR3518653, ERR3519640, ERR3522379, ERR3522985, ERR3528464, ERR3528894, ERR3530147, ERR3530148, ERR3556082, ERR3556092, ERR3559869, ERR3561121, ERR3561123, ERR3561210, ERR3561217, ERR3561247

The Contiger step did not finish after over a week of runtime. I also noticed that it only used a single core despite 30 being specified. I am calling the programs in the following way:

ntcard -t30 -k47 reads1.fq.gz reads2.fq.gz
CQF-deNoise -t 30 -i input_list.txt -o k47.cqf -k 47 -N 285948060395 -n 3046611112 -e 0.00306905 -f g
Contiger -t 30 -k 47 -i input_list.txt -c k47.cqf -o unitigs.fa -f g

The output of all three commands is:

Runtime(sec): 12583.3274
CQF-deNoise settings:
qb: 34
hb: 42
thread_num: 30
K: 47
number of true k-mers: 3046611112
tolerable wrong removal rate: 3.28234e-10
number of deNoise rounds: 4
deNoise after processing all k-mers: false
number of unique k-mers triggering deNoise: 10739566640
#deNoise rounds leading to the same size of CQF: [4, 42]
wrong removal rate leading to the same #deNoise rounds: [1.25194e-29, 1.37236e-06]

2020-09-10.22:12:07
Start to build K-mer spectrum...
2020-09-10.22:35:33
Ready for DeNoise: ndistinct_elts/total_elts.10756816381/105900325616 ndistinct_true_elts.3046611112
Finished DeNoise: ndistinct_elts/total_elts.3045617475/98189126710 ndistinct_true_elts.3046611112
2020-09-10.22:35:46
2020-09-10.22:46:58
Ready for DeNoise: ndistinct_elts/total_elts.10758832408/130141216052 ndistinct_true_elts.3046611112
Finished DeNoise: ndistinct_elts/total_elts.3137527241/122519910885 ndistinct_true_elts.3046611112
2020-09-10.22:47:12
Time for building K-mer spectrum without dumping to disk: 2212 seconds.
Estimated probability of true k-mers with freq<=4 is: 1.05561e-08
Finished building K-mer spectrum!
Time for building K-mer spectrum: 2229 seconds.
Params: 
        kmer size:                47
        kmer min. abundance:      2
        solid kmer min. abundance:2
        solid kmer max. abundance:1000000
        threads:                  30
2020-09-10 22:49:19
[CQF] load cqf from disk
[CQF] cqf loaded!
2020-09-10 22:49:33
[Unitig] find unitigs
Christina-hshi commented 4 years ago

Hi Schmeing, Does the memory usage of the program keep increasing?

schmeing commented 4 years ago

No. It was constant at roughly 60 GB.

Christina-hshi commented 4 years ago

This is unexpected, and I have no idea why it happened. Did you try re-running Contiger? Also, could you tell me where I can access the public datasets used in your experiment? They do not seem to be in the SRA database.

schmeing commented 4 years ago

The public datasets are part of the European Nucleotide Archive. I expected SRA to have mirrored them, but it seems this is not the case. The datasets are a selection from the following project: https://www.ebi.ac.uk/ena/browser/view/PRJEB33197 I have not restarted Contiger so far because I expected it to work deterministically, but I will give it a try now.

schmeing commented 4 years ago

Restarting it worked, but gave errors again:

Params:
        kmer size:                47
        kmer min. abundance:      2
        solid kmer min. abundance:2
        solid kmer max. abundance:1000000
        threads:                  30
2020-09-19 12:46:49
[CQF] load cqf from disk
[CQF] cqf loaded!
2020-09-19 12:47:11
[Unitig] find unitigs
[Error] kmer not found![Error] kmer not found!

[Error] kmer not found!
[Error] kmer not found!
[Unitig] 107077056 unitigs reported of length 8039223375 bp in total
[Unitig] among them, there are 2301 palindromes.
[Unitig] build unitig graph.
2020-09-19 13:37:13
[Dump] save the unitig graph to file.
2020-09-19 13:40:39
2020-09-19 13:47:44
schmeing commented 4 years ago

The final minia command still worked, so I don't know whether the errors matter. I will run the script that calls everything again and see if that reproduces the problem.

Christina-hshi commented 4 years ago

Thanks, Schmeing! Contiger finishing in about one hour for human data at ~100x sequencing depth is as expected; I don't know why it previously ran for over a week without finishing. As for the "k-mer not found" errors, they do not seem to affect the usability of the unitig graph. I am trying to reproduce your experiments, but I will need some time since they use a lot of data. I will update you as soon as possible.

schmeing commented 4 years ago

I reran the script and the assembly ran through without problems in 4.5 hours (including ntcard and minia). This time it produced only two [Error] kmer not found! messages. I am sorry that I cannot reproduce the endless-running issue; I know that makes it basically impossible to debug.

Christina-hshi commented 4 years ago

Hi Schmeing, I followed your script to call the programs twice and could not reproduce the [Error] kmer not found! errors. In the implementation of Contiger, we use concurrent_hash_map from the TBB library. The [Error] kmer not found! message is reported only when a k-mer that should have been inserted into the concurrent_hash_map is not found in it. Since multi-threaded access to the data structure is allowed, I am wondering whether the problem is related to this library. I am using TBB (version 2019.7) installed via conda. By the way, the average base error rate, estimated by treating singleton k-mers and k-mers occurring twice as potential false k-mers, is 0.0018287, which is smaller than the error rate of 0.00306905 specified in your script. Perhaps you also counted k-mers occurring more than twice as potential false k-mers.
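The error-rate estimate mentioned above can be reproduced with a back-of-envelope calculation: a k-mer covering even one sequencing error is false, so if a fraction p of k-mer instances is error-free, then p ≈ (1 − e)^k. Treating singleton and doubly-occurring k-mers as the false ones gives an estimate of e. This is an illustrative sketch of that reasoning (the function name and the exact proxy for "false" k-mers are assumptions, not necessarily the formula the tools use):

```python
def estimate_base_error_rate(f1, f2, total_kmer_instances, k):
    """Rough per-base error estimate.

    Assumption (illustrative, not the tools' exact method): k-mers seen
    once (f1 distinct) or twice (f2 distinct, i.e. 2*f2 instances) are
    sequencing errors. If a fraction p of k-mer instances is error-free,
    then p ~= (1 - e)**k, so e ~= 1 - p**(1/k).
    """
    false_instances = f1 + 2 * f2
    p_error_free = 1.0 - false_instances / total_kmer_instances
    return 1.0 - p_error_free ** (1.0 / k)

# With no low-frequency k-mers, the estimated error rate is zero.
print(estimate_base_error_rate(0, 0, 10**9, 47))  # -> 0.0
```

Plugging in the f1 and f2 values reported by ntcard for this dataset should land near Christina's 0.0018287 if the same proxy is used; counting higher-frequency k-mers as false inflates the estimate, consistent with the 0.00306905 in the script.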

schmeing commented 4 years ago

Did you run the commands from my script manually, or did you run it as a script? Calling them manually produces no errors for me, but running the script does. Adding a 30-second wait between the program calls in the script seems to fix it. My suspicion is that the output files are not yet fully written when the next program opens them. Do you flush the output buffer once a program finishes?
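The stop-gap described above can be scripted as a small wrapper. Note that each pipeline stage already blocks until the program exits; the extra sleep only papers over output that is not yet fully visible on disk. Everything here (the function name, the commented-out commands) is illustrative, not part of SH-assembly:

```python
import subprocess
import time

def run_stage(cmd, settle_seconds=30):
    """Run one pipeline stage, fail fast on a non-zero exit code,
    then pause so its output files are fully visible before the
    next stage opens them (the workaround described above)."""
    subprocess.run(cmd, check=True)  # blocks until the process exits
    time.sleep(settle_seconds)

# Hypothetical usage, mirroring the commands from this thread:
# run_stage(["ntcard", "-t30", "-k47", "reads1.fq.gz", "reads2.fq.gz"])
# run_stage(["CQF-deNoise", "-t", "30", "-i", "input_list.txt", "-o", "k47.cqf"])
# run_stage(["Contiger", "-t", "30", "-k", "47", "-i", "input_list.txt"])
```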

Christina-hshi commented 4 years ago

I manually ran all the commands in your script previously. The close function is called after everything has been written, so the problem should not be caused by buffering as you suggested. I then ran it as a bash script and finally reproduced the [Error] kmer not found! errors. The errors did not appear consistently in every run, so it took me some time to find the cause. It turns out there was a synchronization problem in the multi-threaded implementation of unitig finding, which caused the [Error] kmer not found! errors under certain conditions. I have updated the source code and reran the experiments a few times to make sure the errors no longer appear. Thank you, Schmeing!
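The class of bug described here, where one thread looks up a k-mer before the thread responsible for inserting or updating it has finished, can be shown in miniature. This is a generic Python illustration of such a race on a shared map, not the actual Contiger/TBB code: without the lock, the read-modify-write below can interleave between threads and lose updates; with the lock, every inserted k-mer is found afterwards.

```python
import threading

def count_kmers(kmers, num_threads=8, use_lock=True):
    """Insert k-mers into a shared dict from several threads.

    The read-modify-write (get, add one, store) is not atomic, so
    without the lock two threads can read the same old count and one
    increment is lost. With the lock every insert is preserved.
    Illustrative only, not Contiger's implementation.
    """
    counts = {}
    lock = threading.Lock()

    def worker(chunk):
        for kmer in chunk:
            if use_lock:
                with lock:
                    counts[kmer] = counts.get(kmer, 0) + 1
            else:
                counts[kmer] = counts.get(kmer, 0) + 1  # unsynchronized

    chunks = [kmers[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts

kmers = ["ACGTACG", "TTTAAAC", "ACGTACG"] * 1000
counts = count_kmers(kmers, use_lock=True)
assert all(k in counts for k in kmers)  # every inserted k-mer is found
```

TBB's concurrent_hash_map makes individual insert/find operations safe, but compound check-then-act sequences still need their own synchronization, which matches the fix described above.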

schmeing commented 3 years ago

I can confirm it works now. Thank you for fixing it.