Closed: schmeing closed this issue 3 years ago
Hi Schmeing, Does the memory usage of the program keep increasing?
No. It was constant at roughly 60 GB.
This is unexpected. I have no idea why this happened. Did you try to re-run the "Contiger"? And may I know where I can access the public datasets used in your experiment? It seems that they are not in the SRA database.
The public datasets are part of the European Nucleotide Archive. I expected the SRA to have mirrored them, but it seems this is not the case. The datasets are a selection from the following project: https://www.ebi.ac.uk/ena/browser/view/PRJEB33197 I have not restarted the Contiger so far, because I expected it to work deterministically, but I will give it a try now.
Restarting it worked, but gave errors again:
Params:
kmer size: 47
kmer min. abundance: 2
solid kmer min. abundance: 2
solid kmer max. abundance: 1000000
threads: 30
2020-09-19 12:46:49
[CQF] load cqf from disk
[CQF] cqf loaded!
2020-09-19 12:47:11
[Unitig] find unitigs
[Error] kmer not found![Error] kmer not found!
[Error] kmer not found!
[Error] kmer not found!
[Unitig] 107077056 unitigs reported of length 8039223375 bp in total
[Unitig] among them, there are 2301 palindromes.
[Unitig] build unitig graph.
2020-09-19 13:37:13
[Dump] save the unitig graph to file.
2020-09-19 13:40:39
2020-09-19 13:47:44
The final minia command still worked, so I don't know whether the errors matter. I will run the script that calls everything again and see if that reproduces the crash.
Thanks, Schmeing! That the Contiger finished in about an hour for the human data with ~100x sequencing depth is as expected. I don't know why it previously ran for over a week without finishing. As for the "kmer not found" errors, they do not seem to affect the usability of the unitig graph. I am trying to reproduce your experiments, but I will need some time since this uses a lot of data. I will update you as soon as possible.
I reran the script and the assembly ran through without problems in 4h30 (including ntcard and minia). This time it only produced two [Error] kmer not found! messages. I am sorry that I cannot reproduce the endless-running issue; I know that makes it basically impossible to debug.
Hi Schmeing, I followed your scripts to call the programs twice, and could not reproduce the [Error] kmer not found! errors.
In the implementation of Contiger, we used concurrent_hash_map from the tbb library. The [Error] kmer not found! message is reported only when a k-mer that should have been inserted into the concurrent_hash_map is not found in it. Since multi-threaded access to the data structure is allowed, I am wondering whether the errors are related to this library. I am using the tbb library (version 2019.7) installed via conda.
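To make the failure condition described above concrete, here is a minimal stdlib sketch of the lookup pattern (Contiger itself uses tbb::concurrent_hash_map, which locks per bucket rather than globally; the names and structure here are purely illustrative, not from the actual source):

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative stand-in for the k-mer table.
struct KmerTable {
    std::unordered_map<std::string, uint64_t> map;
    std::mutex mtx;

    void insert(const std::string& kmer, uint64_t id) {
        std::lock_guard<std::mutex> lock(mtx);
        map.emplace(kmer, id);
    }

    // A false return here corresponds to the situation that triggers
    // "[Error] kmer not found!": a k-mer that should already have been
    // inserted is missing from the table.
    bool find(const std::string& kmer, uint64_t& id) {
        std::lock_guard<std::mutex> lock(mtx);
        auto it = map.find(kmer);
        if (it == map.end()) return false;
        id = it->second;
        return true;
    }
};
```

Under single-threaded use this can never report a false miss for an inserted key; the question raised above is whether concurrent access introduces a window where it can.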
By the way, the average base error rate estimated by taking the singleton k-mers and the k-mers occurring twice as potential false k-mers is 0.0018287, which is smaller than the error rate specified in your script, 0.00306905. Maybe you also considered k-mers occurring more than twice as potential false k-mers.
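For concreteness, one common back-of-envelope model behind this kind of estimate (the exact estimator used here is not shown, so this is an assumption) treats each base error as corrupting up to k overlapping k-mers: if a fraction f of k-mer observations are erroneous, then f ≈ 1 − (1 − e)^k, which can be inverted to recover e:

```cpp
#include <cmath>

// Hypothetical estimator (not necessarily Contiger's): invert
// f = 1 - (1 - e)^k to recover the per-base error rate e from the
// erroneous k-mer fraction f.
double base_error_from_kmer_fraction(double f, int k) {
    return 1.0 - std::pow(1.0 - f, 1.0 / k);
}
```

Under this model, with k = 47 an erroneous k-mer fraction of roughly 8.2% would correspond to e ≈ 0.0018, the value reported above.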
Did you follow my script manually or run it as a script? Calling the commands manually results in no errors for me, but running them as a script does. Waiting 30 seconds between the program calls in the script seems to fix this. My suspicion is that the output files are not yet fully written when the next program opens them. Do you flush the output buffer once the program is finished?
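The buffering concern raised here, output still sitting in a userspace buffer when the next program opens the file, can be sketched in generic stdlib terms (this is not Contiger's actual I/O code):

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Data written through an ofstream goes into a userspace buffer first;
// it is only guaranteed to reach the file once the stream is flushed or
// closed (close() flushes, and the destructor closes).
void write_and_close(const std::string& path, const std::string& data) {
    std::ofstream out(path);
    out << data;
    out.close();  // flush + close before any later process reads the file
}

std::string read_back(const std::string& path) {
    std::ifstream in(path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Note that close() is enough for a subsequent process on the same machine to see the data; fsync-style durability is a separate concern and not what is at issue here.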
I manually ran all the commands in your script previously. The close function is called after everything has been written out, so it should not be the buffering issue you suggested. I then ran it as a bash script, and finally reproduced the [Error] kmer not found! errors. The errors did not appear consistently in each run, so it took me some time to find the cause. It turns out that there is a synchronization problem in the multi-threaded implementation of finding unitigs, which causes the [Error] kmer not found! errors under a certain condition. I updated the source code, and reran the experiments a few times to make sure that the errors no longer appear. Thank you, Schmeing!
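The class of bug described here, where one thread can look a k-mer up before another thread's insert is visible, typically comes from doing the check and the insert as two separately-locked steps. A minimal stdlib sketch of the safe pattern (the actual fix lives in Contiger's source; this only illustrates the idea):

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

class UnitigIndex {
    std::unordered_map<std::string, int> ids_;
    std::mutex mtx_;
    int next_id_ = 0;

public:
    // Lookup and insert happen under one lock, so no thread can observe
    // the k-mer as "missing" in the window between another thread's
    // check and its insert.
    int get_or_assign(const std::string& kmer) {
        std::lock_guard<std::mutex> lock(mtx_);
        auto [it, inserted] = ids_.try_emplace(kmer, next_id_);
        if (inserted) ++next_id_;
        return it->second;
    }
};
```

tbb::concurrent_hash_map achieves the same effect more scalably: its insert(accessor, key) holds a per-bucket lock for as long as the accessor is alive, making the find-or-insert step atomic without a global mutex.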
I can confirm it works now. Thank you for fixing it.
Hi,
after the E. coli assembly is working, I am trying a human assembly with a public dataset split into the following runs: ERR3518653, ERR3519640, ERR3522379, ERR3522985, ERR3528464, ERR3528894, ERR3530147, ERR3530148, ERR3556082, ERR3556092, ERR3559869, ERR3561121, ERR3561123, ERR3561210, ERR3561217, ERR3561247
The Contiger step did not finish after over a week of runtime. I also noticed that it used only a single core despite 30 being specified. I am calling the programs in the following way:
The output of all three commands is: