Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

nextgraph, could it run with multi thread? #52

Closed fangdm closed 4 years ago

fangdm commented 4 years ago

Hi Dr. Hu, thanks for your excellent work on NextDenovo. I have a genome of about 10 Gb, and every step except nextgraph has finished successfully (NextDenovo version: NextDenovo-v2.2-beta.0). However, nextgraph is very slow: after 24 hours it has only printed "Initialize graph and reading...". Can it run with multiple threads? I didn't find any parameter for that. Is there any other way to speed it up?

Thank you.

moold commented 4 years ago

Hi, NextGraph does not have a multi-threading option, because the bottleneck is I/O and multiple threads may not speed it up. Could you check the total size of the input ovl files? In general, though, it should not take 24 hours for a 10 Gb genome.
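One way to do the size check suggested above is a small script that sums the on-disk sizes of the overlap files and lists any empty ones. This is only a sketch: the glob pattern is a placeholder you would have to point at wherever your run wrote its ovl files, the function names are made up for illustration, and the on-disk size is at best a rough proxy for the amount of data NextGraph has to read.

```python
import glob
import os


def summarize_files(pattern):
    """Return (total_bytes, file_count, empty_paths) for files matching a glob.

    `pattern` is a placeholder supplied by the user; this sketch does not
    assume any particular NextDenovo directory layout.
    """
    paths = sorted(p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p))
    sizes = {p: os.path.getsize(p) for p in paths}
    empty = [p for p, size in sizes.items() if size == 0]
    return sum(sizes.values()), len(paths), empty


def human_readable(n):
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024
```

For example, `summarize_files("/path/to/workdir/**/*.ovl*")` (path and suffix are hypothetical) returns the total size, the number of files, and the list of empty files, and `human_readable(total)` makes the first value easier to read.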

fangdm commented 4 years ago

Thank you. I have checked, and there were no warnings, errors, or empty files. I'm going to wait a bit longer. Two more questions:
1) For this large genome, with ~120x PacBio raw data, how much memory do I need to run nextgraph?
2) I saw the earlier NextDenovo issue: https://github.com/Nextomics/NextDenovo/issues/31 Can the latest version assemble a genome larger than 3.5 Gb? Our jobs submitted with "qsub" can use up to 3 TB of memory. Is that enough to finish the assembly?

Thank you very much.

moold commented 4 years ago

Hi,

  1. NextGraph may require 100-200 GB, or even less, depending on the total size of the input ovl files (this is not equal to the file size).
  2. 3 TB is enough, but the released version does not support assembling genomes larger than 3.5 Gb. You can try wtdbg, SMARTdenovo, miniasm, Flye, or other tools with the corrected reads generated by NextDenovo (use correction_options = -b), which usually gives a better result than using raw reads.

fangdm commented 4 years ago

That's too bad, we have no extra money to pay the company. Could you release a version for large genomes to us? Ha ha. I have tried wtdbg and miniasm, but the assembly N50 was not good.

Thank you.

moold commented 4 years ago

Have you tried other assemblers with reads corrected by NextDenovo (with the option correction_options = -b)? I am sorry, but I do not have permission to send you the unlimited version. You can ask our support team for help.

fangdm commented 4 years ago

Hi, I have enough PacBio data (~120x). Why do you suggest the -b parameter?

I have tried wtdbg and miniasm with raw data, but not with reads corrected by NextDenovo. If I assemble the corrected reads with minimap2 and miniasm, how should I set the parameters?

For another genome of about 600 Mb with >100x coverage, assembled by NextDenovo without -b, I compared the assembly with the reference and found a lot of sequencing errors. After polishing with NextPolish, the accuracy improved significantly. Could this be related to the -b parameter?

Thank you.

moold commented 4 years ago

Hi,

  1. Without -b, NextCorrect ignores some reads that are not useful for NextGraph, but these reads may be useful for other assembly tools; setting -b keeps them. If you have already run NextDenovo without the -b option, just change correction_options to correction_options = -b in the config file, remove the following done files: 02.cns_align/01.get_cns.sh.done and 02.cns_align/01.get_cns.sh.work/get_cns*/nextDenovo.sh.done, and rerun the whole pipeline. NextDenovo will rerun all get_cns tasks and ignore the corrected reads in 02.cns_align/01.get_cns.sh.work/get_cns*/cns.fasta.
  2. -b does not affect assembly accuracy for NextDenovo, and multiple rounds of polishing are required for long-read assemblies.
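The rerun steps in point 1 could be scripted roughly as follows. This is a sketch, not part of NextDenovo: the function names, the `workdir` and `cfg_path` arguments, and the dry-run default are assumptions added for illustration. Only the two done-file patterns and the `correction_options = -b` line come from the reply above. Note that the sketch overwrites the whole `correction_options` line, so merge by hand if your line carries other options.

```python
import glob
import os
import re

# Done-file patterns quoted in the reply above, relative to the work directory.
DONE_PATTERNS = [
    "02.cns_align/01.get_cns.sh.done",
    "02.cns_align/01.get_cns.sh.work/get_cns*/nextDenovo.sh.done",
]


def set_correction_options(cfg_path, value="-b"):
    """Replace the correction_options line of a config file in place."""
    with open(cfg_path) as fh:
        lines = fh.readlines()
    key = re.compile(r"^\s*correction_options\s*=")
    hits = [i for i, line in enumerate(lines) if key.match(line)]
    if not hits:
        raise ValueError(f"no correction_options line found in {cfg_path}")
    for i in hits:
        lines[i] = f"correction_options = {value}\n"
    with open(cfg_path, "w") as fh:
        fh.writelines(lines)


def remove_done_files(workdir, dry_run=True):
    """List the get_cns done files under workdir; delete them unless dry_run."""
    matched = []
    for pattern in DONE_PATTERNS:
        matched.extend(sorted(glob.glob(os.path.join(workdir, pattern))))
    if not dry_run:
        for path in matched:
            os.remove(path)
    return matched
```

Calling `remove_done_files(workdir)` first with the default `dry_run=True` shows which files would be deleted; the existing cns.fasta files are left untouched, since the reply says NextDenovo ignores them on the rerun.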

haoliu1213 commented 4 years ago

Our system's I/O can reach 6 GB/s on one node, and a single iozone thread achieves 4 GB/s for reads and writes, but nextgraph cannot use it fully: it only reaches about 1 GB/s. Could you explore ways for nextgraph to use the full I/O bandwidth?
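To put numbers like these side by side, a single-threaded sequential read of one input file can be timed with a few lines of Python. This is a rough sketch under stated assumptions: the function names and the 8 MiB block size are arbitrary choices, it says nothing about how nextgraph itself reads its input, and a file that is already in the page cache will report a much higher rate than the storage can deliver.

```python
import time


def read_throughput(path, block_size=8 * 1024 * 1024):
    """Read a file sequentially in one thread; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives an unbuffered raw file object, so each read() call
    # maps to one request of at most block_size bytes.
    with open(path, "rb", buffering=0) as fh:
        while True:
            chunk = fh.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def gb_per_second(total_bytes, seconds):
    """Convert a byte count and a duration to decimal GB/s."""
    return total_bytes / 1e9 / seconds if seconds > 0 else float("inf")
```

Running `gb_per_second(*read_throughput(path))` on one large ovl file gives a figure that can be compared with the iozone result and with the rate observed while nextgraph is running.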