Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

Questions about "seed_cutoff" option #49

Open TypicalSEE opened 4 years ago

TypicalSEE commented 4 years ago

Hi, Dr. Hu, thanks for your excellent work at NextOmics. I have a few questions about the "seed_cutoff" option and I would appreciate it very much if you could help me:

  1. If I set "read_cutoff" to 1000 and "seed_cutoff" to 1001, will all reads that are longer than 1001 bp be corrected?
  2. If I have enough CPUs, should I correct all the reads longer than 1 kb that I have on hand?
  3. I did a small test on the "seed_cutoff" option. The 1st time I set seed_cutoff = 13k, and the best of the 100 results was: assembly size = 550 Mb, contig N50 = 1.1 Mb. The 2nd time I set seed_cutoff = 1001 to correct all the reads, and the best result was: assembly size = 665 Mb, contig N50 = 940 kb. The genome size estimate (by kmerfreq) is about 1 Gb and we only have ~20 Gb of Nanopore data, by the way. Do you think the difference between the assembly sizes (550 Mb vs. 665 Mb) is because of the different seed_cutoff values? If so, how should I decide how much data to use during the read correction stage? Thanks in advance! YU Jin.
moold commented 4 years ago

Hi, 1. No. Although NextDenovo will try to correct reads longer than 1001 bp, it also filters out some low-quality, low-depth reads, etc., so the corrected output contains far fewer reads than the raw reads longer than 1001 bp.

2 & 3, Your data is not enough for assembly using the currently version of NextDenovo with default options, because all default options are optimize with 60-100x NanoPore data. So it will produce an unexpected assembly result. But if you still want to use NextDenovo to do the assembly, you can try to use the option correction_options = -b and change -k 20 in sort_options and than rerun all pipeline, while I can not guarantee you can get a good result. You can try to other assemblers.

TypicalSEE commented 4 years ago

Thanks for your reply, it helps a lot. But what still confuses me is: should I set seed_cutoff as low as possible (1001, for example) when I have enough Nanopore data and enough CPUs? Will correcting as many reads as possible improve assembly quality? Thanks again.

moold commented 4 years ago

Yes, but I recommend using bin/seq_stat to calculate the expected seed cutoff.
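
For example, something along these lines (a hedged sketch; the -g genome-size flag and the fofn input are assumptions about the utility's interface, so check bin/seq_stat -h for the exact options in your version):

```
# print read-length statistics and a recommended seed cutoff for a ~1 Gb genome
bin/seq_stat -g 1g input.fofn
```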

gitcruz commented 4 years ago

Dear Dr. Hu,

I've recently run NextDenovo using 33x ONT reads from a 1 Gb genome. After running seq_stat, the suggested seed cutoff was 0 bp. However, as the minimum read length was 1000 bp, I set the seed_cutoff to 1.1k. Results were a bit disappointing, with N50 = 2 Mb. As a comparison, for a mammalian genome with 70x PacBio I got an N50 of 76 Mb!!!

I wonder if it's worth tweaking some of the parameters as suggested above (correction_options = -b and changing -k to 20 in sort_options), or whether it would be necessary to gather more data to reach 60x (which is not always possible)?

I would also like to know if there is a document with more detailed help on this assembler. Do you have any sort of manual or white paper, or do you plan to upload a manuscript to bioRxiv?

The results on the mammal are very encouraging; the program is definitely a tool to consider for achieving chromosome-scale assemblies.

Thanks, Fernando

moold commented 4 years ago

Hi, the input data is not enough and the seed length is too short. You can see that the default value of the -min_len_seed option in nextcorrect.py is 10k, so most of the corrected seeds will be filtered out. Currently, the default options are optimized for input data >= 60x and seed lengths >= 20 kb; otherwise the pipeline may produce unexpected results and the assembly quality needs to be checked carefully. BTW, I am now preparing the manuscript of NextDenovo, and I will also provide default options for short seeds and 30x input data in the next release. But if you want a better assembly result, it is recommended to sequence >= 60x data using Nanopore ultra-long libraries.
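
To make the depth arithmetic behind this concrete, here is a small illustrative Python sketch (not part of NextDenovo) that estimates how much seed coverage survives a given seed cutoff; the read-length distribution and all numbers are purely hypothetical:

```python
import random

def seed_depth(read_lengths, seed_cutoff, genome_size):
    """Total bases in reads >= seed_cutoff, divided by the genome size."""
    seed_bases = sum(length for length in read_lengths if length >= seed_cutoff)
    return seed_bases / genome_size

# Hypothetical numbers: a 1 Gb genome with roughly 35x of ONT data whose read
# lengths follow a synthetic exponential-like distribution (illustration only).
random.seed(0)
genome_size = 1_000_000_000
read_lengths = [int(random.expovariate(1 / 8000)) + 1000 for _ in range(4_000_000)]

for cutoff in (1_000, 10_000, 20_000):
    depth = seed_depth(read_lengths, cutoff, genome_size)
    print(f"seed_cutoff = {cutoff:>6}: ~{depth:.0f}x of seed coverage remains")
```

Under a toy distribution like this, pushing the cutoff toward 20 kb leaves only a fraction of the total coverage as seeds, which illustrates why a ~30x dataset with short reads runs out of usable seed depth.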

gitcruz commented 4 years ago

Thanks,

Having more data would always be great. I would love to use ultra-long reads, but as far as I know they require a lot more input DNA.

I look forward to reading the manuscript.

Best, Fernando

gitcruz commented 3 years ago

Hi Hu,

I just read that the latest release (version 2.3.1) will "use non-seed reads to correct structural & base errors if seed depth < 35". I guess those are the default options you mentioned above. Thus, should I also expect better results in cases with ONT coverage >= 30x? Did you run tests on that front?

Thanks, Fernando

moold commented 3 years ago

NextDenovo is only assembly software, so if you need a more accurate assembly, you can try NextPolish.

gitcruz commented 3 years ago

Hi, OK, I see. The option just affects base-level accuracy (i.e. "use non-seed reads to correct structural & base errors if seed depth < 35"). I was thinking about getting better contiguity and assembly quality (fewer misassemblies) with less data. Thus, v2.3.1 still requires coverage >= 60x for optimal results, right? Thanks, Fernando

moold commented 3 years ago
  1. NO, the main purpose of this step is to correct structural errors, using mapping depth information and overlapped coordinates between seeds.
  2. Yes.