Parameters for plant genome

mycecilia commented 5 years ago

Hi Jason, The assembler runs at lightning speed! I tried it on 40x Nanopore reads corrected by canu, and it finished in 4 hours, while canu is still crawling after more than 2 months. However, I still need to tweak the parameters, because the assembly I got is only half the size of the genome (~1G). Do you have any suggestions for parameters that would help improve the assembly? The parameters I used were the same as the example in your slides: --with-consensus --shimmer-r 3 --best_n_ovlp 8

Thank you for sharing this program. Shiyu

cschin commented 5 years ago

What is the error corrected read coverage? And any idea about the error rate?

cschin commented 5 years ago

@mycecilia OK, I missed it: you said it is 40x. That should be enough. The current Peregrine code may not be able to handle reads with > 1% error. It is a bit challenging to guess what the data looks like. We recently tested the assembler on maize, and the default parameters handle error-corrected maize reads > 18 kb well. If the error rate is a bit higher (I am just guessing), you can make the initial minimizer window size smaller by setting --shimmer-w to 60 instead of the default 80. If the genome is indeed repetitive, you can raise --mc_upper and --ovlp_upper too. However, it would be useful to know what the root cause really is.

Is your genome diploid or tetraploid? If so, do you know the heterozygosity between the haplotypes? Is it possible that the half-size assembly is caused by collapse between the repeats?
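
To illustrate why a smaller window can help with noisier reads, here is a minimal sketch of generic sliding-window minimizer selection. It is not Peregrine's actual SHIMMER implementation: the function name `minimizer_positions`, the CRC32 hash, and the random test sequence are all assumptions made for illustration. It only shows that shrinking the window from 80 to 60 selects more minimizers from the same sequence, so there are more chances for two reads to share an error-free anchor.

```python
import random
import zlib


def minimizer_positions(seq, k, w):
    """Return sorted start positions of k-mers that are the minimum
    (by hash, leftmost on ties) of at least one window of w consecutive k-mers.

    Hypothetical helper for illustration; not taken from Peregrine.
    """
    n_kmers = len(seq) - k + 1
    # (hash, position) pairs: comparing tuples breaks hash ties by position.
    keys = [(zlib.crc32(seq[i:i + k].encode()), i) for i in range(n_kmers)]
    chosen = set()
    for start in range(n_kmers - w + 1):
        chosen.add(min(keys[start:start + w])[1])
    return sorted(chosen)


# Random toy sequence (an assumption; real genomes are not uniformly random).
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(20000))

for w in (80, 60):
    n = len(minimizer_positions(seq, k=16, w=w))
    print(f"w={w}: {n} minimizers selected")
```

With this selection rule, every minimizer chosen at w=80 is also chosen at w=60, so the smaller window only adds anchors. The cost is a larger index.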

mycecilia commented 5 years ago

@cschin My genome is an allotetraploid (2n=4x=44). We've had an assembly with another set of data using Falcon a couple years ago. Now we want to do a de novo assembly with Nanopore sequences for benchmarking. I also estimated the error rate of corrected reads at 4.6%. Thank you for the suggestions. I'm going to investigate more on what's going on with the assembly.

mycecilia commented 5 years ago

I don't know if this could be an indication of underlying issues. I tried WTDBG2 and got a 2G assembly, about double the genome size. And when I map the peregrine and wtdbg2 assemblies against our FALCON reference, using minimap2 with the 5% divergence setting, the reference regions assembled by peregrine and wtdbg2 are almost complementary to each other. The commonly assembled reference regions usually have one peregrine contig but multiple wtdbg2 contigs.
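
One way to put a number on "almost complementary" is to compare the reference intervals covered by each assembly. The sketch below assumes the alignments are in minimap2's PAF format, where column 6 is the target name and columns 8 and 9 are the target start and end. The helper names and the three PAF records are made up for illustration and are not data from this thread.

```python
from collections import defaultdict


def covered_intervals(paf_lines):
    """Merge aligned target intervals per reference sequence from PAF records."""
    by_ref = defaultdict(list)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        by_ref[f[5]].append((int(f[7]), int(f[8])))  # target name, start, end
    merged = {}
    for ref, ivs in by_ref.items():
        ivs.sort()
        out = [list(ivs[0])]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:          # overlaps or touches the previous one
                out[-1][1] = max(out[-1][1], e)
            else:
                out.append([s, e])
        merged[ref] = [(s, e) for s, e in out]
    return merged


def intersect(a, b):
    """Intersect two sorted, non-overlapping interval lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        s, e = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if s < e:
            out.append((s, e))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def total(ivs):
    """Total number of bases in an interval list."""
    return sum(e - s for s, e in ivs)


# Made-up PAF records standing in for the two assembly-to-reference mappings.
peregrine_paf = ["ctgA\t500\t0\t500\t+\tchr1\t1000\t0\t500\t480\t500\t60"]
wtdbg2_paf = [
    "ctgX\t300\t0\t300\t+\tchr1\t1000\t450\t750\t290\t300\t60",
    "ctgY\t250\t0\t250\t-\tchr1\t1000\t700\t950\t240\t250\t60",
]

p = covered_intervals(peregrine_paf)
w = covered_intervals(wtdbg2_paf)
for ref in sorted(set(p) | set(w)):
    shared = intersect(p.get(ref, []), w.get(ref, []))
    print(ref, "peregrine:", total(p.get(ref, [])),
          "wtdbg2:", total(w.get(ref, [])), "shared:", total(shared))
```

A small shared total relative to each assembly's own coverage would support the "complementary" observation. Filtering on mapping quality or alignment length before merging is left out here for brevity.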

cschin commented 5 years ago

@mycecilia An error rate of 4.6% is a bit high for Peregrine. However, it seems that you still get some contigs, which is great. The default uses k=16. With 4.6% error, many of the 16-mers will contain errors. It would be interesting to see how k=14 and w=60 perform. Just be careful: with smaller k and w, you might get bigger indexes and more false positives (from repeats). You might want to watch the computational resource usage and adjust the parameters accordingly.
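
As a rough back-of-the-envelope check on the k-mer argument above, the snippet below estimates the fraction of k-mers that contain no error. It assumes errors are independent and uniform across bases, which is a simplification for Nanopore data, and the function name `error_free_fraction` is made up for this sketch.

```python
def error_free_fraction(k, error_rate):
    """Probability that a k-mer has no errors, assuming each base is wrong
    independently with probability error_rate (a simplifying assumption)."""
    return (1.0 - error_rate) ** k


for rate in (0.01, 0.046):
    for k in (16, 14):
        frac = error_free_fraction(k, rate)
        print(f"error rate {rate:.1%}, k={k}: {frac:.1%} of k-mers error-free")
```

Under this model, about 47% of 16-mers are error-free at 4.6% error, compared with about 85% at 1% error. Dropping to k=14 raises the 4.6% figure only to about 52%, so a smaller k helps but does not close the gap on its own.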

The comparison between WTDBG2 and Peregrine is very interesting. From an assembly algorithm design point of view, the two assemblers are quite different. "Theoretically," WTDBG2 is more tolerant of divergent regions and higher-error reads. Heng Li thinks it has a higher likelihood of collapsing repeats. But your observation seems different from what we expect. One possibility is that the genome has varying degrees of heterozygosity and the two approaches happen to be optimized for different degrees of heterozygosity. Without seeing the data, this is pure speculation. Do you see any mapping anomalies when you map the error-corrected ONT reads to the FALCON assembly? They can give you a hint about what is going on.

cschin commented 5 years ago

@mycecilia any thoughts? If not, I would like to close this issue.

mycecilia commented 5 years ago

@cschin I just mapped the corrected reads and raw reads, as well as the two assemblies, to the FALCON assembly. So far I haven't seen any anomalies. I'm going to try merging the two assemblies, since they often complement each other in the one chromosome I scanned. Right now I'm computing the heterozygosity and coverage of the aligned reads to see if there is a pattern across the genome. The error-corrected reads do seem to agree with the FALCON assembly much better, although there are still a lot of deletions and insertions in them as well as in the two assemblies.

I couldn't get free CPUs to run the modified parameters last week. I'm going to do that today and update here.

cschin commented 5 years ago

@mycecilia thanks. I am closing this issue for now. If there is related discussion, we can re-open it or open a new issue. Please keep me updated if possible, perhaps I can learn one or two things from this example for future improvement.

cschin / Peregrine

Parameters for plant genome #4