DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

Corrected reads as good as Hifi reads? #11

Closed dabitz closed 4 years ago

dabitz commented 4 years ago

Hi,

Thanks for the nice tool. I am currently correcting ONT and PacBio raw reads using Illumina short reads and HiFi reads. Do you know whether I can use the corrected reads as input for HiFi assemblers like HiCanu and Hifiasm?

Thanks André

GuillaumeHolley commented 4 years ago

Hi @dabitz,

I have never tried to assemble Ratatosk corrected reads with HiCanu or Hifiasm myself. However, someone has been assembling Ratatosk corrected PacBio reads using Flye in the --pacbio-hifi mode and was very satisfied with the assembly. So I would say that it is probably worth a shot.

Guillaume

dabitz commented 4 years ago

Hi @GuillaumeHolley, thanks a lot. I will try it and see how it performs. I will get back to this post later on...

Cheers André

lileiting commented 4 years ago

Hi GuillaumeHolley,

In your paper, which Flye option did you use for the Ratatosk corrected reads, --pacbio-raw or --pacbio-corr?

I have assembled the PacBio raw reads with flye --pacbio-raw and the Ratatosk corrected reads with flye --pacbio-corr, but the latter was not better than the former. For the raw reads with flye --pacbio-raw, I got an N50 of 4.5 Mb and a max contig length of 23.0 Mb. For the Ratatosk corrected reads with flye --pacbio-corr, I only got an N50 of 4.0 Mb and a max contig length of 16.5 Mb. I am assembling a plant genome with a genome size of more than 300 Mb.

Leiting

GuillaumeHolley commented 4 years ago

Hi @lileiting,

In the preprint, we assembled our Ratatosk corrected reads with --pacbio-corr but, as said previously, some people reported on Twitter that they had been using --pacbio-hifi for their genome assembly of Ratatosk reads and were very satisfied with it.

Now, I have never tried Ratatosk on plant genome reads, so it is possible that Ratatosk does not perform well on those, although it would surprise me if the corrected reads performed worse than the raw reads. It is also possible that Flye created more misassemblies in the raw-read assembly than in the corrected-read assembly, because of the erroneous nature of the raw reads and the nature of plant genomes (aren't plant genomes very heterozygous?). Misassemblies will artificially increase the N50 and give you a false impression of better contiguity.

Finally, you only reported the N50 of your assembly, which is always useful but can be a misleading metric. Is it the contig N50 or the scaffold N50 (contigs linked together with gaps, i.e., stretches of N)? By default, Flye reports the scaffold N50, I think, but in your case it makes a lot of sense to look at the contig N50 too. As far as I recall, you don't have a reference genome, so evaluating your assembly with QUAST is out of the picture. However, since you have the short reads, you can evaluate your assembly with Merqury and report here the k-mer completeness and quality value of your assembly, with and without correction. These metrics, in combination with the contig/scaffold N50, will allow us to better compare your assemblies.
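The contig versus scaffold N50 distinction can be checked directly on an assembly FASTA. What follows is a minimal sketch, not part of Flye, Ratatosk, or Merqury: the function names are hypothetical, and it assumes that any run of at least `min_gap` N characters marks a gap between contigs (the default of 1 is an illustrative choice, not a Flye setting).

```python
import re


def read_fasta(path):
    """Yield each record's sequence from a FASTA file as one string."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def n50(lengths):
    """Largest length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def split_into_contigs(scaffold, min_gap=1):
    """Split a scaffold at runs of N (assumed gap marker) and drop empty pieces."""
    pattern = "[Nn]{%d,}" % min_gap
    return [piece for piece in re.split(pattern, scaffold) if piece]


def scaffold_and_contig_n50(sequences, min_gap=1):
    """Return (scaffold N50, contig N50) for an iterable of sequences."""
    sequences = list(sequences)
    scaffold_lengths = [len(s) for s in sequences]
    contig_lengths = [
        len(c) for s in sequences for c in split_into_contigs(s, min_gap)
    ]
    return n50(scaffold_lengths), n50(contig_lengths)
```

A call such as `scaffold_and_contig_n50(read_fasta("assembly.fasta"))` (the file name is a placeholder) would return both values, so the raw-read and corrected-read assemblies can be compared on the same metric. This does not detect misassemblies, so it complements rather than replaces a Merqury evaluation.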