artic-network / artic-ncov2019

ARTIC nanopore protocol for nCoV2019 novel coronavirus
Creative Commons Attribution 4.0 International
168 stars 166 forks source link

Medaka VS Nanopolish - workflows comparison #30

Closed MaestSi closed 4 years ago

MaestSi commented 4 years ago

Hi, I basecalled SP1 data with Guppy v3.5.2 using high-accuracy model, ran artic guppyplex and then ran both Nanopolish-based and Medaka-based workflows for comparing them. I got about 7 differences over the whole genome length in the consensus sequences. The blast alignment has been split into shorter alignments due to the Ns introduced by the coverage mask, but these are the Identities of the local alignments:

Identities = 17287/17291 (99%), Gaps = 0/17291 (0%)
 Identities = 5895/5897 (99%), Gaps = 0/5897 (0%)
 Identities = 4953/4954 (99%), Gaps = 0/4954 (0%)
 Identities = 520/520 (100%), Gaps = 0/520 (0%)

And this is the full blast report: Nanopolish_Medaka_comparison_SP1.txt. Do you have a SP1 reference sequence available (e.g. sequenced with Illumina) to determine which one of the two workflows is more accurate? Secondly, I guess the default Medaka model is being used in the medaka workflow (r941_min_high_g351 in medaka 0.12.1), is there a parameter for selecting a different model?

Thanks in advance, Simone

EDIT: at a closer inspection of the differences I noticed that these are Ns present in one sequence but not in the other, so these are not genuine differences.

nickloman commented 4 years ago

Yes, you should not see any actual differences in terms of SNPs.

We use an N-masking strategy to mask uncertain regions and these may be different according to the workflow (nanopolish or medaka) used.

nickloman commented 4 years ago

There's no ability to select a different model: we don't think it matters very much, but I think I will add this for completeness in a future release.

MaestSi commented 4 years ago

Thanks. I am not a python expert at all, but I am trying to understand a little bit the code. What I still don't understand is why %s.preconsensus.fasta files generated by artic_mask already differ in terms of Ns. Since %s.coverage_mask.txt are identical between the 2 workflows but %s.fail.vcf files are not, I guess the difference in Ns is due to those vcf files. May that be the reason for the those differences? Thank you again for your time. Simone

nickloman commented 4 years ago

Yes, the Ns will be contributed due to low quality variants written out in the fail.vcf file. We are probably being a bit conservative in N-masking some of them.