artic-network / artic-ncov2019

ARTIC nanopore protocol for nCoV2019 novel coronavirus
Creative Commons Attribution 4.0 International

Issue with coverage == 20 and low-freq variants in consensus sequence #65

Open MaestSi opened 3 years ago

MaestSi commented 3 years ago

Hi, I am running ARTIC pipeline v1.2.1, and I think there may be an issue when an amplicon's coverage is exactly 20. In that case, Nanopolish does not call the variant (the nanopolish variants --min-candidate-depth parameter defaults to 20), while artic_make_depth_mask does not mask the amplicon (probably because coverage is >= 20). As a result, the consensus is not masked and the variant is not introduced, so the position erroneously keeps the reference base. It would probably be better to require minimum coverage strictly >20, to avoid such edge cases. What do you think? Thanks, Simone

(image: Coverage_20_issue)
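A minimal sketch of the edge case described above, not the pipeline's actual code. It assumes the mask is applied when depth is below 20 and that a variant needs at least 20 supporting reads to become a candidate; the function names and the exact comparisons are hypothetical and only illustrate how two thresholds set to the same value can leave a gap.

```python
# Hypothetical model of the two thresholds (names and comparisons are assumptions).
MASK_BELOW_DEPTH = 20      # assumed: positions with depth < 20 are masked with N
MIN_CANDIDATE_DEPTH = 20   # assumed: a variant needs >= 20 supporting reads


def is_masked(depth: int) -> bool:
    """Return True if the position would be masked in the consensus."""
    return depth < MASK_BELOW_DEPTH


def is_candidate(variant_reads: int) -> bool:
    """Return True if the variant has enough supporting reads to be called."""
    return variant_reads >= MIN_CANDIDATE_DEPTH


def consensus_state(depth: int, variant_reads: int) -> str:
    """Return 'N' (masked), 'ALT' (variant applied) or 'REF' (reference kept)."""
    if is_masked(depth):
        return "N"
    if is_candidate(variant_reads):
        return "ALT"
    return "REF"


if __name__ == "__main__":
    # Depth exactly 20, 19 of 20 reads carry the variant:
    # not masked, not called, so the reference base is silently kept.
    print(consensus_state(20, 19))
    # One read fewer and the position is masked instead.
    print(consensus_state(19, 19))
```

Under these assumptions, a position at depth 20 where 19 reads carry the variant is reported as reference, while the same position at depth 19 is masked. Raising the masking threshold above the calling threshold would close that gap.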

ampinzonv commented 3 years ago

Hi @MaestSi, that makes sense. I'm not one of the developers, but I'm starting to implement this protocol in my country, so I wanted to be aware of any issues. Regards.

MaestSi commented 3 years ago

Hi @ampinzonv, I would also highlight that, in my opinion, the pipeline doesn't behave as it should when overlapping amplicons give discordant results, since it doesn't distinguish between non-genotypable regions (unknown genotype due to a missing amplicon) and genotypable reference regions (genotype equal to the reference). With the --strict parameter, it removes every variant in an overlapping region that is not supported by both overlapping amplicons, without considering whether both are genotypable, which is very conservative. Without --strict, on the other hand, a variant is called even if it is present in only one of the two overlapping amplicons, so spurious variants may be called. A --moderately-strict parameter would help! :)

In the following example, the variant T11654C is called and inserted in the consensus sequence because it is present in 70% of the reads of the amplicon from one read group, while it has no support at all in the other read group, so the variant frequency in the normalized bam file is about 35%. If you look at the bam file before normalization to 400X per amplicon, you will notice that T11654C is actually present in only 3% of the reads! So some very low-frequency variants end up being inserted in the consensus sequence.

image

Simone
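The arithmetic and the three behaviours discussed above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the raw depths (300 and 6700) are made-up numbers chosen only to reproduce a frequency of about 3%, the thresholds MIN_DEPTH and MIN_ALT_FRACTION are hypothetical, and the mode names "permissive" (no --strict) and "moderately-strict" are labels invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class AmpliconCall:
    depth: int       # reads from this amplicon covering the position
    alt_reads: int   # reads from this amplicon carrying the variant


MIN_DEPTH = 20           # assumed depth needed to consider an amplicon genotypable
MIN_ALT_FRACTION = 0.5   # assumed fraction needed for an amplicon to support a variant


def pooled_frequency(calls):
    """Variant frequency when reads from all amplicons are pooled together."""
    total = sum(c.depth for c in calls)
    return sum(c.alt_reads for c in calls) / total


def genotypable(c):
    return c.depth >= MIN_DEPTH


def supports(c):
    return genotypable(c) and c.alt_reads / c.depth >= MIN_ALT_FRACTION


def keep_variant(a, b, mode):
    """Decide whether a variant in an overlap region is kept (hypothetical modes)."""
    if mode == "strict":             # both amplicons must support the variant
        return supports(a) and supports(b)
    if mode == "permissive":         # one supporting amplicon is enough
        return supports(a) or supports(b)
    if mode == "moderately-strict":  # every genotypable amplicon must support it
        usable = [c for c in (a, b) if genotypable(c)]
        return bool(usable) and all(supports(c) for c in usable)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # After normalization to 400X per amplicon: 70% in one group, 0% in the other.
    normalized = [AmpliconCall(400, 280), AmpliconCall(400, 0)]
    print(pooled_frequency(normalized))   # 280 / 800

    # Made-up raw depths that give about 3% before normalization.
    raw = [AmpliconCall(300, 210), AmpliconCall(6700, 0)]
    print(pooled_frequency(raw))          # 210 / 7000

    # Discordant but both genotypable vs. second amplicon missing.
    missing = [AmpliconCall(400, 280), AmpliconCall(0, 0)]
    for mode in ("strict", "permissive", "moderately-strict"):
        print(mode, keep_variant(*normalized, mode), keep_variant(*missing, mode))
```

In this sketch, the "moderately-strict" mode rejects the variant when both amplicons are genotypable and disagree, but accepts it when the second amplicon is missing, which is the distinction the strict and non-strict modes do not make.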

MaestSi commented 3 years ago

Hi @will-rowe, I would like to have your opinion on these two scenarios, and I read you are not monitoring this repository. Thanks in advance, Simone