artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License
110 stars 68 forks source link

Edge where N being called instead of reference bp #31

Open Takadonet opened 4 years ago

Takadonet commented 4 years ago

tdlr : Edge case where N being called in position where sufficient coverage in pileup for reference base pair but being missed since nanopolish variant called low quality variant and being filtered out in downstream analyses.

We examine our consensuses calling against the pileup and noticed 3 positions ( 7203 ; 24,891 ; 24,892 ) where N's were present when sufficient coverage was found in pileup.

Example below.

image

In the case above, C should have been called since 99% of the reads matched, however nanopolish variant indicate there was an insertion from a C to CT . As indicate from vcf file below from read group nCoV-2019_2 as seen here (to my limited knowledge, read group nCoV-2019_2 is primer set of reads)

Based on the info column from that entry, looks like nanopolish has a completely different view of is in the pileup. I am sure there is post quality filtering step happening.

In any case, it does seems to be an edge case since downstream that variant is filtered into fail.vcf and then into artic_mask to be masked as N base pair. Would it not make more sense to check failed variant to see if there enough evidence to call the reference base pair?

Perhaps artic_mask command can be modified to include checking every reference base pair for depth of coverage to ensure it can be called? Indirectly calling non-variants.

What do you think?

nickloman commented 4 years ago

Thanks for the report. Firstly please can you check you are using latest nanopolish, as indel behaviour has been slightly adjusted in recent times.

So from what I can see, nanopolish is reporting:

This means that 16% of the basecalled reads support an insertion at this position, but the nanopolish model believes that actually 60% of reads support an insertion.

So it could well be that there is an insertion at this position (at some frequency).

So I think the behaviour to mask that region with Ns is intended, particularly as there is uncertainty there.

You can turn off indel calling with the --no-indels flag to artic minion if you desire.

@jts may have an opinion on this too.

jts commented 4 years ago

Hi @Takadonet ,

The QUAL score in the VCF record is very low relative to the number of reads at the position (133.0, 364 reads) so this variant fails our filters. I think this particular variant is a calling artifact and filtering it is the correct behaviour.

We consider positions that fail our filters as being unreliable to make a definitive call at, so we mask them with an N so they don't confuse phylogenetic analysis.

Thanks for pointing this variant out, I recently tweaked the indel calling behaviour so I appreciate the feedback. I'll keep on an eye these spurious insertions.

Jared

Takadonet commented 4 years ago

Thanks for quick reply!

We are currently on version 0.13.0 for nanopolish (latest env.yml from artic-ncov2019) and see that you just updated a few days ago to 0.13.1 in this repo.

Will make a new conda env and will report back the results at theses positions.

Sine there such a great disparity between what end user can review in the IGV/Tablet and what nanopolish is calling, is there another method to review base pair being called?

Anyway to address their concerns of the differences? Since there are over 10k samples in gisaid now, no point rushing to be first and more worried of submitting quality and reviewed consensus sequences.

nickloman commented 4 years ago

You could try the medaka pipeline and compare with that?

I am confused about the difference in basecalled frequency (16%) and the IGV view which suggests only 2% (8 insertions at depth 395) though. Would you be able to share the BAM file with me (perhaps privately?).

Takadonet commented 4 years ago

I will try out medaka as well and see how they differ.

I am confused as well about the differences and that why seeking for help. Just asking permission before sharing the BAM file with you privately.

Takadonet commented 4 years ago

Sent bam file link to your email @nickloman

nickloman commented 4 years ago

Thanks. So from what I can see from the alignment there is some support for that variant (judging from Tablet), but not sufficient for it to be called a variant. I agree you could confidently call reference there. One way of doing that easily is to add '--no-indels' to artic minion (but be cautious, you won't get any indels with that). I might add another optional argument to the pipeline to let it drop those low confidence variants instead of masking it.

Takadonet commented 4 years ago

Thanks for taking a look. Using medaka beta addressed this particular issue for these strains but failed for others. We are investigating what is going on.

For moment we are just going to fix them 'by hand' since only 3 SNV's across a dozen strains but in future will use the optional argument when it is available.