cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme

Potentially new BA.2 Sublineage with S:A27P mostly in Washington State, USA [112 seqs] #555

Closed alurqu closed 2 years ago

alurqu commented 2 years ago

BA.2 sequences with the additional mutation S:A27P (nuc G21641C) have appeared mostly in Washington State, USA, with the first sequence dated 2022-02-07 and the most recent dated 2022-03-23. On Cov-Spectrum at the time of issue creation, 112 sequences with this mutation are assigned to lineage BA.2, with an additional 34 in BA.2.3 and 1 in BA.2.2. Of the strictly-BA.2 sequences, 106 are from Washington State, 5 are from Finland, and 1 is from Denmark. Of the 35 additional sequences in the broader BA.2* set, 32 are from Washington State and 3 are from Finland.

The Cov-Spectrum URL for the narrower set is https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?aaMutations=S%3A27P&pangoLineage=BA.2&, and the Cov-Spectrum URL for the broader set is https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?aaMutations=S%3A27P&pangoLineage=BA.2*&

The S:A27P mutation has occurred 46 times scattered across other lineages but never at this frequency.

Across all NCBI GenBank sequences accessible through Cov-Spectrum, S:A27P occurs in 0.0037%. Among sequences since 01 March 2022, however, S:A27P occurs in 0.24%, a more than 64-fold increase in frequency in recent sequences.
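As a quick check of that ratio, here is a minimal sketch that uses only the two percentages quoted above. The variable names are illustrative and not taken from any tool.

```python
# Frequencies of S:A27P quoted above, as a percentage of sequences.
overall_pct = 0.0037  # all NCBI GenBank sequences accessible through Cov-Spectrum
recent_pct = 0.24     # sequences since 01 March 2022

# Ratio of the recent frequency to the all-time frequency.
fold_increase = recent_pct / overall_pct
print(f"fold increase: {fold_increase:.1f}x")
```

The ratio comes out between 64 and 65, consistent with the "more than 64x" figure.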

Note: any coaching regarding additional vetting of this and other potential sublineages, such as how to properly check for a monophyletic clade, will be appreciated.

CoV-Spectrum sequence lists for the narrower and broader sets are attached: BA.2+S_A27P-cov-spectrum-contributors.csv and BA.2star+S_A27P-cov-spectrum-contributors.csv

FedeGueli commented 2 years ago

Here is the tree with some of the latest WA sequences: https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/singleSubtreeAuspice_genome_1e871_67bc80.json?branchLabel=Spike%20mutations&c=gt-S_27&label=nuc%20mutations:T670G,C2790T,G4184A,C4321T,C9344T,A9424G,C9534T,C9866T,C10198T,G10447A,C12880T,T15240C,C15714T,C17410T,C19955T,A20055G,C21618T,T21762C,T21846C,T22200G,C22673T,A22688G,G22775A,A22786C,A24130C,C26060T,C26858T,G27382C,A27383T,T27384C,A29510C

Weird, I cannot find S:27P.

Is this masked on UShER?

@AngieHinrichs @corneliusroemer @chrisruis @tompeacock

Addendum: While analyzing this issue I noticed that in Denmark 1/6 of BA.2 sequences have S:27A (reverted to wt). Is this real, or is it just backfilling to reference?

(https://cov-spectrum.org/explore/Denmark/AllSamples/Past3M/variants?aaMutations=S%3A27A&pangoLineage=BA.2*&)

thomasppeacock commented 2 years ago

I just had a look to check that this wasn't an artefact caused by different alignments of the out-of-codon-sync deletion that BA.2 has. There does appear to be an additional nucleotide change (T21632C or G21641C) compared to normal BA.2, but I think analysis software is going to struggle to pick this up, because its position adjacent to the (lineage-defining) deletion makes it ambiguous.

Running the sequences through UShER, a few do cluster together, but overall they fall quite scattered throughout BA.2. Because of this lack of clustering, I do wonder whether this might still be some sort of sequencing/bioinformatics artefact rather than a real mutation. https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/singleSubtreeAuspice_genome_3419f_6a4240.json?c=userOrOld&label=nuc%20mutations:T670G,C2790T,G4184A,C4321T,C9344T,A9424G,C9534T,C9866T,C10198T,G10447A,C12880T,T15240C,C15714T,C17410T,C19955T,A20055G,C21618T,T21762C,T21846C,T22200G,C22673T,A22688G,G22775A,A22786C,A24130C,C26060T,C26858T,G27382C,A27383T,T27384C,A29510C

silcn commented 2 years ago

Weird, I cannot find S:27P.

Is this masked on UShER?

Search for nuc:21632C instead and you'll find it. The BA.2 deletion spans 21633-21641, so the substitution (if it's not an artefact) is definitely 21632C and not 21641C. But Nextclade and cov-spectrum don't know this when they see the sequences, so when they see BA.2 with 21632C they instead call it as a deletion from 21632-21640 with 21641C.
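To make the ambiguity concrete, here is a minimal sketch showing that both calls describe the same observed bases. The reference fragment for positions 21629-21646 is written from memory of Wuhan-Hu-1 and should be treated as an assumption to verify against NC_045512.2; `apply_call` is a hypothetical helper, not part of Nextclade, cov-spectrum, or UShER.

```python
# Reference fragment covering genome positions 21629-21646.
# Assumption: recalled from Wuhan-Hu-1; verify against NC_045512.2 before relying on it.
REF_START = 21629
REF = "CAATTACCCCCTGCATAC"

def apply_call(deletion, substitution=None):
    """Rebuild the ungapped bases implied by a deletion range plus an optional substitution."""
    del_start, del_end = deletion
    out = []
    for i, base in enumerate(REF):
        pos = REF_START + i
        if del_start <= pos <= del_end:
            continue  # this reference base is absent from the sample
        if substitution is not None and pos == substitution[0]:
            base = substitution[1]
        out.append(base)
    return "".join(out)

# Call with the BA.2 deletion in its usual place plus 21632C.
del_21633_with_21632c = apply_call((21633, 21641), (21632, "C"))
# Call with the deletion shifted one base left plus 21641C.
del_21632_with_21641c = apply_call((21632, 21640), (21641, "C"))
# Normal BA.2: the deletion only.
plain_ba2 = apply_call((21633, 21641))

print(del_21633_with_21632c, del_21632_with_21641c, plain_ba2)
```

Under these assumptions the two calls produce identical bases, so an aligner that sees only the read cannot tell them apart without knowing where the lineage-defining deletion sits.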

Usher correctly calls this as 21632C, but because this is in codon S:24 rather than S:27, it isn't interpreted correctly; you can't find it by searching for S:24 either because of the deletion. But the nucleotide change is still there!
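The codon bookkeeping can be sketched as follows. This assumes the spike ORF starts at genome position 21563 and that reference positions 21642-21643 are "CA" (the last two bases of the alanine codon at S:27); both should be checked against NC_045512.2. The function name and the partial codon table are illustrative only.

```python
S_GENE_START = 21563  # assumed first nucleotide of the spike ORF (Wuhan-Hu-1 coordinates)

# Only the codons needed for this example; not a full genetic code table.
CODON_TO_AA = {"GCA": "A", "TCA": "S", "CCA": "P"}

def spike_codon(pos):
    """Return (codon number, 0-based offset within that codon) for a genome position."""
    offset = pos - S_GENE_START
    return offset // 3 + 1, offset % 3

codon_of_21632 = spike_codon(21632)  # first base of codon 24
codon_of_21641 = spike_codon(21641)  # first base of codon 27

# After deleting 21633-21641, base 21632 sits directly before 21642-21643 ("CA", assumed).
normal_ba2 = CODON_TO_AA["T" + "CA"]   # what normal BA.2 reports at S:27
with_21632c = CODON_TO_AA["C" + "CA"]  # what these sequences report at S:27
reference_27 = CODON_TO_AA["GCA"]      # undeleted reference codon 27

print(codon_of_21632, codon_of_21641, normal_ba2, with_21632c, reference_27)
```

So the changed base belongs to codon 24 in reference coordinates, but once the deletion is applied it becomes the first base of the codon reported at S:27.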

This is what normal BA.2 shows:

A27S

And this is what Nextclade incorrectly shows for these sequences:

A27P

Addendum: While analyzing this issue I noticed that in Denmark 1/6 of BA.2 sequences have S:27A (reverted to wt). Is this real, or is it just backfilling to reference?

(https://cov-spectrum.org/explore/Denmark/AllSamples/Past3M/variants?aaMutations=S%3A27A&pangoLineage=BA.2*&)

Those sequences are missing the 9-nucleotide deletion, which has the effect of reverting S:27 to WT. They're spread across multiple sublineages, so clearly not real.

FedeGueli commented 2 years ago

Thank you very much @silcn for the double explanation. Now it makes sense!

corneliusroemer commented 2 years ago

Good explanation @silcn. I'll close this for now unless we have strong evidence this is not an artefact.

AngieHinrichs commented 2 years ago

Thanks @thomasppeacock and @silcn. And just to confirm:

Weird, I cannot find S:27P.

Is this masked on UShER?

Yes, nucleotides 21633-21641 are masked in BA.2 in the UShER tree, because of the BA.2 deletion and the general problem of some genome assembly pipelines reporting "substitutions" at deleted sites.
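The effect of that masking on a search can be illustrated with a minimal sketch. This is not UShER's actual implementation; `drop_masked` and the example call list are hypothetical, and only the masked range comes from the comment above.

```python
MASKED_START, MASKED_END = 21633, 21641  # range quoted above for BA.2

def drop_masked(substitutions):
    """Keep only substitution calls whose position lies outside the masked range."""
    kept = []
    for sub in substitutions:
        pos = int(sub[1:-1])  # e.g. "T21632C" -> 21632
        if not (MASKED_START <= pos <= MASKED_END):
            kept.append(sub)
    return kept

# Hypothetical calls for one sample, mixing both ways of reporting the change.
calls = ["C21618T", "T21632C", "G21641C", "T21762C"]
visible = drop_masked(calls)
print(visible)
```

In this sketch a call at 21641 disappears while 21632C survives, which matches the advice earlier in the thread to search for nuc:21632C instead of S:27P.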