Closed mkedwards closed 3 years ago
I mean, honestly, I think the way that the N:R203M (B.1.617) clade has been split up is kind of busted. There are three or four loosely related subclades in there (none of which is B.1.617.3, which is more or less noise); and a healthy chunk of what you're labeling "B.1.617.2" is more closely related to B.1.617.1. Try slicing by ORF1b polymorphisms and you'll see what I mean.
https://nextstrain.org/ncov/asia?branchLabel=aa&c=gt-ORF1b_2310,662,1000,2285,1570&gt=N.203M&m=div
It's pretty clearly the N:G215C subclade that's exploding in the UK.
T3646A might be a better or worse alternative to del3675/3677 in other major variants. On the DeepMind Nsp6 prediction, the 106-108 deletion (yellow) is in a loop and T77A (red) is in the middle of residues 60-98 (green) that are fairly conserved. Mutations in immunocompromised patients (orange) also contact this region and it's not hard to guess how L37F could play a similar role if any of this is significant (and if the predicted structure is close to accurate).
T95I was also added in this subclade, and the subclade with T95I is now ~half of UK sequences; India is less well sampled but looks similar. I guess that the increase in these mutations has something to do with Nepal/Sri Lanka/Maldives/Seychelles increasing transmission so quickly and simultaneously (on top of what you'd expect from increased imports from India). It would be useful to know what, if any, transmission advantage there is, since countries have different starting points depending on April/May travel policies; extrapolating to 100% Delta to decide near-term policies is difficult otherwise.
Also maybe of interest is N:G215C changing to N:del214/215. This is at the root of one of the large Singapore clusters, and it now looks like a significant cluster in England, starting 23/May and with 34 sequences already.
Ping? This subclade (as defined by the ORF1b SNPs above) is now up to something like 97% of Delta in the UK.
I guess with AY.1 being a sublineage of this it's a bit complicated to sort out the designations and it seems to be getting more complicated quickly.
Here's a toy model I made of how Nsp6 mutations might be relevant based on DeepMind structure predictions; L37, T77, and SGF106-108 all could plausibly play a role in the conformation of a methionine-rich region (yellow), which could be sensitive to oxidative stress to gate double membrane vesicles in a hexamer pore. Several K and one R outside (purple) form a possible RNA-binding region (also some conservation here). An Nsp4 dodecamer outside is more speculative, but its dimensions are roughly consistent with MHV EM data (Wolff et al. DOI: 10.1126/science.abd3629), it fits the Nsp6 hexamer well, and its surface charge is consistent with typical ER transmembrane protein polarity.
I've run out of relevant expertise at this point and the learning curve is steep for either of those, so if anyone finds this plausible and wants to pursue this lmk and I'll send PDBs; maybe interesting to refine the structure with weighted coarse-grained simulations or look for co-evolution at possible binding interfaces.
Sorry this isn't particularly relevant to Pango designations; I think possibly-relevant non-spike mutations aren't in the picture as much as they maybe should be.
Add S:EFR156–158G (which seems to be standard for Delta; it must have gotten lost when I was splitting out the subclade changes from base B.1.617.2) and S:T95I (which is common but not universal), and you get the strain that I am guessing hit the Nos Tayons retirement home in Nivelles, Brabant Wallon, Belgium: https://infodujour.fr/societe/50183-douze-morts-du-covid-en-belgique-tous-etaient-vaccines
(Check out the ages of the patients in the two clusters. I don't have independent confirmation that these are the Nos Tayons samples, but the location and dates appear to check out.)
@zach-hensel the Wolff et al. paper is excellent science, and it's great to see you building on it. Note that these pores are also the sites of vRNP formation, and I'd think there could be some functional coupling between nsp6 and N mutations. One would expect N:R203M, and possibly also N:G215C, to modify the phosphorylation dynamics in the SR-rich region and therefore also the binding to various 14-3-3 isoforms (https://www.biorxiv.org/content/10.1101/2020.12.26.424450v2.full). I don't think it's yet known how the N dimers are assembled into RNPs, or even what the exact stoichiometry of an RNP is, though there's some progress on these fronts (https://onlinelibrary.wiley.com/doi/10.1002/pro.3909, https://www.cell.com/cell/fulltext/S0092-8674(20)31159-4). D377Y (which has been repeatedly reinvented) is in the C-tail ("spacer B / N3") region and may affect the formation of these higher-order N assemblies. Any insight you might have into how these mutations relate to these structures, and what motifs in the nsp4/nsp6 transmembrane pore might interact with them, would be of great interest.
(Also compare the SARS-CoV-2 tomograms in https://www.nature.com/articles/s41467-020-19619-7 to their analogues in MHV. It's a pity they didn't capture DMV pores or vRNP formation.)
N:del214/215 sequences have more than tripled in the UK since I posted 9 days ago and are a majority of recent sequences in Singapore, so I think they are responsible for the current large cluster there. Could very well be noise but something to keep an eye on. Also possibly relevant is G214C in C.37; anything with an advantage in Peru is notable, though the giant S deletion is more notable. Possibly disrupting phosphorylation confers a small advantage, based on the number of (occasionally large) deletions and mutations between N:202 and N:215.
Edit: 1 week later a few more N:del214/215 sequences but no sign of growth relative to everything else in the UK the past few weeks.
Just a little update on this subclade: it was mentioned in the PHE technical briefing 19: "the phylogeny of Delta in the UK, which is dominated by a large distinct clade. The clade has distinguishing mutations outside spike with uncertain biological significance, including NSP3: A488S, P1228L, P1469S; NSP4: D144D, V167L, T492I; NSP6: T77A, V120V; NSP14:A394V; ORF7b: T40I; N: G215C. The dominance of this clade in the UK may relate to epidemiological or biological effects or both. Further investigations are being undertaken"
It was weird to see this list omit T95I (which I think, though I'm not certain, is in most but not all of this clade) while S:T95I is mentioned for the new VOI, given that it's in that VOI and some others. I think it's not very important but also not unimportant.
Note also that this subclade has become dominant in the sequences coming out of Bangladesh, which has been having a terrible surge in recent weeks. I don't think this is just a founder effect.
Yes, there seems to be a major division within the "Delta" variant. Congratulations on noticing it early. It can be shown on phylogenetic trees and by clustering based on the Jaccard distance between sets of mutations. The sub-lineage which you noticed seems to be growing much faster than the rest of "Delta" in every location we checked.
Here is the % of this sub-lineage within all "Delta" sublineages
What we tried was clustering of all Delta genomes from GISAID based on the Jaccard distance between their combinations of mutations (we included only protein-altering mutations). We obtained this:
And then checked the mutations most characteristic for the 2 major clusters there (the bigger one is Cluster 1)
These are very similar to what was already discussed here, so I would also say that there is a strong "case" for defining a new lineage containing a big part of the current B.1.617.2 (and probably AY.1 and some of AY.3?)
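For anyone who wants to try the same kind of analysis, here is a minimal sketch of clustering genomes by the Jaccard distance between their sets of protein-altering mutations, then listing the mutations most characteristic of a cluster. It is not the code used for the figures above: the single-linkage threshold cut, the function names, the 0.25 cutoff and the five toy genomes are all assumptions made up for illustration. Only the mutation labels are taken from this thread.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_by_threshold(genomes, max_dist):
    """Single-linkage cut: link genomes whose distance is <= max_dist,
    then return the connected components, largest first."""
    parent = {name: name for name in genomes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g1, g2 in combinations(genomes, 2):
        if jaccard_distance(genomes[g1], genomes[g2]) <= max_dist:
            parent[find(g1)] = find(g2)

    clusters = {}
    for name in genomes:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values(), key=len, reverse=True)


def characteristic_mutations(genomes, cluster, min_frac=0.9):
    """Mutations found in >= min_frac of the cluster and in no genome outside it."""
    inside = [genomes[n] for n in cluster]
    outside = [muts for n, muts in genomes.items() if n not in cluster]
    candidates = set().union(*inside)
    return sorted(
        m for m in candidates
        if sum(m in g for g in inside) / len(inside) >= min_frac
        and not any(m in g for g in outside)
    )


# Toy genomes (hypothetical samples, not real sequences).
BASE = {"S:T19R", "S:L452R", "S:T478K", "S:P681R", "N:R203M", "N:D377Y"}
genomes = {
    "g1": BASE | {"N:G215C", "ORF1a:T3646A", "ORF1a:P2046L"},
    "g2": BASE | {"N:G215C", "ORF1a:T3646A", "ORF1a:P2046L", "S:T95I"},
    "g3": BASE | {"N:G215C", "ORF1a:T3646A", "ORF1a:P2046L", "ORF7b:T40I"},
    "g4": set(BASE),
    "g5": BASE | {"S:T95I"},
}

clusters = cluster_by_threshold(genomes, max_dist=0.25)
print(clusters)
print(characteristic_mutations(genomes, clusters[0]))
```

On real GISAID data a proper hierarchical clustering library would be a better choice than this quadratic loop, and the cutoff would need tuning. The point is only that a mutation like S:T95I, which appears on both sides of the split, drops out of the characteristic list while N:G215C and the ORF1a changes remain.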
Hi @lukaszgit, thank you for your analysis: cluster 2 in your table was previously proposed as a lineage in issue #112, which is still open.
Hi, thanks! I did not notice #112.
Yes, definitely current "Delta" lineages can be improved. It also probably suggests the importance of mutations in proteins other than Spike.
Thanks all for the submission and discussion. We're currently working on fitting all of B.1.617.2 into the Pango rules and will hopefully have an update soon. I'll close this one for now but potentially link back in once we have the update.
New lineage proposal
by Michael K. Edwards m.k.edwards@gmail.com
Description
Sub-lineage of: B.1.617.2
Earliest sequence: collected 2020-11-21
Most recent sequence: Now.
Countries circulating: Global.
I think this subclade (with N:G215C) ought to be tracked separately from the rest of B.1.617.2. It appears to have been later in spreading outside India but may be spreading considerably faster where it's introduced.
Appears to have originated in India, perhaps specifically Uttar Pradesh.
The closest thing genomically to the founding strain in Nextstrain's Asia subsample appears to be GISAID EPI_ISL_2272809; the earliest sample I've seen (which has some other drift) is EPI_ISL_2373501 (collected 2020-11-21). Here are the visible changes specific to the variant:
N: G215C
ORF1a: A1306S, P2046L, P2287S, V2930L, T3255I, T3646A
ORF1b: G662S, P1000L, A1918V
ORF7b: T40I
plus A11332G (synonymous in nsp6)
on top of the set of mutations that appear to define B.1.617.2:
M: I82T
N: D63G, R203M, D377Y
ORF1b: P314L
ORF3a: S26L
ORF7a: V82A, T120I
ORF9b: T60A
S: T19R, EFR156–158G, L452R, T478K, D614G, P681R, D950N
Note that there is some calling weirdness at the end of ORF8 in a lot of B.1.617.2, and I'm leaving that out.
Genomes
Easy enough to extract from a Nextstrain search. Note that S:G142D is some kind of primer-related calling noise. What seems to define this subclade is a series of drift-y mutations followed by an explosion after the discovery of ORF1a: P2046L,T3646A. Note that T3646A is also present in B.1.617.1. I am not wholly convinced that all of B.1.617.2 shares a common ancestor that B.1.617.1 doesn't. Then again, P2046L has also been discovered in multiple unrelated clades; it's one of the defining mutations of B.1.351.3. And C.36 also has ORF1a:P2287S, and T3255I is in several other clades, including B.1.1.519. I'm starting to think that all four of these belong in the "fitness-increasing short putts from almost anywhere" category along with S:E484K and S:N501Y.
So if you want something really clade-defining, I'd filter on nucleotide A11332G. N: G215C is also a good candidate, and is right there in the region that seems to most sensitively affect phosphorylation dynamics of the SR-rich linker — so it might be the real functional change. (Take that, Spike-centric analyses!)
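As a rough illustration of what filtering on those two markers could look like outside the Nextstrain UI, here is a small sketch that flags genomes carrying A11332G or N:G215C and reports the subclade's share per location. Everything about the data layout is an assumption: the record format, the `nuc:` prefix for nucleotide changes, the `min_hits` tolerance for dropped calls, and the toy records are all made up for this example, not taken from any real pipeline.

```python
from collections import defaultdict

# Markers suggested in the proposal text. The "nuc:" prefix is only a
# labeling convention invented for this sketch.
SUBCLADE_MARKERS = frozenset({"nuc:A11332G", "N:G215C"})


def in_subclade(mutations, min_hits=1):
    """True if at least min_hits markers are present.
    min_hits=1 tolerates a genome where one marker was not called."""
    return len(SUBCLADE_MARKERS & set(mutations)) >= min_hits


def subclade_share(records):
    """Percentage of records per location that carry the subclade markers."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["location"]] += 1
        if in_subclade(rec["mutations"]):
            hits[rec["location"]] += 1
    return {loc: 100.0 * hits[loc] / totals[loc] for loc in totals}


# Hypothetical Delta records from two placeholder locations.
records = [
    {"location": "A", "mutations": {"N:R203M", "N:G215C", "nuc:A11332G"}},
    {"location": "A", "mutations": {"N:R203M", "N:G215C"}},
    {"location": "A", "mutations": {"N:R203M"}},
    {"location": "A", "mutations": {"N:R203M", "nuc:A11332G"}},
    {"location": "B", "mutations": {"N:R203M"}},
    {"location": "B", "mutations": {"N:R203M", "N:G215C", "nuc:A11332G"}},
]

print(subclade_share(records))
```

Requiring both markers (`min_hits=2`) would be stricter but would lose genomes with amplicon dropout at either site, so which setting is appropriate depends on the sequencing quality of the dataset.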
Evidence
https://nextstrain.org/ncov/asia?branchLabel=aa&c=gt-N_215&f_pango_lineage=B.1.617.2&m=div
Observe the explosion in both sample count and genomic diversity after the discovery of ORF1a: P2046L,T3646A.
Proposed lineage name
B.1.617.5?