cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme

GW.5 sublineage with ORF3a:W149C (>20 seq) and a further sublineage with ORF1a:P62L, S:F79S, S:A475V, ORF3a:G172C, ORF6:L61H (4 seq. UK, USA, Italy) #2277

Closed oobb45729 closed 1 year ago

oobb45729 commented 1 year ago

Defining mutations: GW.5, then ORF3a:W149C (G25839T) + nuc:A28768G

https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_1085f_cf73f0.json?c=gt-ORF3a_149&gmax=26220&gmin=25393&label=id:node_6813488

FedeGueli commented 1 year ago

Query: C450T, G25906T, G25839T

It is a quintuple mutant in the RBD, with positions 455, 456, 475, 478, and 554 mutated. cc @corneliusroemer

corneliusroemer commented 1 year ago

Yes, I've seen this before, but it was far too small then. Let's wait for a few more genomes.

ryhisner commented 1 year ago

There is a sequence from Pakistan (EPI_ISL_18122626) that belongs in the smaller sublineage. The coverage is horrendous, but it has C450T, T21798C, G25839T, and G25906T. Pakistan or a nearby area seems a very likely place of origin for this.

Also, while I'd normally dismiss changes to ORF6:D61L as artifacts, there may be something going on here. The Pakistan sequence is a total blank, but two sequences from England, one from Italy, and one from California (USA) all show something strange happening there. All four could be artifacts, but you don't usually see artifacts from completely different labs like that unless there's something unusual in that part of the genome.

ryhisner commented 1 year ago

The ORF3a:E239G/ORF1a:E1815V branch is also really interesting and worth watching. It has N:A90S/ORF9b:E86D, which is homoplasic and I think likely advantageous, and its two major branches have either S:G184S or S:D215H. (The sequence from California has terrible coverage but has an "unknown AA" at N:90/ORF9b:86, so it also has that mutation.)

And with sequences from England, Denmark, Spain, California (USA), and Michigan (USA), its geographic spread is striking. I think all these branches originated in Pakistan or a nearby region and are only now beginning to spread across the globe.

image

corneliusroemer commented 1 year ago

Yes, definitely worth designating once we have enough sequences for the placement to be stable.

HynnSpylor commented 1 year ago

The S:A475V branch: Pakistan +3. UShER now shows a total of 8 (Pakistan 3, UK 3, Italy 1, US 1).

image (QQ screenshot 20230911160846)

GW.5 with ORF3a:W149C is also partially mentioned in https://github.com/sars-cov-2-variants/lineage-proposals/issues/640

NkRMnZr commented 1 year ago

That recurring G27382C, T27383A_reversion, C27384T_reversion, ins_27384C (ORF6:D61H) was also noticed in sars-cov-2-variants/lineage-proposals#560. No idea what happened there. Could it be some kind of newly discovered artifact?

image

BTW, it is causing havoc in the subtree (downstream ORF6 frameshift, messing up the start codon of ORF7a, etc.).

oobb45729 commented 1 year ago

I think I figured out what happened there. It's an insertion of 'A' between 27382 and 27383.

GAT -> CTC -> CATC (reference codon -> ORF6:D61L -> D61L with the inserted A)

CoV-Spectrum interprets it as G27382C and ins_27384:C.

It also causes the loss of ORF6's stop codon.

T27395A(ORF7a:M1K) actually causes an AA change in the extended ORF6.
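To make the two readings of the same change concrete, here is a minimal Python sketch. It assumes only what is stated above: the reference codon at 27382-27384 is GAT, the ORF6:D61L background is CTC, and the event is an 'A' inserted between 27382 and 27383. The variable names are made up for illustration, and the codon table is the standard genetic code.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_codon(codon: str) -> str:
    """Translate one three-letter DNA codon to its amino acid."""
    return CODON_TABLE[codon.upper()]


reference = "GAT"  # 27382-27384 in the reference (ORF6 codon 61)
parent = "CTC"     # the same codon on the ORF6:D61L background

# Reading 1: an 'A' inserted between 27382 and 27383 on the D61L background.
with_insertion = parent[0] + "A" + parent[1:]

# Reading 2 (how the calls above describe it): G27382C relative to the
# reference, positions 27383/27384 back at reference, plus ins_27384:C.
alt_reading = "C" + reference[1:] + "C"

print(with_insertion, alt_reading)          # the two readings, side by side
print(translate_codon(with_insertion[:3]))  # amino acid at codon 61
```

Both readings produce the same four nucleotides, which is why the calls look like two reversions plus an insertion. The extra base shifts everything downstream by one, consistent with the frameshift and stop-codon loss described here.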

@ryhisner @NkRMnZr

FedeGueli commented 1 year ago

Yes, I think @NkRMnZr also figured out the insertion. But would it then create a new ORF6/7a protein? Just checking where the first stop codon would come. @ryhisner suggested it could be an artifact.

FedeGueli commented 1 year ago

The next stop codon, TGA, should fall between ORF7a:15 and ORF7a:16.

FedeGueli commented 1 year ago

The next start codon, ATG, is the one of ORF7b, but it is not in frame with the new protein. I don't have the expertise to say whether this could be functional or not. @ryhisner, could you check, please?
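The kind of check described in the last few comments (where the next in-frame stop falls after a frameshift, and whether a downstream ATG shares the frame) can be sketched in a few lines of Python. This is only an illustration on a toy sequence, not real genome data. The function names are hypothetical and positions are 0-based string indices, not genome coordinates.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(seq: str, start: int = 0):
    """Return the 0-based index of the first stop codon read in frame
    from `start`, or None if the sequence ends without one."""
    seq = seq.upper()
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def same_frame(pos_a: int, pos_b: int) -> bool:
    """True if two positions lie in the same reading frame."""
    return (pos_a - pos_b) % 3 == 0


# Toy example only: ATG AAA CAT TGA CC
toy = "ATGAAACATTGACC"
# A one-base insertion after the second codon shifts the frame.
shifted = toy[:6] + "A" + toy[6:]

print(first_in_frame_stop(toy))
print(first_in_frame_stop(shifted))
```

On the real genome, the same scan would start from the ORF6 start codon in the sequence carrying the insertion, and `same_frame` would compare that start with the position of the ORF7b ATG.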

ryhisner commented 1 year ago

I found a strange deletion in a couple of sequences this morning, and when I looked for similar sequences, I was led to GW.5.1.1. I think it's possible that ORF7a, ORF7b, and ORF8 are all deleted in the S:F79S branch, something like ∆27395-28246. Furthermore, if this really is a deletion, as I suspect, it also creates an extra TRS for ORF9b/N. I consider it to be the 3rd TRS for ORF9b/N because of what I view as two overlapping TRSs already present. The sequence formed from the end of ORF6 to the start of N would be as pictured below. The BLAST and Nextclade alignments agree (though there are a number of sloppy sequences that seem to butcher everything).

image

Having a 3rd TRS for ORF9b/N would not be entirely unprecedented, as Gamma had an AACA insertion in the ORF9b/N TRS that created three overlapping TRS motifs.

image


I searched for all sequences with G25839T, G25906T, and C12473T and found 129. Almost all were total blanks in ORF7a, ORF7b, and ORF8 in Nextclade, and when I uploaded the Nextclade alignment FASTA to AliView, the NNN's lined up very closely in most sequences, including sequences from numerous different countries, as seen below. (Almost all the NNN's are left out of the picture because there are way too many to show.)
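The comparison described above was done by eye in AliView. A rough programmatic version might look like the Python sketch below. It assumes each sequence is already aligned to reference coordinates with insertions stripped, so that string index + 1 equals the reference position. The function names and the toy sequence are hypothetical.

```python
import re


def n_runs(aligned_seq: str, min_len: int = 20):
    """Return 1-based inclusive (start, end) spans of N runs that are
    at least `min_len` long."""
    return [
        (m.start() + 1, m.end())
        for m in re.finditer(r"N+", aligned_seq.upper())
        if m.end() - m.start() >= min_len
    ]


def fraction_masked(aligned_seq: str, start: int, end: int) -> float:
    """Fraction of positions start..end (1-based, inclusive) that are N."""
    region = aligned_seq.upper()[start - 1:end]
    return region.count("N") / len(region) if region else 0.0


# Toy sequence only: 20 called bases, 30 Ns, 20 called bases.
toy = "ACGT" * 5 + "N" * 30 + "ACGT" * 5

print(n_runs(toy))
print(fraction_masked(toy, 11, 50))
```

Run over every sequence in the alignment, `n_runs` gives the start and end of each blank stretch. If the starts and ends cluster tightly across sequences from different labs, that is the pattern described here. The sketch cannot tell a true deletion from amplicon dropout, though.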

image

The BLAST alignments are basically identical to the Nextclade ones. Below is an example. Query is a GW.5.1.1 sequence and Sbjct is the Wuhan reference genome.

image

Maybe there's something entirely different going on here, but a deletion spanning approximately 27395-28246, where nearly every GW.5.1.1.1 has NNN's, is the simplest explanation I can think of. A consistent theme of SARS-CoV-2 evolution has been mutations that increase transcription of N/ORF9b, so the additional TRS for N/ORF9b fits that pattern. ORF8 has been almost a non-factor since the rise of BA.5, which had almost no ORF8 expression. The vast majority of XBB have ORF8:G8*, while XBC.1 almost certainly has virtually no ORF8 expression due to a TRS-ablating mutation and ORF8:K2T, which can interfere with transcription.
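As a quick sanity check on what a deletion of roughly 27395-28246 would remove, the Python sketch below intersects that span with ORF coordinates. The ORF7a, ORF7b, and ORF8 coordinates are the ones I recall from the NC_045512.2 annotation and should be verified against whatever annotation you use. The deletion endpoints are only the approximate ones given above.

```python
# Reference CDS coordinates (1-based, inclusive). Assumed from the
# NC_045512.2 annotation; verify before relying on them.
ORFS = {
    "ORF7a": (27394, 27759),
    "ORF7b": (27756, 27887),
    "ORF8": (27894, 28259),
}


def overlap_len(a, b) -> int:
    """Number of positions shared by two inclusive (start, end) spans."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


deletion = (27395, 28246)  # approximate endpoints from the comment above
deletion_length = deletion[1] - deletion[0] + 1
removed = {name: overlap_len(deletion, span) for name, span in ORFS.items()}

print(deletion_length, deletion_length % 3)
for name, (start, end) in ORFS.items():
    print(name, removed[name], "of", end - start + 1, "nt")
```

With these assumed coordinates, the span begins one nucleotide into ORF7a, covers ORF7b entirely, and stops shortly before the end of ORF8.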

Furthermore, large deletions, stop codons, and frameshifts leading to stop codons in both ORF7a and ORF7b have been relatively common in the Omicron era. My hypothesis is that ORF6 and ORF7a/ORF7b have redundant functions, so that as long as one is fully functioning, the other is disposable. Krogan Lab has done great work showing that ORF6:D61L severely reduces the ability of ORF6 to combat the interferon response, which it does by blocking the imports to and exports from the cell nucleus. https://www.biorxiv.org/content/10.1101/2022.10.18.512708v2

image


As long as ORF7a/b is fully functioning, ORF6:D61L seems tolerable. There are three mutations that more or less destroy the ORF7a TRS: C27389T, G27390T, and C27393T. These mutations have recurred throughout the pandemic, but they have been far less common in periods during which ORF6:D61L was predominant. The graphs below, which I've overlaid by making one semi-transparent, are from CoV-Spectrum. The scales of the y-axes are very different, but the trends are apparent.

image

It's more difficult to search for large deletions, but below is my attempt to compare the prevalence of large ORF7a deletions with ORF6:D61L. It doesn't catch frameshifting deletions smaller than 20 nt.

image

For reasons that are totally unclear to me, ORF7a deletions and ORF7a-TRS-destroying mutations have been FAR more common in South Africa than elsewhere, something like 20 times more common. In the graph below, I combined the ORF7a-TRS destroyers with the large ORF7a deletions, but for South Africa instead of globally. Note that unlike in the previous, global graph, the y-axes are aligned.

image

Looking at the BLAST and Nextclade alignments, there seems to be no sign of ORF6:D61L. This would restore ORF6 to its previous level of potent innate immune evasion, which would, in turn, make both ORF7a and ORF7b disposable. So all in all, if this huge string of NNN's does turn out to be an enormous deletion, as I suspect, it makes sense to me. The renewal of ORF6 through eliminating ORF6:D61L makes ORF7a redundant. ORF8 was already non-functional and just taking up space. And the additional TRS for ORF9b and N, both known potent innate immune antagonists, would both satisfy the virus's unquenchable thirst for more N/ORF9b transcription and possibly make ORF7a/ORF7b even more disposable to boot.

FedeGueli commented 1 year ago

Wow, Ryan! Please put together a paper on this! You did wonderful work! And don't forget that the virus already tried this with success in XBC.1.3, where ORF7a, ORF7b, and ORF8 are completely messed up. For reference, see: https://github.com/sars-cov-2-variants/lineage-proposals/issues/542