cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme

EG.5.1 sublineage with ORF1b:K2557R first detected in Guangdong, China (111 GISAID seqs as of 2023-07-21; Asia, Europe, North and South America, Australia) #2117

Closed alurqu closed 1 year ago

alurqu commented 1 year ago

There may be an EG.5.1 sublineage with ORF1b:K2557R (A21137G; NSP16:K160R) first detected in Guangdong, China.

This lineage is the parent of the lineage described in https://github.com/sars-cov-2-variants/lineage-proposals/issues/428 and is proposed based on the discussion in that pre-proposal.

As of 2023-07-20, Cov-Spectrum reports 85 good-quality (88 total) EG.5.1+ORF1b:2557R sequences directly on the EG.5.1 S:Q52H polytomy. Source: https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3AEG.5.1+%26+ORF1b%3AK2557R+%26+A22531A+%26+T22480T+%26+C25572C+%26+G14186G&nextcladeQcOverallScoreTo=29&
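For anyone who wants to adapt the query above to a sibling branch, here is a minimal sketch of how that Cov-Spectrum URL can be assembled. The helper names `variant_query` and `explore_url` are hypothetical. Reading terms such as `A22531A` as "this position still carries the unmutated base" is my interpretation of the query, not something confirmed in this thread.

```python
from urllib.parse import quote_plus

# Explore page taken from the URL quoted above.
BASE = "https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants"


def variant_query(lineage, mutations, unmutated_sites):
    """Join the lineage filter and mutation terms with ' & ' (logical AND).

    unmutated_sites holds terms like "A22531A", assumed here to mean that the
    position is unchanged, which keeps the sibling branches out of the result.
    """
    terms = [f"nextcladePangoLineage:{lineage}", *mutations, *unmutated_sites]
    return " & ".join(terms)


def explore_url(query, max_qc_score):
    """URL-encode the query and append the Nextclade QC score cutoff."""
    return (
        f"{BASE}?variantQuery={quote_plus(query)}"
        f"&nextcladeQcOverallScoreTo={max_qc_score}&"
    )


query = variant_query(
    "EG.5.1",
    ["ORF1b:K2557R"],
    ["A22531A", "T22480T", "C25572C", "G14186G"],
)
url = explore_url(query, 29)
print(url)
```

Swapping one of the unmutated terms for its mutated form (for example `C25572T` in place of `C25572C`) should select the corresponding sibling branch instead, under the same assumption about the query syntax.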

Note that Bloom and Neher's data https://jbloomlab.github.io/SARS2-mut-fitness/ and https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aamut_fitness_by_clade.csv shows ORF1b:K2557R aka NSP16:K160R as highly favorable in all SARS-CoV-2 clades. NSP16:K160R is also defining in lineage FL.1.5.1. There may be multiple EG.5.1 lineages with ORF1b:K2557R emerging convergently due to the favorability of ORF1b:K2557R. This proposal is specifically for the large sublineage that has emerged from the EG.5.1 S:Q52H polytomy. The other lineages may be emerging after synonymous nucleotide mutations C25572T (leading to a possible emerging EG.5.1.1 with ORF1b:K2557R), A22531G, T22480C, and G14186A (which is in the T22480C branch but is followed by a C22480T reversion on UShER).

This lineage has been reported from multiple countries in all populated continents except Africa and contains NSP16:K160R which Bloom and Neher's data https://jbloomlab.github.io/SARS2-mut-fitness/ and https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results/aa_fitness/aamut_fitness_by_clade.csv shows as highly favorable in all SARS-CoV-2 clades. NSP16:K160R is also defining in lineage FL.1.5.1.

This lineage does not have a sufficient number of samples in any one country to reliably determine growth advantage, but it is emerging in several countries with 17 sequences from Japan, 15 from the USA, 13 from Austria, 9 from Hong Kong, 6 each from China and Sweden, 5 from Australia, and 1 to 4 sequences from each of 7 other countries on CoV-Spectrum as of 2023-07-20.

As of 2023-07-20, UShER shows all of the CoV-Spectrum samples are on a single subtree with evidence of additional branching.

[Image: UShER-EG 5 1+ORF1b_2557-polytomy]

To visualize on UShER: https://nextstrain.org/fetch/github.com/alurqu/pango-designation-support-alurqu/raw/main/2023/07/subtreeAuspice1_genome_CoV-Spectrum_EG.5.1%2BORF1b_2557R-polytomy.json?c=gt-ORF1ab_6958&label=id%3Anode_6690627

A larger view shows this lineage and the other possible emerging convergent lineages. The lineage proposed here is the larger middle lineage.

[Image: UShER-EG 5 1+ORF1b_2557-siblings]

To visualize on UShER: https://nextstrain.org/fetch/github.com/alurqu/pango-designation-support-alurqu/raw/main/2023/07/subtreeAuspice1_genome_CoV-Spectrum_EG.5.1%2BORF1b_2557R-siblings.json?c=gt-ORF1ab_6958&gt=ORF1ab.6958R&label=id%3Anode_6688339

GISAID query: G21718T, T22930A, A21137G, C22480T but exclude C25572T, A22531G, and G14186A. (The EG.5.1+T22480C branch may be a reversion or an artifact.)
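The include/exclude logic of that GISAID query can be sketched as a simple set filter. This is only an illustration: the sample names and mutation lists below are made up, and `matches_query` is a hypothetical helper, not part of any GISAID tooling.

```python
# Mutations the GISAID query requires and excludes, copied from the query above.
REQUIRED = frozenset({"G21718T", "T22930A", "A21137G", "C22480T"})
EXCLUDED = frozenset({"C25572T", "A22531G", "G14186A"})


def matches_query(mutations):
    """True if a sample has every required mutation and none of the excluded ones."""
    muts = set(mutations)
    return REQUIRED <= muts and muts.isdisjoint(EXCLUDED)


# Hypothetical samples for illustration only (not real GISAID records).
samples = {
    "sample_in_lineage": ["G21718T", "T22930A", "A21137G", "C22480T"],
    "sample_sibling_branch": ["G21718T", "T22930A", "A21137G", "C22480T", "C25572T"],
    "sample_missing_k2557r": ["G21718T", "T22930A", "C22480T"],
}

selected = sorted(name for name, muts in samples.items() if matches_query(muts))
print(selected)
```

Only the first sample passes: the second carries an excluded sibling-branch mutation, and the third lacks A21137G.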

First GISAID Sequence: Guangdong, China 2023-05-01

Most Recent GISAID sequence: Shanghai, China 2023-07-09

A zip archive of CoV-Spectrum UShER output files for this lineage with and without its possible convergent siblings is available at Support-EG.5.1+ORF1b_2557R.zip

A CoV-Spectrum list of GISAID EPI ISLs for good-quality sequences is available at gisaid-epi-isl-EG.5.1+ORF1b_2557R.txt

FedeGueli commented 1 year ago

This lineage is the highest in my chart of EG.5.1/EG.5.1.1 sublineages (shown as A21137G):

[Image: Screenshot 2023-07-22 at 12.49.12]

https://cov-spectrum.org/collections/181

Keep in mind that the second one is C11779T, which is likely inflated because it is mainly a Chinese lineage (cc @Memorablea @aviczhl2 @Over-There-Is): as we observed when EG.5.1.1 arose, it seemed to be growing much faster than it was in the real world because of China's higher sequencing intensity.

FedeGueli commented 1 year ago

ping @corneliusroemer @InfrPopGen @thomaspeacock @AngieHinrichs When a mutation predicted to be fit arises in what is already the fittest lineage, it sounds like we need to track it publicly, hence I suggest a rapid designation here.

aviczhl2 commented 1 year ago

I think so, another non-spike convergent evolution point.

BorisUitham commented 1 year ago

One of these sequences also has 455F: EPI_ISL_17997269 | 2023-06-29 | Hong Kong

ryhisner commented 1 year ago

Agree with @aviczhl2—NSP16_K160R (ORF1b:K2557R) has been a convergent non-spike mutation for a long time. Convergent ORF1b mutations are pretty unusual. This one could turn out to be similar to NSP9_T35I (ORF1a:T4175I), which was common in chronic-infection sequences for a long time but which was not in any major lineage until XBB.1.9 entered the picture.

I think there are a pretty large number of mutations similar to this: advantageous, but not advantageous enough to overcome the transmission bottleneck and be acquired by multiple circulating lineages. But when a widely circulating variant has one of these mutations, the modest advantage it confers becomes clear for the first time.

A similar case is NSP5_K90R (ORF1a:K3353R), and I wouldn't be surprised to see that one turn up in a fast-growing lineage at some point.

aviczhl2 commented 1 year ago

as we observed when EG.5.1.1 arose, it seemed to be growing much faster than it was in the real world because of China's higher sequencing intensity.

Yeah, there are statistical biases due to sequencing density, changes in sequencing density over time, sampling strategy, the time between collection and submission, and region-level sampling.

Most of them can be somewhat addressed by de-biasing techniques (except for sampling strategy). So I'm trying to come up with some code to de-bias this and at least help us better filter the fastest branches of the most advanced lineages.
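As a rough illustration of one such correction (not necessarily the approach intended here), the sequencing-density bias can be reduced by computing a lineage's share within each region first and then averaging those shares with external weights such as population, instead of pooling all sequences. All numbers, region names, and weights below are invented for the example.

```python
def pooled_share(counts):
    """Naive share: pool all sequences, so heavily sequenced regions dominate."""
    lineage = sum(c["lineage"] for c in counts.values())
    total = sum(c["total"] for c in counts.values())
    return lineage / total


def weighted_share(counts, weights):
    """Average the per-region shares using external weights (e.g. population)."""
    num = sum(weights[r] * c["lineage"] / c["total"] for r, c in counts.items())
    den = sum(weights[r] for r in counts)
    return num / den


# Hypothetical data: region_x sequences ten times more than region_y,
# but region_y is assumed to represent nine times as many people.
counts = {
    "region_x": {"lineage": 90, "total": 1000},
    "region_y": {"lineage": 1, "total": 100},
}
weights = {"region_x": 1.0, "region_y": 9.0}

naive = pooled_share(counts)
adjusted = weighted_share(counts, weights)
print(f"pooled: {naive:.4f}, weighted: {adjusted:.4f}")
```

In this toy case the pooled share is driven almost entirely by region_x, while the weighted share is much lower. Choosing the weights is itself a judgment call, and this does nothing for the sampling-strategy bias mentioned above.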

However, I don't know how to run automated GISAID data queries the way cov-spectrum does. I tried LAPIS https://lapis.cov-spectrum.org/open/docs/#filter-pango-lineages but it seems to return different results from GISAID. For example, for 4561A,23587C,22995G, it returns only 1 sequence instead of GISAID's 31.

alurqu commented 1 year ago

This lineage has been designated EG.5.1.4 in commit https://github.com/cov-lineages/pango-designation/commit/aa939921f62716535bcdc45eb9b08e9ece3eeec2, so I'm closing this ticket as complete.

FedeGueli commented 1 year ago

@corneliusroemer , milestone

FedeGueli commented 1 year ago

This lineage has been designated EG.5.1.4 in commit aa93992, so I'm closing this ticket as complete.

still not closed! check!