cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme

Potential BA.5.2+ORF1b:T1050N Sublineage with C27012T, C27513T, and E:T9I reverted to E:T9T (> 2000 seqs; Japan, other Asia, North America, Europe, Australia) #1061

Closed alurqu closed 2 years ago

alurqu commented 2 years ago

There may be a BA.5.2 sublineage with ORF1b:T1050N (C16616A), C27012T, C27513T, and E:T9I reverted to E:T9T (T26270C) first detected in Utah, USA.

A simple search finds a primary subtree with several small secondary subtrees due to homoplasy, UShER issues, or branches not visible in GenBank: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_untrimmed-UShER To visualize on UShER: https://nextstrain.org/fetch/github.com/alurqu/pango-designation-support-alurqu/raw/main/2022/09/singleSubtreeAuspice_genome_BA.5.2%2BORF1_1050N%2B27012T%2B27513T%2BNoE_9I_untrimmed.json?branchLabel=aa%20mutations&c=gt-E_9&label=nuc%20mutations%3aG12310A

After filtering out the small subtrees by keeping G12310A and C23854A unreverted and removing sequences with C936T, A1585G, A1587T, A1953G, C4575T, A5475G, C8127T, C11479T, G12793A, A14157G, T16023C, T17989C, T18660C, A22330G, C25528T, A25974G, and G27382C, UShER returns a single subtree: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_trimmed-UShER To visualize on UShER: https://nextstrain.org/fetch/github.com/alurqu/pango-designation-support-alurqu/raw/main/2022/09/singleSubtreeAuspice_genome_BA.5.2%2BORF1_1050N%2B27012T%2B27513T%2BNoE_9I_trimmed.json?branchLabel=aa%20mutations&c=gt-E_9&label=nuc%20mutations:G12310A

As of 2022-09-11, Cov-Spectrum reports 2457 BA.5.2+ORF1b:1050N+27012T+27513T+E:9T sequences with good quality control scores without removing the small secondary subtrees: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_untrimmed-Counts Source: https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?nextcladeQcOverallScoreTo=29&variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&

After removing the small secondary subtrees, as of 2022-09-11 Cov-Spectrum reports 2424 BA.5.2+ORF1b:1050N+27012T+27513T+E:9T sequences with good quality control scores: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_trimmed-Counts Source: https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?nextcladeQcOverallScoreTo=29&variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&
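The trimmed query in the link above follows a regular pattern: a lineage filter, the mutations to keep, and the mutations to exclude (prefixed with `!`), joined by `&` and then percent-encoded. A minimal Python sketch of how such a `variantQuery` value could be assembled is shown below. The helper name `build_variant_query` is hypothetical, and using the standard library's `urllib.parse.quote_plus` for the encoding is an assumption about how the link could be reproduced, not a description of how it was actually generated.

```python
from urllib.parse import quote_plus


def build_variant_query(lineage, required, excluded):
    """Join a lineage filter, required mutations, and excluded mutations.

    Hypothetical helper: mirrors the query syntax visible in the link above.
    """
    terms = [f"nextcladePangoLineage:{lineage}"]
    terms += list(required)
    terms += [f"!{mutation}" for mutation in excluded]
    return " & ".join(terms)


# Mutations that define the proposal, plus the two kept "unreverted".
required = ["ORF1b:T1050N", "C27012T", "C27513T", "E:T9T", "G12310A", "C23854A"]

# Mutations marking the small secondary subtrees that are filtered out.
excluded = [
    "C936T", "A1585G", "A1587T", "A1953G", "C4575T", "A5475G",
    "C8127T", "C11479T", "G12793A", "A14157G", "T16023C", "T17989C",
    "T18660C", "A22330G", "C25528T", "A25974G", "G27382C",
]

query = build_variant_query("BA.5.2", required, excluded)
encoded = quote_plus(query)  # spaces become "+", "&" becomes "%26", "!" becomes "%21"
print(encoded)
```

Keeping the mutation lists as data makes it easier to reuse the same filter across the world, Japan, and United States views without retyping the long query by hand.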

As of 2022-09-11, without trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 7% compared to BA.5.2+ORF1b:T1050N in Japan and 22% in the United States: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_untrimmed-Growth_vs_BA 5 2+ORF1b_1050N-Japan Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_untrimmed-Growth_vs_BA 5 2+ORF1b_1050N Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

As of 2022-09-11, with trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 9% compared to BA.5.2+ORF1b:T1050N in Japan and 30% in the United States: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_trimmed-Growth_vs_BA 5 2+ORF1b_1050N-Japan Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_trimmed-Growth_vs_BA 5 2+ORF1b_1050N Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

As of 2022-09-11, without trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 9% compared to BA.5 in Japan and 36% in the United States: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_untrimmed-Growth_vs_BA 5-Japan Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_untrimmed-Growth_vs_BA 5 Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5*&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

As of 2022-09-11, with trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 9% compared to BA.5 in Japan and 43% in the United States: BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_trimmed-Growth_vs_BA 5-Japan Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_trimmed-Growth_vs_BA 5 Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5*&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&

Note that additional smaller subtrees may continue to arise.

CoV-Spectrum suggests that the first detection occurred in Thailand in Week 18 of 2022.

First GenBank sequence: Utah, USA 2022-06-13

Most Recent GenBank sequence: California, USA and Illinois, USA 2022-08-30

Based on Xia et al https://doi.org/10.1101/2022.02.01.478647 along with https://doi.org/10.1038/s41422-021-00519-4, the E:T9T reversion could be associated with a change in respiratory disease severity.

A zip archive of GenBank-formatted and derived metadata, FASTA files, and UShER output files for the untrimmed and trimmed sets of these sequences is available at Support-BA.5.2_ORF1b_1050N+27012T+27513T+NoE_9I.zip

corneliusroemer commented 2 years ago

Are you sure this is not just a classic reversion to reference artefact?

Some of your sentences make some alarm bells go off:

> A simple search finds a primary subtree with several small secondary subtrees due to homoplasy, UShER issues, or branches not visible in GenBank:

> After filtering out the small subtrees by keeping G12310A and C23854A unreverted and removing sequences with C936T, A1585G, A1587T, A1953G, C4575T, A5475G, C8127T, C11479T, G12793A, A14157G, T16023C, T17989C, T18660C, A22330G, C25528T, A25974G, and G27382C, UShER returns a single subtree:

27513T is normal in BA.5. This showing up in so many countries suggests artefact. Have you looked at how clean these sequences are e.g. in Nextclade?

alurqu commented 2 years ago

@corneliusroemer Can you point me to documentation on how to determine how clean the sequences are using Nextclade? So far I'm filtering by the overall quality control status in the Nextstrain metadata and on the CoV-Spectrum queries.

Regarding 27513T, the UShER tree showed it after ORF1b:T1050N in BA.5.2. If you use the UShER visualization link under the untrimmed tree above and check, the first branch after ORF1ab:T5451N is 27513T. Then the next branch is 27012T. As a note for other readers: ORF1ab:T5451N is an alias for ORF1b:T1050N.
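For readers unfamiliar with the alias, the two names differ only by a fixed numbering offset. The sketch below assumes an offset of 4401 codons between ORF1b and ORF1ab numbering (consistent with the 1050 and 5451 pair quoted above) and assumes that ORF1b codon 1 starts at nucleotide 13468 in the reference annotation; both constants and the function names are illustrative assumptions, not taken from this thread.

```python
# Assumed constants (not stated in the thread):
ORF1A_CODON_OFFSET = 4401  # ORF1ab position = ORF1b position + 4401
ORF1B_START_NT = 13468     # first nucleotide of ORF1b codon 1 (assumed annotation)


def orf1b_to_orf1ab(aa_position):
    """Convert an ORF1b amino-acid position to ORF1ab numbering."""
    return aa_position + ORF1A_CODON_OFFSET


def orf1b_codon_range(aa_position):
    """Return the inclusive (start, end) nucleotide range of an ORF1b codon."""
    start = ORF1B_START_NT + 3 * (aa_position - 1)
    return (start, start + 2)


alias_position = orf1b_to_orf1ab(1050)
codon = orf1b_codon_range(1050)
print(f"ORF1b:T1050N is ORF1ab:T{alias_position}N; codon spans {codon[0]} to {codon[1]}")
```

Under these assumptions the codon range also contains position 16616, which matches the C16616A nucleotide change given for ORF1b:T1050N in the proposal.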

For the frequency of 27513T in BA.5, as of early 14 September 2022 European time and using Nextclade lineage assignments, CoV-Spectrum shows 113621 [BA.5+27513T](https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?variantQuery=nextCladePangoLineage%3ABA.5*+%26+27513T&) sequences. Of these, 111671 are BA.5.2*+ORF1b:T1050N while 1950 are not BA.5.2*+ORF1b:T1050N. So over 98% of BA.5+27513T is on the BA.5.2+ORF1b:T1050N branch. Seeing 27513T in these sequences is not an anomaly.
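The "over 98%" figure can be checked directly from the three counts quoted above. This is only a worked arithmetic sketch; the variable names are illustrative.

```python
# Counts quoted from CoV-Spectrum in the comment above.
total_with_27513t = 113621
on_branch = 111671   # BA.5.2*+ORF1b:T1050N
off_branch = 1950    # not BA.5.2*+ORF1b:T1050N

# The two groups should partition the total.
counts_consistent = (on_branch + off_branch == total_with_27513t)

share_on_branch = on_branch / total_with_27513t
print(f"consistent: {counts_consistent}, share on branch: {share_on_branch:.1%}")
```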

Considering whether this E:I9T reversion might be an artefact: although the tree is smaller, a similar structure appears in the single subtree view for BA.5.2*+ORF1b:1050N+27513T+27012T+S:346I. Would you consider S:346I a likely artefact as well? BA 5 2+ORF1b_1050N+S_346I_27012T+27513T_untrimmed_20220913

Additionally, with the large number of countries that have found these sequences there is a diversity of labs performing the sampling. Why would the different labs start producing the same artefact, and why would that occur first in one country and then later in others? Even if the labs were using a problematic primer, would it show that pattern? Or might that pattern be more consistent with a highly-transmissible virus transmitted through international travel?

Edit: Corrected "not BA.5.2*+ORF1b:T1050N" link.

alurqu commented 2 years ago

If I read the primer information at https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019 correctly, the ARTIC V4 and V4.1 SARS-CoV-2 primer SARS-CoV-2_88_LEFT covers 26255 to 26277 which includes E:T9's 26269 to 26271 and especially 26270 where an E:T9 reversion would be seen. If I'm correct, this seems to increase the probability that this reversion is a sequencing artefact.
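The overlap described above can be checked with a few lines of Python. This sketch assumes the E gene starts at nucleotide 26245 in the reference annotation (an assumption, chosen because it reproduces the 26269 to 26271 codon range quoted above) and treats the primer coordinates as an inclusive range exactly as stated in the comment; the function names are hypothetical.

```python
E_GENE_START_NT = 26245  # assumed start of the E gene in the reference annotation


def codon_range(gene_start, aa_position):
    """Return the inclusive (start, end) nucleotide range of a codon."""
    start = gene_start + 3 * (aa_position - 1)
    return (start, start + 2)


def ranges_overlap(a, b):
    """True if two inclusive (start, end) ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


primer_88_left = (26255, 26277)                # coordinates as quoted in the comment
e_t9_codon = codon_range(E_GENE_START_NT, 9)   # codon for E:T9
reversion_site = 26270                         # T26270C, the E:T9I to E:T9T reversion

codon_in_primer = ranges_overlap(primer_88_left, e_t9_codon)
site_in_primer = primer_88_left[0] <= reversion_site <= primer_88_left[1]
print(e_t9_codon, codon_in_primer, site_in_primer)
```

If the site lies inside a primer binding region, the base reported there can reflect the primer sequence rather than the sample, which is one way a reversion to reference could appear as an artefact.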

FedeGueli commented 2 years ago

Has the S:346I mentioned above already been proposed? @alurqu

alurqu commented 2 years ago

@FedeGueli Since you asked, I've now opened Issue 1074 for the biggest BA.5.2+ORF1b:T1050N+S:R346I branch that I mentioned above. I'll note that there may be other BA.5.2+ORF1b:T1050N+S:R346I branches developing.

alurqu commented 2 years ago

@corneliusroemer Based on your comment above, I ran the larger untrimmed sequence set through Nextclade (which I had previously stopped using for some reason). Using that tool I get a single subtree:

BA 5 2+ORF1b_1050N+27012T+27513T+NoE_9I_untrimmed-nextclade_tree zoomed in from https://nextstrain.org/fetch/github.com/alurqu/pango-designation-support-alurqu/raw/main/2022/09/nextcladeAuspice_BA.5.2%2BORF1b_1050N%2B27012T%2B27513T%2BNoE_9I_untrimmed.json?branchLabel=aa%20mutations&c=gt-E_9&label=nuc%20mutations:G12310A

I have the Nextclade results zip file and attempted to attach it here, but it was too large for GitHub. Please let me know if one or more files from the Nextclade results would be helpful.

FYI I've been using UCSC UShER to check the phylogenetic tree. Something about these sequences may be causing difficulty for UCSC UShER.

As to whether Nextclade shows these sequences to be "clean", most of the sequences show all good (green) on the quality control checks although 3 show mediocre (orange) on the "Missing Data" check and 11 others show mediocre (orange) on the "Private Mutations" check.

corneliusroemer commented 2 years ago

This is pretty clearly an artefact. Just defined by a reversion - which appears distributed over lineages in that part of the tree.

Will close as a result. I recommend staying away from reversions as lineage-defining mutations; they are almost always artefacts.


Also please don't post so many growth advantages - they clutter the proposal and make it unreadable without making it more convincing.