Closed alurqu closed 2 years ago
Are you sure this is not just a classic reversion to reference artefact?
Some of your sentences make some alarm bells go off:
A simple search finds a primary subtree with several small secondary subtrees due to homoplasy, UShER issues, or branches not visible in GenBank:
After filtering out the small subtrees by keeping G12310A and C23854A unreverted and removing sequences with C936T, A1585G, A1587T, A1953G, C4575T, A5475G, C8127T, C11479T, G12793A, A14157G, T16023C, T17989C, T18660C, A22330G, C25528T, A25974G, and G27382C, UShER returns a single subtree:
27513T is normal in BA.5. The fact that this shows up in so many countries suggests an artefact. Have you looked at how clean these sequences are, e.g. in Nextclade?
@corneliusroemer Can you point me to documentation on how to determine how clean the sequences are using Nextclade? So far I'm filtering by the overall quality control status in the Nextstrain metadata and on the CoV-Spectrum queries.
Regarding 27513T, the UShER tree showed it after ORF1b:T1050N in BA.5.2. If you use the UShER visualization link under the untrimmed tree above and check, the first branch after ORF1ab:T5451N is 27513T. Then the next branch is 27012T. As a note for other readers: ORF1ab:T5451N is an alias for ORF1b:T1050N.
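The alias follows from ORF1ab numbering running through ORF1a before it reaches ORF1b. A minimal sketch of the conversion, assuming ORF1a contributes 4401 codons in the Wuhan-Hu-1 reference annotation (that offset and the function name are my assumptions, not something stated in this thread):

```python
# Assumption: ORF1a accounts for the first 4401 codons of ORF1ab
# in the Wuhan-Hu-1 reference annotation.
ORF1A_CODONS = 4401


def orf1ab_to_orf1b(position: int) -> int:
    """Convert an ORF1ab codon position to ORF1b numbering (hypothetical helper)."""
    if position <= ORF1A_CODONS:
        raise ValueError("position lies in ORF1a, not ORF1b")
    return position - ORF1A_CODONS


print(orf1ab_to_orf1b(5451))  # 1050
```

Under that assumption, ORF1ab:T5451N maps to ORF1b:T1050N, which matches the alias given above.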
For the frequency of 27513T in BA.5, as of early 14 September 2022 European time and using Nextclade lineage assignments, CoV-Spectrum shows 113621 [BA.5+27513T](https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?variantQuery=nextCladePangoLineage%3ABA.5*+%26+27513T&) sequences. Of these, 111671 are BA.5.2*+ORF1b:T1050N while 1950 are not BA.5.2*+ORF1b:T1050N. So over 98% of BA.5+27513T is on the BA.5.2+ORF1b:T1050N branch. Seeing 27513T in these sequences is not an anomaly.
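As a quick check of that share, using only the counts quoted above (the variable names are illustrative):

```python
total = 113621      # BA.5* + 27513T sequences reported by CoV-Spectrum
on_branch = 111671  # of these, BA.5.2* + ORF1b:T1050N
off_branch = 1950   # of these, not BA.5.2* + ORF1b:T1050N

# The two subsets should account for every sequence in the total.
assert on_branch + off_branch == total

share = on_branch / total
print(f"{share:.1%}")  # 98.3%
```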
Considering whether this E:I9T reversion might be an artefact: although the tree is smaller, a similar structure appears in the single subtree view for BA.5.2*+ORF1b:1050N+27513T+27012T+S:346I. Would you consider S:346I a likely artefact as well?
Additionally, given the large number of countries that have found these sequences, a diversity of labs is performing the sampling. Why would the different labs start producing the same artefact, and why would that occur first in one country and then later in others? Even if the labs were using a problematic primer, would it show that pattern? Or might that pattern be more consistent with a highly transmissible virus spread through international travel?
Edit: Corrected "not BA.5.2*+ORF1b:T1050N" link.
If I read the primer information at https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019 correctly, the ARTIC V4 and V4.1 SARS-CoV-2 primer SARS-CoV-2_88_LEFT covers 26255 to 26277 which includes E:T9's 26269 to 26271 and especially 26270 where an E:T9 reversion would be seen. If I'm correct, this seems to increase the probability that this reversion is a sequencing artefact.
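To double-check the coordinates, here is a minimal sketch that derives the codon span for E:9 and tests it against the primer interval quoted above. It assumes the E gene starts at position 26245 (1-based) in the Wuhan-Hu-1 reference; that start coordinate and the helper names are my assumptions, not taken from the primer scheme files.

```python
# Assumption: E gene start in the Wuhan-Hu-1 reference, 1-based.
E_GENE_START = 26245
# Primer interval as quoted in the comment above (inclusive).
PRIMER_88_LEFT = (26255, 26277)


def codon_span(gene_start: int, codon: int) -> tuple[int, int]:
    """Return the inclusive nucleotide span of a codon (hypothetical helper)."""
    first = gene_start + 3 * (codon - 1)
    return first, first + 2


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


span = codon_span(E_GENE_START, 9)
print(span)                            # (26269, 26271)
print(overlaps(span, PRIMER_88_LEFT))  # True
```

This only confirms that the codon lies inside the quoted primer interval. It says nothing about whether primer trimming was applied to any particular sequence.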
Has the S:346I mentioned above already been proposed? @alurqu
@FedeGueli Since you asked, I've now opened Issue 1074 for the biggest BA.5.2+ORF1b:T1050N+S:R346I branch that I mentioned above. I'll note that there may be other BA.5.2+ORF1b:T1050N+S:R346I branches developing.
@corneliusroemer Based on your comment above, I ran the larger untrimmed sequence set through Nextclade (which I had previously stopped using for some reason). Using that tool I get a single subtree:
I have the Nextclade results zip file and attempted to attach it here, but it was too large for GitHub. Please let me know if one or more files from the Nextclade results would be helpful.
FYI I've been using UCSC UShER to check the phylogenetic tree. Something about these sequences may be causing difficulty for UCSC UShER.
As to whether Nextclade shows these sequences to be "clean", most of the sequences show all good (green) on the quality control checks although 3 show mediocre (orange) on the "Missing Data" check and 11 others show mediocre (orange) on the "Private Mutations" check.
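For anyone wanting to reproduce that tally from a Nextclade TSV export, a minimal sketch follows. The column names `qc.missingData.status` and `qc.privateMutations.status` are assumed to match the Nextclade output, and the sample rows are hypothetical placeholders rather than the sequences discussed here.

```python
import csv
import io
from collections import Counter


def tally_qc(tsv_text: str, columns: list[str]) -> dict[str, Counter]:
    """Count status values per QC column in tab-separated text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {column: Counter() for column in columns}
    for row in reader:
        for column in columns:
            counts[column][row[column]] += 1
    return counts


# Hypothetical sample data; real input would be the Nextclade TSV file contents.
SAMPLE = (
    "seqName\tqc.missingData.status\tqc.privateMutations.status\n"
    "hypothetical_1\tgood\tgood\n"
    "hypothetical_2\tmediocre\tgood\n"
    "hypothetical_3\tgood\tmediocre\n"
)

QC_COLUMNS = ["qc.missingData.status", "qc.privateMutations.status"]
for column, counter in tally_qc(SAMPLE, QC_COLUMNS).items():
    print(column, dict(counter))
```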
This is pretty clearly an artefact. It is defined only by a reversion, which appears distributed across lineages in that part of the tree.
Will close as a result. I recommend staying away from reversions as lineage-defining mutations; they are almost always an artefact.
Also please don't post so many growth advantages - they clutter the proposal and make it unreadable without making it more convincing.
There may be a BA.5.2 sublineage with ORF1b:T1050N (C16616A), C27012T, C27513T, and E:T9I reverted to E:T9T (T26270C) first detected in Utah, USA.
A simple search finds a primary subtree with several small secondary subtrees due to homoplasy, UShER issues, or branches not visible in GenBank: To visualize on UShER: https://nextstrain.org/fetch/github.com/alurqu/pango-designation-support-alurqu/raw/main/2022/09/singleSubtreeAuspice_genome_BA.5.2%2BORF1_1050N%2B27012T%2B27513T%2BNoE_9I_untrimmed.json?branchLabel=aa%20mutations&c=gt-E_9&label=nuc%20mutations%3aG12310A
After filtering out the small subtrees by keeping G12310A and C23854A unreverted and removing sequences with C936T, A1585G, A1587T, A1953G, C4575T, A5475G, C8127T, C11479T, G12793A, A14157G, T16023C, T17989C, T18660C, A22330G, C25528T, A25974G, and G27382C, UShER returns a single subtree: To visualize on UShER: https://nextstrain.org/fetch/github.com/alurqu/pango-designation-support-alurqu/raw/main/2022/09/singleSubtreeAuspice_genome_BA.5.2%2BORF1_1050N%2B27012T%2B27513T%2BNoE_9I_trimmed.json?branchLabel=aa%20mutations&c=gt-E_9&label=nuc%20mutations:G12310A
As of 2022-09-11, Cov-Spectrum reports 2457 BA.5.2+ORF1b:1050N+27012T+27513T+E:9T sequences with good quality control scores without removing the small secondary subtrees: Source: https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?nextcladeQcOverallScoreTo=29&variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&
After removing the small secondary subtrees, as of 2022-09-11 Cov-Spectrum reports 2424 BA.5.2+ORF1b:1050N+27012T+27513T+E:9T sequences with good quality control scores: Source: https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?nextcladeQcOverallScoreTo=29&variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&
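The long filtered query can be assembled from the two mutation lists rather than typed by hand. A minimal sketch, assuming the URL layout visible in the Source link above; the helper names are hypothetical, and the encoding relies only on the standard library's `quote_plus`:

```python
from urllib.parse import quote_plus

BASE = "https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants"

# Terms that must be present, taken from the query above.
REQUIRED = [
    "nextcladePangoLineage:BA.5.2", "ORF1b:T1050N", "C27012T", "C27513T",
    "E:T9T", "G12310A", "C23854A",
]
# Mutations marking the small secondary subtrees, to be excluded.
EXCLUDED = [
    "C936T", "A1585G", "A1587T", "A1953G", "C4575T", "A5475G", "C8127T",
    "C11479T", "G12793A", "A14157G", "T16023C", "T17989C", "T18660C",
    "A22330G", "C25528T", "A25974G", "G27382C",
]


def build_variant_query(required, excluded):
    """Join required terms and negated excluded terms with ' & '."""
    return " & ".join(list(required) + [f"!{m}" for m in excluded])


def build_url(required, excluded, qc_score_to=29):
    """Build a query URL in the same shape as the Source link above."""
    query = quote_plus(build_variant_query(required, excluded))
    return f"{BASE}?nextcladeQcOverallScoreTo={qc_score_to}&variantQuery={query}&"


print(build_url(REQUIRED, EXCLUDED))
```

Keeping the exclusions in a list makes it easier to add further mutations if more small subtrees appear.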
As of 2022-09-11, without trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 7% compared to BA.5.2+ORF1b:T1050N in Japan and 22% in the United States: Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
As of 2022-09-11, with trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 9% compared to BA.5.2+ORF1b:T1050N in Japan and 30% in the United States: Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
As of 2022-09-11, without trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 9% compared to BA.5 in Japan and 36% in the United States: Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5*&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
As of 2022-09-11, with trimming out small secondary subtrees, and considering only sequences with good quality control scores, Cov-Spectrum calculates a growth advantage of 9% compared to BA.5 in Japan and 43% in the United States: Source: https://cov-spectrum.org/explore/Japan/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
Source: https://cov-spectrum.org/explore/United%20States/AllSamples/AllTimes/variants?variantQuery=nextcladePangoLineage%3ABA.5*&variantQuery1=nextcladePangoLineage%3ABA.5.2+%26+ORF1b%3AT1050N+%26+C27012T+%26+C27513T+%26+E%3AT9T+%26+G12310A+%26+C23854A+%26+%21C936T+%26+%21A1585G+%26+%21A1587T+%26+%21A1953G+%26+%21C4575T+%26+%21A5475G+%26+%21C8127T+%26+%21C11479T+%26+%21G12793A+%26+%21A14157G+%26+%21T16023C+%26+%21T17989C+%26+%21T18660C+%26+%21A22330G+%26+%21C25528T+%26+%21A25974G+%26+%21G27382C&analysisMode=CompareToBaseline&nextcladeQcOverallScoreTo=29&
Note that additional smaller subtrees may continue to arise.
CoV-Spectrum suggests that the first detection occurred in Thailand in Week 18 of 2022.
First GenBank sequence: Utah, USA 2022-06-13
Most Recent GenBank sequence: California, USA and Illinois, USA 2022-08-30
Based on Xia et al https://doi.org/10.1101/2022.02.01.478647 along with https://doi.org/10.1038/s41422-021-00519-4, the E:T9T reversion could be associated with a change in respiratory disease severity.
A zip archive of GenBank-formatted and derived metadata and FASTA files plus UShER output files for these untrimmed and trimmed sets of these sequences is available at Support-BA.5.2_ORF1b_1050N+27012T+27513T+NoE_9I.zip