S:G142D is nearly universal in Delta, but often miscalled; how to fix Nextstrain cladistics?

mkedwards commented 3 years ago

This has been commented on in several lineage requests, but probably deserves its own issue.

From footnote to Table 13 of https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/994839/Variants_of_Concern_VOC_Technical_Briefing_16.pdf:

"Note that G142D is in a part of the genome with consistently reduced coverage in the Delta variant (due to the lineage-defining deletion from position 22029-22035, which affects one of the PCR primer sites in the ARTIC v3 protocol). While it is only reported as detected in ~60% of sequences, the remaining 40% of sequences are almost all “N” at that position (the code for “insufficient data”), rather than being confirmed ”G” (the reference allele). As the mutation occurred early in the history of the lineage the majority of sequences (>99%) in this lineage can be assumed to harbour the mutation."

Miscalls of S:142 as the Wuhan-reference G rather than the Delta-reference D are making a hash of the Nextstrain clade visualization. I don't know whether fixing that is within the scope of pango-designation, but maybe raising awareness here will help?
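
To make the footnote's distinction concrete (an "N" is missing data, not a confirmed reference "G"), here is a minimal sketch that tallies calls at a single alignment column. The function name `tally_site`, the toy sequences, and the column index are all hypothetical; for real data you would point it at the nucleotide behind G142D, which I believe is usually given as G21987A, but please check that against your own reference coordinates.

```python
from collections import Counter

def tally_site(sequences, index, ref_base, alt_base):
    """Count ref / alt / ambiguous / other calls at one 0-based alignment column."""
    counts = Counter()
    for seq in sequences:
        base = seq[index].upper()
        if base == alt_base:
            counts["alt"] += 1
        elif base == ref_base:
            counts["ref"] += 1
        elif base in "ACGT":
            counts["other"] += 1
        else:
            # N, gaps, and IUPAC ambiguity codes all mean "no confident call"
            counts["ambiguous"] += 1
    return counts

# Toy aligned sequences (hypothetical); column 2 stands in for the site of interest.
toy = ["ACATG", "ACATG", "ACNTG", "ACNTG", "ACGTG"]
counts = tally_site(toy, 2, ref_base="G", alt_base="A")
print(dict(counts))
```

If the "ref" count stays small while "ambiguous" is large, that matches the pattern described in the briefing: low coverage rather than a true reversion.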

rmcolq commented 3 years ago

I may have misunderstood your issue, but I think it is about an optimization/bug in Nextstrain. Unfortunately, Nextstrain operates entirely independently of pangolin and pango-designation, so updating the pango-lineage assignment of your samples will not change anything in Nextstrain. You could try the discussion forum at https://discussion.nextstrain.org/ and flag it there.

mkedwards commented 3 years ago

I'm asking largely for the advice of the professionals here who have called out this G142D primer issue in other lineage requests. Is there a well-understood solution to this, such as a parameter that could be added to the data reduction pipeline used by Nextstrain? Or is this a fundamental problem with algorithmically feasible classification / lineage modeling, similar to the one posed by indels?

rmcolq commented 3 years ago

For the Delta miscalls in pangolin, we use scorpio (https://github.com/cov-lineages/scorpio) as a backup. It types specifically at the sites which define the Delta "constellation" (https://github.com/cov-lineages/constellations/blob/main/constellations/definitions/cB.1.617.2.json) and decides whether the genome should be classified based on the slightly looser rules "min_alt": 5 and "max_ref": 3. That is, at each of the sites the genome is classified as ref, alt, ambig (e.g. includes Ns) or other, and we call it Delta if at least 5 of the alt alleles have been identified and there are no more than 3 ref calls. This seems to be good enough most of the time for us to classify, but doesn't help so much when it comes to tree building. For tree building in the COG-UK phylogenetics pipeline I do mask some problematic sites (I'm not sure whether we mask that one, but I have also sometimes had problems with Delta forming clades in 2 places on my trees).
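
As a rough illustration of the rule described above (not scorpio's actual code), the sketch below labels each defining site as ref, alt, ambig or oth and then applies the "min_alt": 5 and "max_ref": 3 thresholds. The site list, positions, and function names are hypothetical placeholders, not the real constellation definition.

```python
def classify_site(base, ref, alt):
    """Label one base call relative to a defining site."""
    base = base.upper()
    if base == alt:
        return "alt"
    if base == ref:
        return "ref"
    if base in "ACGT":
        return "oth"
    return "ambig"  # N or any other ambiguity code

def matches_constellation(calls, sites, min_alt=5, max_ref=3):
    """calls: {position: base}; sites: [(position, ref, alt), ...].

    Positions missing from `calls` are treated as N (ambiguous).
    """
    labels = [classify_site(calls.get(pos, "N"), ref, alt) for pos, ref, alt in sites]
    return labels.count("alt") >= min_alt and labels.count("ref") <= max_ref

# Eight made-up defining sites (hypothetical positions and alleles).
sites = [(pos, "G", "A") for pos in range(1, 9)]

# Five alt calls, one ref call, two sites with no coverage: still called.
good = {1: "A", 2: "A", 3: "A", 4: "A", 5: "A", 6: "G", 7: "N"}
# Only four alt calls, the rest uncovered: not called.
poor = {1: "A", 2: "A", 3: "A", 4: "A"}

print(matches_constellation(good, sites))
print(matches_constellation(poor, sites))
```

The point of counting ambig separately is that an uncovered site such as S:142 neither helps nor hurts the call, whereas a confident ref call counts against it.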

mkedwards commented 3 years ago

And while I'm asking about obstacles to accurate cladistics: @rmcolq do you have a good solution for insertions, which apparently some toolchains identify and others don't? I'm particularly interested in this apparent nine-nucleotide insertion in S in the A.2.5 lineage:

CCUAUUAAUUUAGUGCGUG 643 (22205) / CGGCAGGCU / 644 (22206) AUCUCCCUCAGGGUUUU
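
For anyone who wants to see what this insertion does at the protein level, here is a small sketch that translates the snippet above with and without the nine inserted nucleotides. It assumes that 643 is a 1-based coordinate within the S gene (so the snippet's first base, S position 625, starts codon 209) and uses the standard genetic code; the variable and function names are just for illustration.

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def translate(rna):
    """Translate an RNA/DNA string in frame 0, ignoring any incomplete final codon."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        a, b, c = (BASES.index(x) for x in dna[i:i + 3])
        protein.append(AMINO[16 * a + 4 * b + c])
    return "".join(protein)

# Sequence context copied from the line above; frame is an assumption (see lead-in).
upstream = "CCUAUUAAUUUAGUGCGUG"    # ends at S position 643 (genome 22205)
inserted = "CGGCAGGCU"              # the nine-nucleotide insertion
downstream = "AUCUCCCUCAGGGUUUU"    # starts at S position 644 (genome 22206)

reference_peptide = translate(upstream + downstream)
inserted_peptide = translate(upstream + inserted + downstream)
print(reference_peptide)
print(inserted_peptide)
```

Under those assumptions the snippet reads PINLVRDLPQGF without the insertion and PINLVRAAGYLPQGF with it, i.e. AAG appears after residue 214, which seems consistent with the S:ins214AAG label. Because the insertion falls after the first base of codon 215, that codon also changes from D to Y in this sketch.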

This seems to be one of several insertions that have been observed frequently enough that we can be fairly certain they are real, according to the analysis in https://www.biorxiv.org/content/10.1101/2021.04.17.440288v2.full. From that paper:

This issue was particularly evident for both A.2.5 (and sublineages) and B.1.214.2, since of [sic] relevant number of GISAID entries lacked the expected lineage-defining insertions at RIR1, despite the shared ancestry of all genotypes (Figures 4 and 5). In detail, as of May 1st 2021, only 63% of the A.2.5 genomes carry S:ins214AAG, and only 76% of the B.1.214.2 genomes carry S:ins214TDR. These fraction of deposited genomes lacking insertions at RIR1 remains very high even if we only take into account complete, high quality genomes (i.e. 15% and 34% for A.2.5 and B.1.214.2, respectively), indicating that these artefacts are not linked with low sequencing coverage. As a striking example, just 41 out of the 193 B.214.2 genomes sequenced in Switzerland correctly report the insertion at RIR1. These entries have been submitted to GISAID by the university hospitals of Basel and Geneva, which most likely use in their routine genome analyses insertion-aware tools. On the other hand, all the B.214.2 genomes sequenced in Switzerland that lack the insertion were deposited by the same institution, i.e. ETH Zürich, which uses V-pipe [68], a tool that in its current configuration for SARS-CoV-2 variant calling disregards the possibility of insertions compared with the reference genome sequence.

Clearly, similar issues may affect other SARS-CoV-2 lineages carrying insertion mutations. For example lineage AT.1, a variant carrying an unusual insertion of four codons, close to the polybasic furin-like cleavage site (position 679), has been recently reported in Russia, and it is presently considered as a variant under monitoring by the ECDC (https://www.ecdc.europa.eu/en/covid-19/variants-concern), due to the contemporary presence of E484K.

I'm an amateur and unlikely to be able to contribute meaningfully to investigating this, but maybe flagging the issue helps a little? If you've got an account on virological.org, I'd sure like to know what William Gallaher (profbillg1901) observes in the surrounding context of this "recurrent insertion region". (Does the GUGCGUG just upstream of the insertion seem likely to count as a "short palindrome" in the sense discussed in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7394270/ ?)

cov-lineages / pango-designation

S:G142D is nearly universal in Delta, but often miscalled; how to fix Nextstrain cladistics? #117