cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme

potential new BA.2 sublineage from Germany #443

Closed markusglass closed 2 years ago

markusglass commented 2 years ago

Hi all,

We recently sequenced 12 samples that pangolin could not assign to a lineage (lineage: None). They all share many signature mutations with BA.2; 7 of them in particular appear to be very close to BA.2 (the pangolin report says: Omicron (Unassigned); scorpio replaced lineage assignment BA.2). Besides the BA.2-specific mutations, all 12 carry the ORF7b T40I mutation, which so far I have only found described for Delta variants, not for Omicron.

Unfortunately, I don't know how to create the nice phylogenetic trees I've seen in other issues for my sequences. However, the sequences are already available on GISAID:

EPI_ISL_9862285 EPI_ISL_9862286 EPI_ISL_9862283 EPI_ISL_9862294 EPI_ISL_9862284 EPI_ISL_9862292 EPI_ISL_9862293 EPI_ISL_9862290 EPI_ISL_9862291 EPI_ISL_9862289 EPI_ISL_9862287 EPI_ISL_9862288

The 7 sequences that are closer to BA.2 are:

EPI_ISL_9862283 EPI_ISL_9862284 EPI_ISL_9862290 EPI_ISL_9862291 EPI_ISL_9862292 EPI_ISL_9862293 EPI_ISL_9862294

The samples were all collected between mid- and late January in Saxony-Anhalt, Germany. I used pangolin_version 3.1.20, pangoLEARN_version 2022-02-02 and pango_version v1.2.124.

agolsby commented 2 years ago

Hello, often these trees are generated by running the sequences through UShER or Nextclade.

You can upload fasta files at clades.nextstrain.org. The "Tree" button is at the top right of the results page.

For UShER, check the web app hosted here: http://genome.ucsc.edu/cgi-bin/hgPhyloPlace

You can upload a fasta file here too. But you also have the option to paste GISAID or NCBI accession numbers into the text box as shown below.

UShER submission page: [image: UShER how-to]

You'll reach a results page that shows the Nextstrain clade and Pangolin lineage assignments generated by UShER tree placement. You'll also see the assignments generated by pangoLEARN.

You'll notice that this results page has a warning at the top because some of the more recent sequences are not yet available. To avoid this delay, you would have to upload the sequences as a FASTA instead of pasting in their accession IDs.

Results page: [image: UShER results]

Getting to the heart of your post, I do notice that their pangolin is calling these samples as BA.2 rather than "None". That's interesting.

To check the trees, go to the two buttons above the results list. The first shows where your sequences fit in the broader SARS-CoV-2 phylogeny.

Nextstrain global tree placement: [image: Nextstrain results]

To see a subtree focused on your samples, click the "View in Nextstrain" links in the far-right column of the results table. Heavily polyphyletic uploads will be split into multiple subtrees. Here, there's only one because these samples are pretty closely related.

Intriguingly, the subtree reveals a sample identical to one of yours: Denmark/DCGC-298335/2021|EPI_ISL_8459852|2021-12-29

[image: Denmark sample]

As for those India samples, I'm not confident that they're relevant. I've set the branch labels to show back-mutations. This makes it easier to see branches where UShER may be concocting spurious reversions for the sake of maximum parsimony.

P.S. I couldn't tell you why the Nextstrain tree only shows 7 of the 14 sequences UShER parsed, even in the subtree. Might be too off topic for this thread.

AngieHinrichs commented 2 years ago

Thanks @agolsby for the nice tutorial!

> Getting to the heart of your post, I do notice that their pangolin is calling these samples as BA.2 rather than "None". That's interesting.

Ah, our pangolin assignments need to be rerun since the Feb. 9th releases of pangoLEARN and constellations. Working on that now.

> You'll notice that this results page has a warning at the top because some of the more recent sequences are not yet available. To avoid this delay, you would have to upload the sequences as a FASTA instead of pasting in their accession IDs.

There are some other, increasingly common reasons why sequences may not be present in the UCSC/UShER tree:

EPI_ISL_9862285-EPI_ISL_9862289 were all excluded because they fit the profile of contaminated/mixture(/recombinant) of Omicron and something else. Nextclade placed those sequences on a branch in 21J (Delta), with 11-12 reversions to reference each and 25-27 labeledSubstitutions each, most of which were for Omicron (21K, 21L, 21M). Having a non-Omicron assigned clade, and >5 Omicron-associated labeledSubstitutions, got them excluded from the tree.

EPI_ISL_9862285-EPI_ISL_9862289 all have 53-55 substitutions relative to the Wuhan/Hu-1 reference.

EPI_ISL_9862283, EPI_ISL_9862284, and EPI_ISL_9862290-EPI_ISL_9862294 have 68-81 substitutions relative to the reference. Nextclade placed them on 21L (Omicron) and found only one reversion to reference in each, so they were not excluded. Nextclade also reported 2-9 labeledSubstitutions associated with 21J, but I don't currently have a filter for that. If I add one with my favorite arbitrary threshold of 5, that would reject EPI_ISL_9862290 and EPI_ISL_9862291 but keep the others in this group.
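The exclusion rule described above can be written down as a small sketch. This is not the actual UCSC/UShER filtering code: the function name, its parameters, and the example counts are hypothetical, and it assumes the proposed 21J filter would use the same strict "more than 5" comparison as the existing Omicron one. The thread only gives ranges of labeledSubstitutions counts (25-27 and 2-9), so the inputs below are illustrative values from within those ranges, not per-sample results.

```python
# Sketch of the exclusion rule described in the comment above.
# All names are hypothetical; this is not the real UCSC pipeline code.

OMICRON_CLADES = {"21K", "21L", "21M"}  # Nextclade clades named in the comment
THRESHOLD = 5                           # the "favorite arbitrary threshold"


def should_exclude(clade, omicron_labeled, delta_labeled=0, filter_delta=False):
    """Return True if a sequence fits the contamination/mixture profile.

    clade           -- Nextclade clade assigned to the sequence
    omicron_labeled -- number of Omicron-associated labeledSubstitutions
    delta_labeled   -- number of 21J-associated labeledSubstitutions
    filter_delta    -- also apply the proposed (not yet existing) 21J filter
    """
    is_omicron = clade in OMICRON_CLADES
    # Existing rule: non-Omicron clade with >5 Omicron-associated substitutions.
    if not is_omicron and omicron_labeled > THRESHOLD:
        return True
    # Proposed rule (assumption: same strict comparison, mirrored for 21J).
    if filter_delta and is_omicron and delta_labeled > THRESHOLD:
        return True
    return False


# Illustrative inputs taken from the ranges quoted in the thread.
print(should_exclude("21J", omicron_labeled=25))                                   # True
print(should_exclude("21L", omicron_labeled=0, delta_labeled=9))                   # False
print(should_exclude("21L", omicron_labeled=0, delta_labeled=9, filter_delta=True))  # True
print(should_exclude("21L", omicron_labeled=0, delta_labeled=2, filter_delta=True))  # False
```

Keeping the proposed 21J check behind a flag mirrors the situation in the comment: the first rule is in use, while the second is only being considered.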

> the Nextstrain tree only shows 7 of the 14 sequences UShER parsed

I think that's because some IDs appear more than once in the lists above -- the second group of 7 IDs is a subset of the first group. There are 12 unique IDs, and 5 were not found in the tree. The output table shows 7 sequences twice each (no checking for duplicates), but the tree is showing them uniquely.
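The counting in this explanation can be checked with a short sketch. It uses the two ID lists exactly as posted in the issue and the five IDs reported as excluded from the tree; the variable names are only for illustration, and it assumes both lists were pasted together into the accession box.

```python
from collections import Counter

# The two lists from the original post (the second is a subset of the first).
all_samples = """EPI_ISL_9862285 EPI_ISL_9862286 EPI_ISL_9862283 EPI_ISL_9862294
EPI_ISL_9862284 EPI_ISL_9862292 EPI_ISL_9862293 EPI_ISL_9862290
EPI_ISL_9862291 EPI_ISL_9862289 EPI_ISL_9862287 EPI_ISL_9862288""".split()
closer_samples = """EPI_ISL_9862283 EPI_ISL_9862284 EPI_ISL_9862290 EPI_ISL_9862291
EPI_ISL_9862292 EPI_ISL_9862293 EPI_ISL_9862294""".split()

# Assumption: both lists were pasted together into the accession text box.
pasted = all_samples + closer_samples

# EPI_ISL_9862285 through EPI_ISL_9862289 were reported as excluded from the tree.
excluded = {f"EPI_ISL_{n}" for n in range(9862285, 9862290)}

counts = Counter(pasted)
unique_ids = sorted(counts)                                # each ID once
table_rows = [i for i in pasted if i not in excluded]      # no duplicate check
tree_tips = sorted(set(table_rows))                        # the tree shows each once

print(len(pasted), len(unique_ids), len(table_rows), len(tree_tips))
print(sorted(i for i, n in counts.items() if n > 1) == tree_tips)
```

Under that assumption, 19 pasted entries collapse to 12 unique IDs, the table lists 14 rows, and the tree shows 7 tips, which matches the numbers in the thread.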

I uploaded fasta for the 5 excluded sequences to the UShER web interface, and they were placed on a branch that is in Delta, but is riddled with reversions to reference, a warning sign for issues with sequencing and assembly:

https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/pango-designation-443.json?branchLabel=back-mutations&label=nuc%20mutations:G210T,C241T,C3037T,G4181T,C7124T,C8986T,G9053T,C14408T,C16466T,T22917G,C22995A,A23403G,C23604G,C25469T,T26767C,C27874T,A28461G,G28881T,G29402T,G29742T

Another warning sign is the wide variety of Nextclade assignments for sequences in this little branch; many of the sequences also have quite a few private mutations.

[image]

A good candidate for manual pruning (or if I really get tired of seeing these mini-rainbow back-mutation-party branches, perhaps @theosanderson's treeShears).
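For readers who want to see which branch the link above points at without opening it, the query string can be unpacked with the Python standard library. This is only a sketch of reading the two query parameters present in that URL (`branchLabel` and `label`); it does not fetch or interpret the tree itself.

```python
from urllib.parse import urlsplit, parse_qs

# The Nextstrain link from the comment above, split across lines for readability.
url = (
    "https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/"
    "pango-designation-443.json"
    "?branchLabel=back-mutations"
    "&label=nuc%20mutations:G210T,C241T,C3037T,G4181T,C7124T,C8986T,G9053T,"
    "C14408T,C16466T,T22917G,C22995A,A23403G,C23604G,C25469T,T26767C,C27874T,"
    "A28461G,G28881T,G29402T,G29742T"
)

# parse_qs percent-decodes values, so "%20" becomes a space.
query = parse_qs(urlsplit(url).query)

branch_label = query["branchLabel"][0]
label_name, _, mutation_list = query["label"][0].partition(":")
mutations = mutation_list.split(",")

# Each entry has the form <ref base><position><alt base>, e.g. G210T.
positions = [int(m[1:-1]) for m in mutations]

print(branch_label)                   # which branch labels the view displays
print(label_name, len(mutations))     # the branch is selected by its nuc mutations
print(positions[0], positions[-1])
```

The `branchLabel=back-mutations` setting is the same display option mentioned earlier in the thread for spotting branches with suspicious reversions.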

There's more to say about distinguishing between mixtures/contamination vs. true recombinants, but this is too long already and I'll leave it to others with more expertise in that.

markusglass commented 2 years ago

Thank you @agolsby and @AngieHinrichs for your detailed explanations and hints! I will check the VCF files for signs of mixtures in these sequences.
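One possible way to do such a check is sketched below. It is an illustration only, not a method anyone in this thread describes: it assumes the variant caller writes an allele frequency as an `AF=` tag in the VCF INFO column (not all callers do), and the 0.2 to 0.8 window for "intermediate" frequencies is an arbitrary choice. The function name and the example records are made up.

```python
def mixed_sites(vcf_lines, low=0.2, high=0.8):
    """Return (pos, ref, alt, af) for variants with intermediate allele frequency.

    Assumes an "AF=<float>" tag in the INFO column; records without it are skipped.
    Many sites at intermediate frequency can hint at a mixture of two viruses.
    """
    hits = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # Standard VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO
        pos, ref, alt, info = int(fields[1]), fields[3], fields[4], fields[7]
        tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        if "AF" not in tags:
            continue
        af = float(tags["AF"].split(",")[0])  # first ALT allele only
        if low <= af <= high:
            hits.append((pos, ref, alt, af))
    return hits


# Made-up records purely to show the input format; not data from these samples.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MN908947.3\t100\t.\tC\tT\t.\tPASS\tDP=900;AF=0.99",
    "MN908947.3\t200\t.\tG\tA\t.\tPASS\tDP=850;AF=0.55",
    "MN908947.3\t300\t.\tA\tG\t.\tPASS\tDP=700",
]
print(mixed_sites(example))
```

In a real check, the interesting question would be whether the intermediate-frequency sites coincide with positions that distinguish Delta from Omicron.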

corneliusroemer commented 2 years ago

Thanks @markusglass for sharing your findings and bringing these sequences to our attention. It's always great when people take a look at their own sequences and share any findings.

I ran all the sequences you highlighted through nextclade.org.

To me it looks very much like there is some co-infection/contamination going on instead of recombination, because:

@thomasppeacock also had a quick look and had the same thoughts. Since @AngieHinrichs also agrees, I'll close this issue for now, but feel free to comment with updates if you investigate further. You could also share raw reads so others can see what's going on at the amplicon level.

Here are some graphics from Nextclade:

[image] [image]