cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme

AY.34 may need to be broadened to catch all descendants from common ancestor with S:Q677H #233

Closed corneliusroemer closed 3 years ago

corneliusroemer commented 3 years ago

This is more of a discussion/question than a proposal, because the right way forward isn't clear to me.

A week or two ago, AY.34 was designated in #217, proposed by @c19850727. It looks like this designation was too narrow. When I look at a current tree, it'd be best to make this lineage broader, by removing S:L1265F from the list of defining mutations.

The entire cluster with S:Q677H and Orf1a:3059F seems significant. The sub-lineage with S:L1265F does not really seem extra significant on top of these root mutations.

Have a look at this current Europe tree: the current AY.34 makes up only 3 sequences (subsampled) in it, whereas almost 10 times as many belong to the larger S:677H cluster.

image

https://nextstrain.org/groups/neherlab/ncov/europe/2021-10-04?branchLabel=spike_mutations&c=S1_mutations&label=spike_mutations:Q677H

Given that it is not unreasonable to associate S:Q677H with a change in function due to its previous occurrence in other VOCs, whereas this is not the case for S:L1265F, I think it would make more sense to redefine what is included in AY.34 and broaden it.

The easiest would be just to designate those sequences with S:Q677H and Orf1a:3059F as also AY.34. Since the designation is still fresh, it should not throw too many people off - AY.34 has only just been applied by GISAID.

Here are some sample sequences from the broader cluster: q677_cluster.txt

Here are Usher results: image

https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/singleSubtreeAuspice_genome_3df6f_b5b3e0.json?label=nuc%20mutations:T9053G

c19850727 commented 3 years ago

@corneliusroemer Thanks. And yes I agree that it might make more sense now to use S:Q677H and Orf1a:3059F as the defining substitutions.

When I was submitting the designation proposal, I first noticed its repeated appearance among border-entry tests, which happened to have S:L1265F, but it's clear now that they belong only to a subclade of what's growing fast.

FedeGueli commented 3 years ago

This change would be of great relevance to track the epidemic: with the current defining mutations it already represents a significant share of genomes sequenced in France and Italy and probably it has been disseminated elsewhere too. Example: Chile

FedeGueli commented 3 years ago

Screenshot_2021-10-05-13-57-50-744_com android chrome

FedeGueli commented 3 years ago

Screenshot_2021-10-05-13-59-55-805_com android chrome

corneliusroemer commented 3 years ago

@chrisruis @rmcolq (let me know if I should not tag you like this. It's the best way I know of right now to draw attention to important issues).

This issue is relatively urgent because the lineage has already been released: the sooner the broadening (if it happens) is decided on, the less it breaks existing assumptions about what is and is not AY.34.

theosanderson commented 3 years ago

@corneliusroemer I'm not sure this lineage has been defined in the way you fear, despite the use of "defining" in that Issue. Of AY.34 predictions in GISAID at present, the vast majority lack S:L1265F. And the UShER Taxonium tree (with some lineage predictions still lagging) shows reasonable coverage of AY.34 across the 677H clade. image

Edit: and this is true of designations too: the majority of designated seqs lack L1265F. So essentially I think the curators had already made the decision you are suggesting.

AngieHinrichs commented 3 years ago

The pangoLEARN/pangoLEARN/data/lineageTree.pb path to AY.34 is the path to B.1.617.2 plus these (and it matches the root node of all representative sequences in the full UCSC/UShER tree):

C10029T > C19220T > G28916T > G4181T > C27874T > C6402T > C7124T > G9053T > C8986T > T6402C > A11332G > A11201G > C6402T > C21846T > T9053G > G19563A,C26681T > G9441T > G18255T,G23593C > G26109A > C22498T,G27014T

S:Q677H is G23593C, so it's the third-to-last node in that path, followed by two nodes with synonymous mutations (G26109A is ORF3a:E239E, C22498T is S:I312I, G27014T is M:L164L). And in the full tree, it looks like there's only one sample hanging off each of the last two nodes with synonymous mutations (CzechRepublic/CSQ0402/2021|EPI_ISL_3446650|2021-08-09 and Italy/FVG-UD-380095/2021|EPI_ISL_3980553|2021-08-26), so hopefully not too much of a loss there. :)
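The check described above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the thread: the helper names are hypothetical, and the CDS start coordinates are assumed to be those of the reference genome NC_045512.2 (S at 21563, ORF3a at 25393, M at 26523). It splits the path into nodes, finds how far from the end the node carrying G23593C sits, and maps each nucleotide position to a residue number and codon position.

```python
# Path from B.1.617.2 to AY.34, copied from the comment above.
PATH = (
    "C10029T > C19220T > G28916T > G4181T > C27874T > C6402T > C7124T > "
    "G9053T > C8986T > T6402C > A11332G > A11201G > C6402T > C21846T > "
    "T9053G > G19563A,C26681T > G9441T > G18255T,G23593C > G26109A > "
    "C22498T,G27014T"
)

# Assumption: 1-based CDS start coordinates on reference NC_045512.2.
CDS_START = {"S": 21563, "ORF3a": 25393, "M": 26523}


def parse_path(path):
    """Split an 'A > B,C > D' path into nodes, each a list of mutations."""
    return [node.strip().split(",") for node in path.split(">")]


def nodes_from_end(nodes, mutation):
    """Return 1 if the mutation is on the last node, 2 for the one before, etc."""
    for offset, node in enumerate(reversed(nodes), start=1):
        if mutation in node:
            return offset
    raise ValueError(f"{mutation} not found in path")


def codon_location(gene, nuc_pos):
    """Map a 1-based genome position to (residue number, position in codon)."""
    offset = nuc_pos - CDS_START[gene]
    return offset // 3 + 1, offset % 3 + 1


nodes = parse_path(PATH)
print("G23593C is node number", nodes_from_end(nodes, "G23593C"), "from the end")
for gene, pos in [("S", 23593), ("ORF3a", 26109), ("S", 22498), ("M", 27014)]:
    residue, codon_pos = codon_location(gene, pos)
    print(f"{gene} nucleotide {pos}: residue {residue}, codon position {codon_pos}")
```

Under these assumptions the residue numbers agree with the names used above (S:677, ORF3a:239, S:312, M:164). The sketch does not translate codons, so it cannot confirm on its own which changes are synonymous; that would need the reference sequence.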

corneliusroemer commented 3 years ago

Fantastic, thanks for checking/clarifying @theosanderson and @AngieHinrichs!

What's the best way to quickly check how a lineage is actually defined/designated?

@theosanderson do you have a Taxonium build with only designated sequences? That could be very useful to visualize/study current designations without having any pangoLEARN classification errors showing up.

Similar question for @AngieHinrichs: the lineage tree you mention, is that the reference tree used by pangoLEARN's usher mode? What's the best way to study that tree so that I can quickly do analyses like the one you just helpfully did myself, and don't raise false alarms in the future?

I'll close the issue as it seems to have been a false alarm.

rmcolq commented 3 years ago

The designated set of sequences "define" a lineage rather than the set of mutations we used to pick out the designated set of sequences. As a result there is a degree of variation within the trained models. At the moment there is no one-shot quick way to see how a lineage is actually defined, although we do now check that there aren't too many designated sequences which get incorrectly assigned without the hash.
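One rough way to see what this means in practice is to tabulate how often each mutation occurs among the sequences designated to a lineage, rather than relying on the list of "defining" mutations. The sketch below is a minimal illustration with invented toy data: the two-column `taxon,lineage` layout of the designation table, the sequence names, and the per-sequence mutation calls are all assumptions made for the example, not real designations.

```python
import csv
import io
from collections import Counter

# Toy designation table (assumed layout: "taxon,lineage"); names are invented.
DESIGNATIONS_CSV = """taxon,lineage
seq_a,AY.34
seq_b,AY.34
seq_c,AY.34
seq_d,OTHER
"""

# Invented per-sequence mutation calls, e.g. as produced by an alignment tool.
MUTATIONS = {
    "seq_a": {"S:Q677H", "S:L1265F"},
    "seq_b": {"S:Q677H"},
    "seq_c": {"S:Q677H"},
    "seq_d": set(),
}


def designated_taxa(csv_text, lineage):
    """Return the taxa designated to the given lineage, in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["taxon"] for row in reader if row["lineage"] == lineage]


def mutation_frequencies(taxa, mutations):
    """Return the fraction of the given taxa that carry each mutation."""
    counts = Counter()
    for taxon in taxa:
        counts.update(mutations.get(taxon, set()))
    total = len(taxa)
    return {mut: n / total for mut, n in counts.items()} if total else {}


taxa = designated_taxa(DESIGNATIONS_CSV, "AY.34")
freqs = mutation_frequencies(taxa, MUTATIONS)
for mut, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{mut}: {freq:.0%} of {len(taxa)} designated sequences")
```

In this toy data a mutation carried by every designated sequence shows up at 100%, while one found only in a subclade shows up at a lower fraction, which is the pattern described earlier in the thread for S:Q677H versus S:L1265F.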

The lineageTree.pb file https://github.com/cov-lineages/pangoLEARN/blob/master/pangoLEARN/data/lineageTree.pb is the model used by the pangolin --usher mode. It is created/trained by @AngieHinrichs by subsetting designated sequences and annotating the ancestral/lineage-defining node for each lineage in the usher tree (but done in a more considered way, by selecting multiple subsets etc.). As a result, it may end up annotating a node slightly broader than the set of designations (or possibly sometimes narrower). Similarly, we downsample the set of designated sequences for pangoLEARN training and then use a decision tree to decide the relevant sites for classification. The pangoLEARN model has a summary file here https://github.com/cov-lineages/pangoLEARN/blob/master/pangoLEARN/data/decision_tree_rules.zip but I believe it is not so easy to understand.

theosanderson commented 3 years ago

@corneliusroemer I agree this build would be really useful, and I had a go at it the evening I posted this reply. The challenge is that too many designated seqs aren't in the public tree or are added with too much of a lag. This may become more feasible when Cov2Tree is available within GISAID, which I continue to work on.