cov-lineages / pango-designation

Repository for suggesting new lineages that should be added to the current scheme
Other
1.04k stars 98 forks source link

Potential Problem assigning B.1.623 and B.1.621 #115

Closed rmcolq closed 3 years ago

rmcolq commented 3 years ago

Copied over from https://github.com/cov-lineages/pangolin/issues/253

Dear Pangolin team,

When assigning lineages to patient samples with pangolin I stumbled over a cluster that looks like B.1.621 but was assigned B.1.623. The mutations we find definitely fit the B.1.621 in GISAID and not B.1.623, so either the lineage definition was changed or something else went wrong. I uploaded the sequences to USHER (https://hgwdev-angie.gi.ucsc.edu/cgi-bin/hgPhyloPlace) and got a strange result there too. Usher classified the sequence as B.1.621, but the closest sequences where B.1.623 (I guess classified by pangolin in GISAID). An example of the AA changes our sequences have : ORF1ab:T1055A; ORF1ab:T1538I; ORF1ab:N2147D; ORF1ab:T2952I; ORF1ab:T3255I; ORF1ab:Q3729R; ORF1ab:P4715L; ORF1ab:P5743S; ORF1ab:P5815S; ORF1ab:A6532V; ORF1ab:S6713L; S:T95I; S:Y144S; S:R346K; S:E484K; S:N501Y; S:D614G; S:P681H; S:D950N; ORF3a:Q57H; ORF3a:V256fs; ORF3a:V259L; ORF8:T11K; ORF8:P38S; ORF8:S67F; N:T205I

Of these the following fit B.1.621 in https://outbreak.info/situation-reports?pango=B.1.621 ORF1ab:T1055A; ORF1ab:T1538I; ORF1ab:T3255I; ORF1ab:Q3729R; ORF1ab:P4715L; ORF1ab:P5743S; S:T95I; S:Y144S; S:R346K; S:E484K; S:N501Y; S:D614G; S:P681H; S:D950N; ORF3a:Q57H; ORF8:T11K; ORF8:P38S; ORF8:S67F; N:T205I and the following B.1.623 in https://outbreak.info/situation-reports?pango=B.1.623 ORF1ab:P4715L; S:N501Y; S:D614G; S:P681H; ORF3a:Q57H; with 9 AA changes missing

Attached please find a fasta with 10 of the sequences. I use: pangolin==3.0.3 lineages-2020-05-19-2 pangoLEARN-2021-06-15 scorpio-0.3 constellations-0.0.5 pango-designation v1.2.13 All the best and thanks again fro the great tools you provide,

Sorry, just checked on GISAID (17/06/2021, 1:24:19 PM), and one of the closest sequences according to USHER, EPI_ISL_2220456, is also classified as B.1.623 (version 3.1.1, lineages version 2021-06-15), although the mutations eg: Spike D614G, Spike D950N, Spike E484K, Spike ins145N, Spike N501Y, Spike P681H, Spike R346K, Spike T95I, Spike Y144T, Spike Y145S would rather fit B.1.621 cov-lineages/pango-designation#62 than B.1.623 cov-lineages/pango-designation#25 Lukas

b.1.623_run_994_10.fasta.gz Dr. Lukas Endler Senior Postdoc

CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences Lazarettgasse 14, AKH BT 25.3 1090 Vienna, Austria Phone +43-1/40160-70055 lendler@cemm.oeaw.ac.at www.cemm.at

jcw349 commented 3 years ago

Hi,

We are also seeing all of our B.1.623 sequences with B.1.621 mutations listed above clustering together and growing rapidly over time. These "B.1.623" are all branching off of the same ancestral node in B.1.621 and are genetically unrelated to the canonical B.1.623. I think this might be an emerging sublineage of B.1.621, suggesting B.1.621.1 for the sublineage.

Jade Wang

rmcolq commented 3 years ago

Hi @jcw349, @luenling - are any of your sequences on GISAID? If so, some accessions/sequence IDs would be awesome!

ha-ruiz75 commented 3 years ago

Hello, all.

I have also experienced this issue in our sequencing results. I have compiled a list of GISAID sequences with <5% Ns that should be assigned as B.1.621 but are assigned to other lineages (mostly B.1.623).

List

genomes_should_be_B_1_621.txt

jcw349 commented 3 years ago

@rmcolq We are working in collaboration with Pandemic Response Lab (PRL). So far only 1 sequence made it onto GISAID, EPI_ISL_1828364. I uploaded some more today will let you know if they make it through.

In the mean time. This is the phylogenetic tree we put together. The leaf names are a substring of GISAID virus names if the genomes make it onto GISAID.

The "Assigned_Pangolin_Lineage" indicates the lineage to which the genome was assigned (B.1.621 or B.1.623). As you can see a subset genomes assigned to B.1.623 (blue) are a sublineage of B.1.621 (yellow with red branch). This sublineage are genomically similar and they all have S:E484K, S:N501Y, as well as other mutations Dr. Endler pointed out. The other B.1.623 on the other hand do not all have these mutations and are genetically distinct from the B.1.621 clade.

luenling commented 3 years ago

Dear Rachel @rmcolq , I just uploaded some of our sequences to GISAID, so hopefully I can send you the IDs in the next days. Thanks for looking into it and all the best, Lukas

luenling commented 3 years ago

Dear Rachel @rmcolq ,

Attached a list of 48 GISAID IDs and virus names (tsv with first 4 columns from GISAID) of our sequences that were assigned B.1.623 and should be B.1.621. They are all from a cluster in Vorarlberg so not much diversity there.

All the best, Lukas

b.1.621_cluster_Vorarlberg_CeMM.txt

jcw349 commented 3 years ago

Hi, I realized our sequences may have been rejected by GISAID, without error messages from them, due to frameshift insertions and deletions.

According to cov-glue, they have: Novel non-codon-aligned insertion in S: 21990-TAC-21991 Novel frameshifting deletion in ORF 3a: nucleotides 26158-26161

I will double checking our depth of coverage and attempt to resubmit. Is there another way for us to send you the fasta files in the mean time?

Thank you, Jade W.

rmcolq commented 3 years ago

@chrisruis I took an initial look at this, but it really isn't a simple case and I don't have the experience to resolve it. It looks like when the ids provided are uploaded to https://hgwdev-angie.gi.ucsc.edu/cgi-bin/hgPhyloPlace they mostly cluster together on the big tree, but the nearby lineages are not nicely coloured by pangolin lineage - probably haven't sampled enough of the diversity in the region in pango-designations to date. There are several subtrees, the clearest of which is this:

Screenshot 2021-06-18 at 16 37 08

Can you take a look?

AngieHinrichs commented 3 years ago

@rmcolq it looks the default subtree size of 50 was used -- it's really meant for uploading only a handful of sequences and seeing the names of sequences that they're most closely related to. For bigger tree questions like this, I would use a much larger subtree size like 1000 (the max is currently 5000) -- it gives a lot more context, and you can then click on branches of interest to zoom in and see the leaf labels.

Regarding the appearance of B.1 and B.1.429 calls in the branch that's mostly B.1.621 -- I usually interpret that as either pangoLEARN being affected by noise / new mutations that have appeared in other lineages, or perhaps lots of Ns (or contamination, new mutations etc) leading to multiple equally parsimonious placements for that sequence. UShER picked the placement that you see, but that might not be the only plausible placement for that sequence. Unfortunately there's no way to tell about Ns aside from examining the original sequence at this point. (If I had tons of time, I would find a way to get more of that info into the metadata...)

One more thing: https://hgwdev-angie.gi.ucsc.edu/cgi-bin/hgPhyloPlace is unstable -- it's my sandbox and may be completely broken when I'm working on something new. https://genome.ucsc.edu/cgi-bin/hgPhyloPlace is the official location, although its tree & metadata may lag by a day (or sometimes a few days) and its code lags by ~one week to one month relative to the more-stable-than-hgwdev-angie test server hgwdev.gi.ucsc.edu. :)

AngieHinrichs commented 3 years ago

I pasted in both Héctor's and Lukas's sample IDs with a subtree size of 1000 so you can see more B.1.621 sequences (purple) around the cluster that is very consistently being misidentified as B.1.623 (shades of blue):

https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/pango-designation-115.json?c=pango_lineage

image

rmcolq commented 3 years ago

That looks much clearer! Thanks @AngieHinrichs. It appears that there are 2 extra ORF1a mutations (N2147D and T2952I) on the branch leading to the cluster, but otherwise it clearly fits well within the B.1.621 lineage. I don't see evidence of these new samples being corresponding to an introduction to a new place. Are you all happy with a lineage definition of B.1.621 for your samples (in which case I will pick some to add to the designations list)? Alternatively, do you have any evidence suggesting it should be a new lineage (https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/)?

jcw349 commented 3 years ago

Hi, I think B.1.621 is a more appropriate assignment than B.1.623, however, I think this might be a sub-lineage of B.1.621.

We didn't get to put a phylogenetic tree together with all the sequences listed by others in this thread, but based on our data, it seems like the B.1.623 group that needs to be reclassified (green) and is growing faster in NYC than other B.1.621 (blue) (see Counts over time figure). It looks like in Angie Hinrichs tree, B.1.623 and "uploaded sample" are also mostly falling into 1 branch as well.

Counts over time: image

single branch in B.1.621 In the phylogenetic tree, misclassified group (purple branch) is a single branch with large number of non-identical sequences, that stem from the B.1.621 (red branch) image

NYC PHL and PRL are working on resubmitting these sequences to GISAID. As mentioned earlier, there are novel insertions and deletions in this cluster that are causing GISAID to reject our sequences. Sorry for the delays!

Thank you! Jade W.

rmcolq commented 3 years ago

I've added as many of these as I could to a newly designated lineage B.1.621.1, released in pango-designation v1.2.29. This will filter into the training over the next week.