cov-lineages / pangolin-data

Repository for storing latest model, protobuf, designation hash and alias files for pangolin assignments
GNU General Public License v3.0

Unexpected assignment of (potential) recombinants #54

Open MarieLataretu opened 5 months ago

MarieLataretu commented 5 months ago

Hi there,

First, thanks for your work and the latest updates!

We stumbled across a few samples from the last few months that pangolin assigns to a top-level lineage, namely BA.2 or XBB.1. The Nextclade clade assignment resolves to recombinant, and the Nextclade_pango assignment is XDD or XCT.1. Since XDD and XCT.1 were not part of pangolin-data version 1.23.1, it's not surprising that pangolin does not assign these lineages.

However, we'd expect that pangolin would assign a (new) recombinant with the latest data release. I did a little test series:

| sample | pangolin-data 1.23.1 | pangolin-data 1.24 | pangolin-data 1.25 | pangolin-data 1.25.1 | nextclade2 2024-01-15 | nextclade3 2024-01-16 | nextclade3 2024-02-16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 82 | BA.2 | BA.2 | JN.1.1 | JN.1.1 | XDD | XDD | XDS |
| 84 | BA.2 | JN.1.1 | JN.1.1 | JN.1.1 | XDD | XDD | XDS |
| 85 | XBB.1 | JN.1.1 | JN.1.1 | JN.1.1 | XDD | XDD | XDS |
| 63 | XBB.1 | BA.2 | BA.2 | BA.2 | XCT.1 | XCT.1 | XCT.1 |
| 30 | XBB.1 | XCT.1 | XCT.1 | XCT.1 | XCT.1 | XCT.1 | XCT.1 |
| 51 | BA.2 | XDD | JN.1.1 | JN.1.1 | XDD | XDD | XDD |

(Tool versions: pangolin v4.3, nextclade3 v3.2.1, nextclade2 v2.14.0)

I'm now wondering whether this is a problem in pangolin or whether we are seeing an undesignated lineage. I read that Nextclade is not perfect at assigning recombinants. However, it is (more) consistent across the dataset versions.
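To put a number on "more consistent", here is a minimal sketch that counts how often each sample's call changes between consecutive releases. The assignments are copied from the table in this comment; the function name `flip_flops` is made up for illustration and is not part of pangolin or Nextclade.

```python
# Assignments per sample, in the column order of the table above.
# pangolin-data: 1.23.1, 1.24, 1.25, 1.25.1
PANGOLIN = {
    "82": ["BA.2", "BA.2", "JN.1.1", "JN.1.1"],
    "84": ["BA.2", "JN.1.1", "JN.1.1", "JN.1.1"],
    "85": ["XBB.1", "JN.1.1", "JN.1.1", "JN.1.1"],
    "63": ["XBB.1", "BA.2", "BA.2", "BA.2"],
    "30": ["XBB.1", "XCT.1", "XCT.1", "XCT.1"],
    "51": ["BA.2", "XDD", "JN.1.1", "JN.1.1"],
}
# Nextclade datasets: 2024-01-15, 2024-01-16, 2024-02-16
NEXTCLADE = {
    "82": ["XDD", "XDD", "XDS"],
    "84": ["XDD", "XDD", "XDS"],
    "85": ["XDD", "XDD", "XDS"],
    "63": ["XCT.1", "XCT.1", "XCT.1"],
    "30": ["XCT.1", "XCT.1", "XCT.1"],
    "51": ["XDD", "XDD", "XDD"],
}


def flip_flops(table):
    """Count, per sample, how many times the call changes between
    consecutive releases (hypothetical helper, for illustration only)."""
    return {
        sample: sum(a != b for a, b in zip(calls, calls[1:]))
        for sample, calls in table.items()
    }


if __name__ == "__main__":
    print("pangolin :", flip_flops(PANGOLIN))
    print("nextclade:", flip_flops(NEXTCLADE))
```

A change count alone does not say which call is right; it only highlights samples such as 51, whose pangolin call changes twice, as candidates for a closer look.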

I'm happy for any input or feedback! 🙂

Best Marie

AngieHinrichs commented 5 months ago

Hi Marie -- without looking at the sequences, I can't say for sure what's going on. Are they in GISAID? If not, are you able to upload them to https://usher.bio/ (select the full tree of 16M sequences including GISAID and increase sample size to >= 500) in order to see which sequences they most closely resemble, and what mutations make your sequences different?

Unlike nextclade, pangolin doesn't have a general 'recombinant' category; it can only assign Pango lineages. Some things that may lead to flip-flopping assignments in successive releases are a high number of N or other ambiguous bases, or a mix of mutations associated with different lineages, whether that's due to a new recombinant, mixed infection or contamination in sequencing.

If the sequences are in GISAID, there are some very keen volunteers such as @aviczhl2, @JosetteSchoenma and @FedeGueli who search for new potential recombinants and may have already taken a look.

FedeGueli commented 5 months ago

> Hi Marie -- without looking at the sequences, I can't say for sure what's going on. Are they in GISAID? If not, are you able to upload them to https://usher.bio/ (select the full tree of 16M sequences including GISAID and increase sample size to >= 500) in order to see which sequences they most closely resemble, and what mutations make your sequences different?
>
> Unlike nextclade, pangolin doesn't have a general 'recombinant' category; it can only assign Pango lineages. Some things that may lead to flip-flopping assignments in successive releases are a high number of N or other ambiguous bases, or a mix of mutations associated with different lineages, whether that's due to a new recombinant, mixed infection or contamination in sequencing.
>
> If the sequences are in GISAID, there are some very keen volunteers such as @aviczhl2, @JosetteSchoenma and @FedeGueli who search for new potential recombinants and may have already taken a look.

Recombinants have been tracked by @aviczhl2, @JosetteSchoenma and @over-there-is, so I don't think anything went under the radar. But I can suggest checking whether any EPI_ISL of this putative lineage is present in https://github.com/sars-cov-2-variants/lineage-proposals/issues/957#issuecomment-1954497147 via a simple query with the GitHub search tool or, more specifically, by looking for them in this .tsv: https://github.com/sars-cov-2-variants/lineage-proposals/blob/main/recombinants.tsv

If I can get a list of the IDs, I could search for them myself and then post an update here.
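For anyone who wants to do that lookup locally, here is a minimal sketch that scans the text of a downloaded copy of recombinants.tsv for a list of EPI_ISL IDs. It makes no assumption about the file's column layout and simply searches every line; the function name `find_ids` and the sample text in the usage block are made up for illustration.

```python
import re

# GISAID accession IDs look like EPI_ISL_ followed by digits.
EPI_ISL_PATTERN = re.compile(r"EPI_ISL_\d+")


def find_ids(tsv_text, wanted_ids):
    """Return {id: [line numbers]} for each wanted ID found in the text.

    Line numbers are 1-based. Every line is searched as plain text, so
    no knowledge of the TSV columns is needed.
    """
    wanted = set(wanted_ids)
    hits = {}
    for line_no, line in enumerate(tsv_text.splitlines(), start=1):
        for found in EPI_ISL_PATTERN.findall(line):
            if found in wanted:
                hits.setdefault(found, []).append(line_no)
    return hits


if __name__ == "__main__":
    # Made-up two-line example; the real file has a different layout.
    example = "name\tquery\nexample-row\tEPI_ISL_18599826, EPI_ISL_1\n"
    print(find_ids(example, ["EPI_ISL_18599826", "EPI_ISL_18385324"]))
```

Matching whole IDs rather than substrings avoids a false hit when one accession number is a prefix of another.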

JosetteSchoenma commented 5 months ago

IMO, the best way to know if a batch of samples includes recombinants (if you are not used to recognizing them in Nextclade), is to look through GitHub issues and run the mentioned GISAID queries. Which of course takes time!

Nextclade and Pangolin will always be a bit behind and sometimes inaccurate.

But if you have a list with EPI_ISL numbers or if you could tell me which country and dates you're interested in, one of us will probably be happy to have a look.

aviczhl2 commented 5 months ago

There are hundreds of different undesignated recombinants. Most of them are registered in https://github.com/sars-cov-2-variants/lineage-proposals/issues/991 and https://github.com/sars-cov-2-variants/lineage-proposals/blob/main/recombinants.tsv. If you see new ones, you are welcome to register them in that repo too.

MarieLataretu commented 5 months ago

Hi all, thanks for all the feedback!

Unfortunately, only one sequence is on GISAID - I can keep you posted on that (best case, next week, I'd say). EPI_ISL_18599826 is the 4th sample (63 in the table).

> Some things that may lead to flip-flopping assignments in successive releases are a high number of N or other ambiguous bases, or a mix of mutations associated with different lineages, whether that's due to a new recombinant, mixed infection or contamination in sequencing.

The N content is decent (below 3.9 %), and ambiguous bases are masked.
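A minimal sketch of how the N content and the share of other ambiguous bases can be computed for a consensus sequence. The function name `base_composition` is made up for illustration, and the ambiguity codes are the standard IUPAC nucleotide codes other than N.

```python
# IUPAC ambiguity codes other than N (N is counted separately).
AMBIGUOUS_CODES = set("RYSWKMBDHV")


def base_composition(seq):
    """Return the fraction of N and of other ambiguous bases in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n_count = seq.count("N")
    ambiguous_count = sum(1 for base in seq if base in AMBIGUOUS_CODES)
    return {
        "n_fraction": n_count / len(seq),
        "ambiguous_fraction": ambiguous_count / len(seq),
    }


if __name__ == "__main__":
    # Toy sequence: 2 of 8 bases are N, 2 of 8 are other ambiguity codes.
    print(base_composition("ACGTNNRY"))
```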

I checked the mapping and it does not look like a mixed infection.

Nextclade's qc.privateMutations.status ranges from good, to mediocre, to bad - not sure if this is a good proxy for a mix of mutations from different lineages 🤔

I threw the samples in https://usher.bio/ (full tree, sample size to 1000). Here is a screenshot of the overview: recombinants_hgPyhyloPlace

For pangolin-data 1.25.1, only one sample differs (JN.1.1 vs XDD; was XDD with 1.24)

JosetteSchoenma commented 5 months ago

The first 3 are linked to this singlet that @aviczhl2 found. You would have to put them all together in Nextclade to see if they match.

EPI_ISL_18715763 https://github.com/sars-cov-2-variants/lineage-proposals/issues/991#issuecomment-1876243840

JosetteSchoenma commented 5 months ago

The 4th one, called BA.2, is linked to a pretty clean XCT.1 with only a reversion of C7051T. EPI_ISL_18599826

JosetteSchoenma commented 5 months ago

The 5th is linked to a completely normal XCT.1 from Austria. EPI_ISL_18385324

JosetteSchoenma commented 5 months ago

The 6th is linked to a completely normal looking XDD from France. You could check yours for mutations C6541T, A7842G, T15756A and A26275G to confirm it is an XDD.

AngieHinrichs commented 5 months ago

Thanks for the insights @JosetteSchoenma. @MarieLataretu you can see a lot more detail about the neighboring sequences, and what mutations separate your sequences from those sequences, if you click on the 'view in Nextstrain' links.

MarieLataretu commented 5 months ago

> The 6th is linked to a completely normal looking XDD from France. You could check yours for mutations C6541T, A7842G, T15756A and A26275G to confirm it is an XDD.

I checked the four mutations (in the Nextclade output), and all 4 are present!
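For a whole batch, the same check can be scripted. This is a minimal sketch that assumes the Nextclade TSV output has a `seqName` column and a comma-separated `substitutions` column; the function names are made up for illustration, and the four mutations are the ones suggested above.

```python
import csv
import io

# The four mutations suggested above for confirming XDD.
XDD_CHECK = ("C6541T", "A7842G", "T15756A", "A26275G")


def missing_mutations(substitutions, required=XDD_CHECK):
    """Return the required mutations absent from a comma-separated list."""
    present = {m.strip() for m in substitutions.split(",") if m.strip()}
    return [m for m in required if m not in present]


def check_nextclade_tsv(tsv_text):
    """Map each sequence name to the required mutations it lacks.

    Assumes 'seqName' and 'substitutions' columns in the TSV.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["seqName"]: missing_mutations(row["substitutions"])
        for row in reader
    }


if __name__ == "__main__":
    # Made-up rows for illustration only.
    example = (
        "seqName\tsubstitutions\n"
        "sample_a\tC241T,C6541T,A7842G,T15756A,A26275G\n"
        "sample_b\tC241T,C6541T\n"
    )
    print(check_nextclade_tsv(example))
```

An empty list means all four mutations are present. A mutation can also be reported as missing because the position is not covered, so a non-empty list is a prompt to look at the sequence, not a verdict.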

The subtree in Nextstrain does not show any mutations: [screenshot]

Do I interpret it correctly that it's indeed an XDD (most probably)?

JosetteSchoenma commented 5 months ago

> > The 6th is linked to a completely normal looking XDD from France. You could check yours for mutations C6541T, A7842G, T15756A and A26275G to confirm it is an XDD.
>
> I checked the four mutations (in the Nextclade output), and all 4 are present!
>
> The subtree in Nextstrain does not show any mutations: [screenshot]
>
> Do I interpret it correctly that it's indeed an XDD (most probably)?

Yes, very likely an XDD.

MarieLataretu commented 5 months ago

> The 4th one, called BA.2 is linked to a pretty clean XCT.1 with only a reversion of C7051T. EPI_ISL_18599826

Oh shoot, I overlooked that one sample is already on GISAID! 🙈

The 4th sample (63 in the table) is exactly EPI_ISL_18599826!

MarieLataretu commented 5 months ago

> The first 3 are linked to this singlet that @aviczhl2 found. You would have to put them all together in Nextclade to see if they match.
>
> EPI_ISL_18715763 sars-cov-2-variants/lineage-proposals#991 (comment)

They are linked, but the 3 sequences have 4 additional mutations in ORF1ab compared to EPI_ISL_18715763:

[screenshot]

AngieHinrichs commented 5 months ago

@MarieLataretu I would like to look into why your sixth sample (51) is not classified as XDD by recent versions of pangolin-data. Can you share the sequence (email: angie at soe dot ucsc dot edu), or if that's not allowed, update this issue with its EPI_ISL ID when it is in GISAID? Thanks!

aviczhl2 commented 5 months ago

> > The first 3 are linked to this singlet that @aviczhl2 found. You would have to put them all together in Nextclade to see if they match. EPI_ISL_18715763 sars-cov-2-variants/lineage-proposals#991 (comment)
>
> They are linked, but the 3 sequences have 4 additional mutations in the ORF1ab compared to EPI_ISL_18715763:
>
> [screenshot]

This looks like an independent new HV.1/JN.1 recombinant with a similar breakpoint to 18715763 (which is a JG.3/JN.1 recombinant). The "additional mutations" basically revert the JG.3-defining mutations and add the HV.1-defining ones.

AngieHinrichs commented 5 months ago

Thanks @MarieLataretu for sharing the sample 51 sequence. It turns out that one missing mutation (or reversion to reference relative to XDD) is causing it to be placed just short of XDD in the pangolin-data 1.25.1 minimized tree.

In the minimized tree, the final node on the path to XDD has these mutations:

C6541T, G11727A, C18894T, T22926C, A26275G, C26529G, T26681C, T26833C, C29625T

Sample 51 has all of those except for T22926C. If it had an N at 22926, then usher would impute a C because of all the other matches, but it has the reference allele T at 22926. So usher splits that node up, creating a new node with all mutations except T22926C, and moving the original node (labeled XDD) to become a child of the new node with only T22926C. Sample 51 also becomes a child of the new node -- a sibling of XDD, so it misses the assignment. That's the long way of saying that missing a single mutation at the final node can cause a missed assignment, unfortunately.
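The effect described above can be shown with a toy model. This is only a sketch of the matching logic as explained in this comment, not usher's actual placement algorithm: a sample reaches the labeled node only if, for every mutation on that node, it carries the mutated allele or an N (which is treated as imputable). The function names and the allele dictionary are made up for illustration.

```python
import re

# Mutations on the final node on the path to XDD (from the comment above).
XDD_NODE = [
    "C6541T", "G11727A", "C18894T", "T22926C", "A26275G",
    "C26529G", "T26681C", "T26833C", "C29625T",
]


def parse(mutation):
    """Split 'T22926C' into (reference base, position, mutated base)."""
    ref, pos, alt = re.fullmatch(r"([ACGT])(\d+)([ACGT])", mutation).groups()
    return ref, int(pos), alt


def place(node_mutations, sample_alleles):
    """Return (shared, missing) node mutations for a sample.

    sample_alleles maps position -> observed base ('N' if unknown).
    Positions not listed are assumed to carry the reference base.
    """
    shared, missing = [], []
    for mutation in node_mutations:
        ref, pos, alt = parse(mutation)
        base = sample_alleles.get(pos, ref)
        if base == alt or base == "N":  # N is imputed from the other matches
            shared.append(mutation)
        else:
            missing.append(mutation)
    return shared, missing


def reaches_label(node_mutations, sample_alleles):
    """True if nothing is missing, i.e. the sample lands on the labeled node."""
    return not place(node_mutations, sample_alleles)[1]


if __name__ == "__main__":
    # A sample with every XDD node mutation except a reference T at 22926.
    sample = {parse(m)[1]: parse(m)[2] for m in XDD_NODE}
    sample[22926] = "T"
    print(place(XDD_NODE, sample)[1], reaches_label(XDD_NODE, sample))
    # The same sample with N at 22926 instead of the reference allele.
    sample[22926] = "N"
    print(place(XDD_NODE, sample)[1], reaches_label(XDD_NODE, sample))
```

With the reference T at 22926 the sample has one missing mutation and stops short of the label; with N at that position nothing is missing.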

In the full tree, there are some XDD sequences that share the mutation G5155A with sample 51, so sample 51 is placed in XDD on that branch, with one private mutation (T21810C) and multiple reversions to reference (T21711C, C22926T, G26610A):

image

https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/pangolin-data-54.json?branchLabel=nuc%20mutations&label=id:node_6955286

How strong is the read-level evidence for sample 51 having the reference allele instead of the expected XDD mutations at reference positions 21711, 22926 and 26610? If the coverage is very low there, it would be better from the usher point of view to have N instead of reference allele.
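As an illustration of the "N instead of reference allele" point, here is a minimal sketch that masks consensus positions whose read depth falls below a threshold. The function name `mask_low_coverage` and the default threshold of 10 reads are assumptions made for this example, not a recommendation from usher or pangolin; consensus pipelines usually expose their own setting for this.

```python
def mask_low_coverage(consensus, depth, min_depth=10):
    """Replace bases with N wherever read depth is below min_depth.

    consensus: the consensus sequence as a string
    depth:     per-position read depth, same length as consensus
    min_depth: hypothetical threshold, chosen only for illustration
    """
    if len(consensus) != len(depth):
        raise ValueError("consensus and depth must have the same length")
    return "".join(
        base if d >= min_depth else "N"
        for base, d in zip(consensus, depth)
    )


if __name__ == "__main__":
    # Toy example: the second and fourth positions are poorly covered.
    print(mask_low_coverage("ACGT", [50, 3, 10, 0]))
```

For sample 51, the positions to inspect in the depth track would be 21711, 22926 and 26610 (1-based reference coordinates, so index `position - 1` in a Python list covering the full reference).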

I can make the matching a little less stringent in the next release of pangolin-data by adding a pseudo-lineage label "XDD_dropout" in the full tree, a couple nodes upstream of XDD. When minimizing the full tree to make the next release of pangolin_data, the "_dropout" will be truncated so there will be a second "XDD" label a bit upstream of where XDD really starts, and that will assign XDD a bit more broadly (hopefully not too broadly).
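The label handling described above amounts to dropping a suffix. A minimal sketch, assuming the pseudo-lineage labels always end in "_dropout"; the function name `minimized_label` is made up for illustration and this is not the actual pangolin-data build code.

```python
DROPOUT_SUFFIX = "_dropout"


def minimized_label(label):
    """Strip the '_dropout' suffix so the pseudo-lineage maps to its lineage."""
    if label.endswith(DROPOUT_SUFFIX):
        return label[: -len(DROPOUT_SUFFIX)]
    return label


if __name__ == "__main__":
    for label in ("XDD_dropout", "XDD", "JN.1.1"):
        print(label, "->", minimized_label(label))
```

After truncation, both the upstream pseudo-node and the real XDD node carry the label "XDD", which is what widens the assignment.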

MarieLataretu commented 4 months ago

Thanks for the insight, @AngieHinrichs ! I'll check the mentioned positions in detail and get back to you. (It might take some time, because I'm travelling atm)