Open corneliusroemer opened 2 years ago
Nextclade places about a third in BA.4, a third in BA.5 and for another third it's not sure whether it's BA.4 or BA.5
Quality is a bit of an issue in these sequences, lots of reversions, so I guess that's what pangoLEARN struggles with
But given the large share in South Africa that's called BA.2 rather than BA.4/5 it's probably still worth looking into.
Ran the sequences through pangolin. Result is clear: Usher places correctly, but Scorpio overrides falsely.
We need something like Probable Omicron (BA.4/5-like)
to solve this problem I think @aineniamh @AngieHinrichs @chrisruis
Or not let Probable Omicron (BA.2-like)
override from BA.4/5 to BA.2
@corneliusroemer what version of various components are you using, including Usher. Can uou post your commandline as well?
We've observed this same phenomenon. Here's a near-whole-genome tree (raxml) showing both BA.4 and BA.5 clades with interspersed BA.2 calls. Inspection of the alignment confirms these as correctly placed sequences, but mis-labeled. Branches are colored by call: blue, BA.5; orange, BA.4, light green, BA.2. There are lots of green branches in the blue and orange clades, and even a few BA.3s (yellow).
BA4-and-BA5.lastz.for-phy.raxml.rooted.longnames.2022-05-01.rainbowtree.pdf
@dbtara
❯ pangolin gisaid_hcov-19_2022_05_02_14.fasta -t 10
❯ pangolin --all-versions
pangolin: 4.0.6
pangolin-data: 1.6
constellations: v0.1.9
scorpio: 0.3.17
But I first observed this problem on pango assignments that come from GISAID to covSpectrum. This seems to be a general problem, not restricted to the way I'm using pangolin
.
Same thing happens with pangolin-data: 1.8
It's scorpio
which is at the root of the problem here.
Same thing happens if I use pangoLEARN, most are correctly spotted as BA.4/5 but then Scorpio overrides.
Last week, I alerted GISAID CoVSurfer team about BA.4/5 assignment discrepancy.
Many of them do not have signature spike mutations (L452R & F486V). @PeacockFlu spotted a dropout in the spike region. Further analysis showed that this drop is observed in >10% of recently assigned BA4/5 sequences.
90% of them have N_P151S,NS7b_L11F,NSP1_K141del,NSP1_S142del,NSP1_F143del mutations (probable BA.4)
GISAID is considering to alert the submitters to conduct Sanger sequences or fill the gap in the missing Spike region with alternative primer sets.
This issue is still present becoming ever more severe. Around 50% of Swiss sequences are actually BA.4/5 but wrongly called BA.2 due to this Scorpio problem.
Now that BA.4/5 is becoming dominant in many countries this is raising questions. Can we not just deactivate Scorpio overwrite to BA.2?
@aineniamh
This query below shows actual BA.4/5 that are misclassified as BA.2 or other:
The query was written by @shay671, and I have verified that it is both sensitive and specific (for current BA.4/5).
pangolin --skip-scorpio
(default UShER mode, without scorpio overriding results) can correctly identify those sequences as BA.4 despite the dropout, based on the other mutations in the sequences.
It has been over a month since the release of pangolin v4.0.6, and two months since the release of v4.0 with UShER as the default mode, with an optional assignment cache to speed up lineage assignment for pre-existing sequences, but GISAID is still using pangoLEARN for lineage assignments. GISAID may be unaware of the increased accuracy and stability of UShER mode, and although UShER is significantly slower than pangoLEARN, the assignment cache should help to make up for that. If GISAID hears from multiple community members (hint hint 😃 🙏) that they would prefer UShER assignments,and until scorpio/constellations are improved to better handle BA.4 and BA.5, --skip-scorpio
, maybe they will update their methods.
First of all, thanks a lot for all your great work!
Looks like this issue is persisting on Gisaid.
I noticed a growing number of sequences classified as BA.2 on Gisaid recently.
When I look at the 94 Swiss sequences labeled as BA.2 on Gisaid (submitted between 2023-04-01 and 2023-05-02) and run pangolin (versions see below) with the current default options (pangolin gisaid_hcov-19_2023_05_02_14.fasta
) the lineages are assigned as follows (all sublineages of XBB):
lineage n
EG.1 56
EU.1.1 30
EL.1 3
EM.1 2
EU.1 2
FE.1 1
But they all result in lineage BA.2 when I use scorpio (pangolin --analysis-mode scorpio gisaid_hcov-19_2023_05_02_14.fasta
), which Gisaid might use to classify their sequences. I sent them a message pointing that out.
I was using the following versions.
▶ pangolin --all-versions
pangolin: 4.2
pangolin-data: 1.19
constellations: v0.1.10
scorpio: 0.3.17
usher 0.6.2
gofasta 1.2.0
minimap2 2.24-r1122
faToVcf: 426
Thanks @sschmutz for investigating and pointing out that the issue is still relevant.
To the best of my knowledge, GISAID unfortunately still uses pangoLEARN
, potentially in combination with scorpio. pangolin
's default was changed from pangoLEARN
mode to Usher
mode quite a while ago (maybe a year?) but GISAID has stuck to pangoLEARN
.
The GISAID assignment issues have led covSpectrum to add Nextclade assignments as an additional option (besides pangolin via GISAID data).
It would indeed be nice if GISAID started using the Usher mode, or if that is considered too computationally burdensome Nextclade (though I think Usher sampled should be much faster than the full version, and GISAID could also make use of the designation cache: I'm sure @AngieHinrichs would be happy to advise)
@rmcolq what's the maintenance status of scorpio? Do you intend to add XBB definitions to prevent misclassification as demonstrated by @sschmutz or is scorpio essentially deprecated now? If the latter, it would maybe be good to communicate that to users, maybe GISAID gets the message that way.
We are planning to depreciate pangolearn soon so that will force GISAIDs hand. I'm happy to update Scorpio with some relevant constellations if required. We thought GISAID were using usher and therefore there didn't seem to be a demand for them.
Thanks @rmcolq for the quick and helpful response!
GISAID doesn't annotate what method they use to create their Pango lineage metadata. But I've noticed in the past that it looked often more like pangoLEARN than Usher.
I've got annotations from self-runs Usher, pangoLEARN, Nextclade and annotations from GISAID as colorings on e.g. this tree: https://next.nextstrain.org/groups/neherlab/ncov/europe?c=pango_default
Here are screenshots of the XBB branch. While there's no 100% correspondence between pangoLEARN and GISAID, GISAID is much closer to pangoLEARN than to Usher or Nextclade - though GISAID seems to be less BA.2'ish (wrong) than pangoLEARN, maybe they run some sort of post-processing, no clue.
Colored by GISAID (through metadata)
colored by pangoLEARN (--analysis-mode pangolearn
)
colored by Usher (pangolin default settings)
@AngieHinrichs will know more about the details of what the pangolin assignments reveal about the pangolin flags with which GISAID runs pangolin.
In terms of intact Spikes that are designated BA.2 but clearly not (as they have Spike consistent with other XBB.1.5 lineage Pango forms, quite ), I've included a short list some of common apparent miscalls below just based on Spike. Improbable BA.2 calls are recently coming up most frequently in Africa, but also South America, Europe, and less frequently but still found on other continents. I wonder if a solution could be that the people at Usher could send their baseline calls and regular updates to GISAID, and GISAID only having to run the newest sequences as they come in to keep the volume manageable? I think is is the volume of sequences is likely to have made running Usher not feasible for them in the past, and PangoLEARN calls are obviously often problematic. GISAID does supply information some information on how they make their calls, by the way. For example: EPI_ISL_17559114 includes this information: Pango Lineage: EG.1 (Pango v.4.2 PANGO-v1.19), Omicron (BA.2-like) (Scorpio)
Also, do people know how Genbank maintain its Pango calls, given that they can change over time?
The trees in the last message are very interesting to help sort things through. Here is just a very simplistic look based on Spike proteins, a list of calls that I think are relatively often (>40 times) misassigned to BA.2 among recent (last 60 days) among intact (seqs with N's excluded) sequences encoding Spikes related to XBB.1.5 sublineages, that gives an impression of level of impact of the BA.2 calls on more newly emergent lineages:
Pango Npango Nform %ofBA.2 Spike-Mutations-Relative-To-XBB.1.5-baseline BA.2 4108 1003 24.4% Spike matches XBB.1.5 BA.2 4108 613 14.9% +[Q613H] Spike matches EG.1 BA.2 4108 167 4.1% +[I410V,P521S] Spike matches EU.1.1 BA.2 4108 144 3.5% +[H146K]-[H146Q] Spike matches some XBB.1.5/XBB.1.9.1/XBB.1.9.2 BA.2 4108 142 3.5% +[F456L] Spike matches XBB.1.5.10 BA.2 4108 88 2.1% +[E180V,T478R]-[T478K] Spike matches XBB.1.16 BA.2 4108 54 1.3% +[Q613H,G1124V] Spike matches EG.1 sublineage BA.2 4108 46 1.1% +[Q675H] Spike matches EL.1
Bette Korber
@corneliusroemer
GISAID doesn't annotate what method they use to create their Pango lineage metadata. But I've noticed in the past that it looked often more like pangoLEARN than Usher.
Note that GISAID does in fact annotate the source of their lineage calls: in the LANL data feed the field is called “Pango version”.
The breakdown from yesterday (count, method):
19153293 PLEARN-v1.19
6445229 PANGO-v1.19
5209215 consensus call
96403 SCORPIO_v0.1.10
33996 PLEARN-v1.18
23774 PLEARN-v1.17
11420 PLEARN-v1.15.1
3404 PLEARN-v1.16
2284 PLEARN-v1.14
300 v1.12
28 PANGO-v1.18
And it’s pretty clear that is the problem for these ``BA.2''s.
Here are the counts for “lineage” and "Pangolin version” fields from our feed for the sequences matching the most-common EG.1 spike:
479 BA.2 PLEARN-v1.19
559 EG.1 consensus call
56 EG.1 PANGO-v1.19
184 XBB.1.5 consensus call
4 XBB.1.5 PANGO-v1.19
1 XBB.1.5.15 consensus call
36 XBB.1.9.1 consensus call
1 XBB.1.9.2 consensus call
So, first, all of these sequences share the exact same Spike (T19I,L24-,P25-,P26-,A27S,V83A,G142D,Y145-,H146Q,Q183E,V213E,G252V,G339H,R346T,L368I,S371F,S373P,S375F,T376A,D405N,R408S,K417N,N440K,V445P,G446S,N460K,S477N,T478K,E484A,F486P,F490S,Q498R,N501Y,Y505H,Q613H,D614G,H655Y,N679K,P681H,N764K,D796Y,Q954H,N969K). But all the ones called by PLEARN (and ONLY those called by PLEARN) have been designated BA.2.
The attached tree shows how these sequences (again, all with the same spike) cluster -- rooted on the outbreak strain, but still sensible, with XBB.1.5 most basal, and XBB.1.9 intermediate between that and a big EG.1/BA.2 clade. Note that all the BA.2 designates (except one down at the base) are promiscuously intermingled with EG.1. And in fact, all of the sequences here share most of their 120-130 mutations, and if you partition tree at the first break, 90% of the sequences above the break share this set of mutations:
have:
> G05720A
> C12789T
> A16878T
> A27507C
> T28297C
> C28928T
lack:
T17124C
Point being, it's a solid clade, but PangoLearn is mis-calling (as we've all realized by now).
As @bettekorber noted (and this is her table), this isn't limited to EG.1:
Pango Npango Nform % HD SpikeMutationsRelativeToXBB.1.5
BA.2 4108. 1003 24.4% 0 (consensus) Spike matches XBB.1.5
BA.2 4108 613 14.9% 1 +[Q613H] Spike matches EG.1
BA.2 4108 167 4.1% 2 +[I410V,P521S] Spike matches EU.1.1
BA.2 4108 144 3.5% 1 +[H146K]-[H146Q] Spike matches some XBB.1.5/XBB.1.9.1/XBB.1.9.2
BA.2 4108 142 3.5% 1 +[F456L] Spike matches XBB.1.5.10
BA.2 4108. 88 2.1% 2 +[E180V,T478R]-[T478K] Spike matches XBB.1.16
BA.2 4108 54 1.3% 2 +[Q613H,G1124V] Spike matches EG.1 sublineage
BA.2 4108 46 1.1% 1 +[Q675H] Spike matches EL
So I'm asking the question in a different way: If GISAID is supposed to stop using PangoLEARN, and start using UShER for everything -- which I'm sure many of us will agree on in principle, how can that best be brought about? My own experience suggests that local UShER set-up is, er, non-trivial (I spent several days on it before giving up), it is computationally demanding, and there are north of 15 million sequences, with 5-40k new ones daily. Is it conceivable that the UShER team could negotiate an interaction with GISAID to provide current updates for new sequences on a regular and ongoing basis, and updated designations for all the old sequences?
@rmcolq -- I don't think "forcing GISAID's hand" is really the right approach. Wouldn't it be better to work with them to help figure out how to get their Pango calls right? It's a real problem, and the community needs us and them to get it right.
The reason behind trying to depreciate Pangolearn is not to force gisaid hand. We are losing the member of our team who used to train the models, recently had the server on which it ran wiped after it became corrupted by power cuts (so it all needs to be set up again), plus the amount of computational resources that the training was taking was hitting the ceiling of what was available on that server so an amount of development is needed to allow the models to continue to train. That amount of work can't easily be justified when there is an alternative which is working better for most people already.
I'm a bit tied up over the next couple of days, but am happy to update Scorpio with some new definitions. I can do this for XBB and the short list of the WHO variants of interest. This will at least improve the current situation. I can also reach out to GISAID again.
@rmcolq -- Thanks for your willingness to reach out to GISAID. If they're made aware that PangoLEARN is going away in time to get alternatives set up (and have assistance with that setup), the transition will be less of a headache for everyone ...
The intention is to have a smooth transition that will just run pusher in the absence of a plearn model with a warning printed to screen, so I suspect it won't be necessary for any headaches!
Small update: I just checked which pango-designation version GISAID is currently using and it looks like GISAID is still on v1.19 and doesn't use the latest release v1.20 (from 2 weeks ago) which is Usher only. None of the designations that happened since release 1.19 (30th of March 2023) can be queried via GISAID.
Here's a subset of lineages that you can't query via GISAID's pango lineage search field as a result:
@AngieHinrichs has GISAID been in touch with you re how to use Usher for calling Pango lineages moving forward?
Nope, not a word from GISAID, despite both Rachel and I emailing them about the changes.
The more GISAID users who contact them about lineage assignments, the better I think.
It would be easy for you and me to just publish a simple csv that maps EPI_ISL to Usher/Nextclade Pango assignment and is up to date with the latest designation releases. Then it would be trivial for GISAID to add these lineage assignments internally. What I'm worried about is that most users won't be aware that they are getting outdated lineage assignments from GISAID. Not much we can do unfortunately.
I feel like we have reached out to GISAID in every way we know how to do. Multiple email addresses/contact forms, multiple people, each with offers to help make it happen. I don't believe they can't do it, or that at this stage the resources are a problem. I don't think there is anything we can do which will help them to make the change short of making pangolin break unless you have a sufficiently up to date version of pangolin-data which would be a breaking change which would affect all other users.
I agree that lack of resources is not a plausible cause at this point. I think they need to hear the message from the right people (i.e. not anyone affiliated with UCSC, Nextstrain, ...).
@bettekorber, I see that you are on GISAID's Scientific Advisory Council: https://gisaid.org/about-us/governance/ . Can you ask GISAID to update its installation of pangolin and all pangolin dependencies to their latest versions, so that the latest released lineages from pango-designation v1.20 may be assigned to GISAID SARS-CoV-2 sequences?
@AngieHinrichs and @corneliusroemer, I think this would be very helpful for everybody (me too) and I joined the council with the hope that I could do what I could bring such suggestions to the discussion, although that is all I can do. I'm sorry I didn't notice this thread before, my friend @willfischer recently pointed it out. Could you create such a public file csv anyway, Cornelius, if you haven't already done? It seems like a very useful file.
Hi @bettekorber! As far as I know, none of us have heard any updates from GISAID regarding updating their installation of pangolin and pangolin-data, but from searching for lineages in EpiCoVTM, it appears that they have upgraded pangolin to the latest version (4.3) and are using the second-latest release of pangolin-data (v1.20, released May 24; the latest, v1.21, was released June 26). I get a reasonable number of results when I search EpiCoVTM for lineages added in pango-designation v1.20 such as EG.2 and DV.6. But when I search for lineages added in pango-designation v1.21 such as XBB.1.5.70 or XCD, there are no results.
Ah, for some very old sequences like EPI_ISL_500000 (submitted 2020-07-28), the pangolin-data version is stated as v1.20:
-- OK, that's because England/LIVE-A263B/2020 (EPI_ISL_500000) was manually designated as B.1.1.253 in pango-designation/lineages.csv, and the PANGO-v1.20 version tag is propagated through from the pangolin output. But for sequences that have not been manually designated and are instead apparently called using usher, instead of using the PUSHER-v1.20 version tag from pangolin output, the description is "consensus call" with no data version:
It's definitely a sign of progress that GISAID seems to have re-run the latest version of pangolin and almost-latest version of pangolin-data on all sequences. Hopefully that means they will start using the latest version of pangolin-data soon and get all caught up. It will be even better if they update the "consensus call" description to include at least the v1.{...} portion of the version tag from pangolin output.
I've noticed that the proportion of
BA.2* + S:452R, S:486V, S:493.
in South Africa (and elsewhere) has been rising recently.These should really be BA.4/5 but are misassigned, either by pangoLEARN or Scorpio.
It's as much as 25% of South African sequences.
https://cov-spectrum.org/explore/South%20Africa/AllSamples/Past3M/variants?aaMutations=S%3A452R%2CS%3A486V%2CS%3A493.&pangoLineage=BA.2*&
Example sequences: