cov-lineages / pangolin-data

Repository for storing latest model, protobuf, designation hash and alias files for pangolin assignments
GNU General Public License v3.0

Overclassification of BA.5 in pangolin 4 #7

Open donutbrew opened 2 years ago

donutbrew commented 2 years ago

(Not sure if this is the right issue tracker for this, so please direct me to the right place if not)

Currently, a large number of sequences that Pangolin 4.0.5 (data 1.3, --skip-scorpio) classified as BA.5 are missing both S:L452R and S:F486V (194 BA.5 accessions with Wuhan alleles attached).

Removing --skip-scorpio pushes 111 of these into BA.2, 24 into BA.3, 1 into BA.1, and 44 into Unassigned; 14 remain BA.5.

What is going on here? I'm now a little confused as to whether --skip-scorpio should be the default behavior or not. Happy to have some discussion.

accessions.txt

@AngieHinrichs

AngieHinrichs commented 2 years ago

Thanks for reporting this @donutbrew! Short version:

  1. It gets better with the v1.6 release of pangolin-data (and pangolin-assignment if you're using the assignment cache feature), which came out earlier today -- please try again after pangolin --update-data.
  2. It gets better still with a minor bugfix to usher - not bioconda-available yet, but I can share a Linux binary if you would like to try it.
  3. --skip-scorpio is described in the --help message as a 'developer option'. As I understand it, it is intended for dev & testing (or I suppose possible short-term use in case a problem has been identified with scorpio/constellations, but a fix has not yet been released). I believe it's better in general to go with the default behavior of letting Scorpio override the inference method when there's disagreement because Scorpio is checking for specific mutations.

Longer version to follow.

corneliusroemer commented 2 years ago

@donutbrew can you share the complete pangolin command you're running?

And the output of pangolin --all-versions? Thanks!

AngieHinrichs commented 2 years ago

pango-designation is the best repository for reporting pangoLEARN assignment problems because we normally address those by adding more designated sequences to pango-designation/lineages.csv (or changing the designations in the file).

However, problems with UShER analysis mode are usually not directly caused or remedied by the specific sequences in pango-designation/lineages.csv. Usually they're caused by my processes that update the UCSC/UShER tree and distill it down to the minimal tree distributed via the pangolin-data repo. There's not a proper repository for those (just a messy bucket of scripts and unpublished notes). Rarely there may also be an issue with usher or pangolin, but I think it'll almost always be a data problem.

pango-designation has many watchers, and I think they are mostly interested in new lineage proposals so it would be nice to limit other traffic. I propose using the pangolin-data repository for reporting UShER mode assignment problems because that's where pangolin gets the UShER tree, and where an updated tree can hopefully address the problems. I will transfer this issue to the pangolin-data repo.

AngieHinrichs commented 2 years ago

using the latest data

If you're looking for BA.5 sequences specifically then I suggest you use pangolin's assignment cache mode because I used the very latest fixed usher to generate it. (Also, if you are running pangolin on thousands of sequences, the assignment cache makes it faster.) To add the assignment cache to your installation of pangolin, run

pangolin --add-assignment-cache

Again, I recommend also running pangolin --update or pangolin --update-data to get the v1.6 assignment cache released today. Then, to use the assignment cache, run pangolin with the --use-assignment-cache flag:

pangolin --use-assignment-cache input.fasta ...

@aineniamh has recently added BA.4 and BA.5 to scorpio/constellations; running pangolin --update will make sure that you have the latest version of scorpio as well.

usher

The minor bug in usher caused it to sometimes assign the lineage of a node when the sequence had almost but not quite all of the node's mutations. In the UCSC/UShER tree, BA.5 is placed on a long branch from BA.2. So as you observed, many sequences that were really more like BA.2 than BA.5 could be assigned BA.5 despite not having quite all of its mutations.

That bug has been fixed in the latest usher source code, but there has not yet been a new release (and after a new release, there is also a short delay before it becomes available on bioconda). If you are running on Linux and would like to try my updated usher binary, you can try it like this:

conda activate pangolin
# Download the patched usher binary and make it executable
curl -O https://hgwdev.gi.ucsc.edu/~angie/usher.c7117a
chmod a+x usher.c7117a
# Back up the installed usher, then put the patched binary in its place
mv "$CONDA_PREFIX/bin/usher" "$CONDA_PREFIX/bin/usher.bak"
mv usher.c7117a "$CONDA_PREFIX/bin/usher"

your sequences

I was able to find the hashes for 181 of the 194 IDs in local assignment cache files for v1.6, computed before and after the minor bug fix to usher. Here is a 3-column tab-separated file with those sequences' names/IDs, v1.6 pre-bugfix assignment, and v1.6 post-bugfix assignment: nameToLin.v1.6.beforeAfterUsherFix.txt

Here are the counts of each lineage assigned before the usher bugfix:

     97 BA.2
     81 BA.5
      1 BA.2.3
      1 BA.2.10
      1 B.1.1.529

-- so with v1.6 data and without the bugfix, 81 are still assigned BA.5, but at least that's better than 181. :)

Here are the counts of each lineage assigned after the usher bugfix:

    178 BA.2
      1 BA.5
      1 BA.2.3
      1 B.1.1.529
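
For anyone who wants to reproduce tallies like the two above from the 3-column file, here is a minimal Python sketch. It assumes the file is tab-separated with no header row and with the columns in the order described (name, pre-bugfix lineage, post-bugfix lineage); the function name `tally_before_after` is made up for this example.

```python
import csv
from collections import Counter


def tally_before_after(lines):
    """Count lineages in columns 2 and 3 of a 3-column tab-separated input.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: no header row; columns are name, before, after.
    """
    before, after = Counter(), Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        _name, pre, post = row[:3]
        before[pre] += 1
        after[post] += 1
    return before, after
```

Usage would be along the lines of `with open("nameToLin.v1.6.beforeAfterUsherFix.txt", newline="") as fh: before, after = tally_before_after(fh)`, followed by `before.most_common()` to list the lineages from most to least frequent.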

The lone sequence still assigned BA.5 is SouthAfrica/CERI-KRISP-K038411/2022 (EPI_ISL_11621351). Nextclade also calls that BA.5 but with 16 reversions (including T22917G/S:L452R and T23018G/S:F486V) -- that's a lot. I exclude any sequence from the big UCSC tree if Nextclade assigns it an Omicron lineage but reports more than 5 reversions for it. I guess although the sequence fits poorly with the BA.5 node, it fits better there than at any other node in the Nextclade and minimal UCSC/UShER trees.
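
As a rough way to screen sequences for the two spike markers mentioned above, something like the following Python sketch could work. It is only an illustration, not what pangolin, scorpio or Nextclade do internally. It assumes each sequence has already been aligned to the Wuhan-Hu-1 reference so that string index 0 corresponds to genome position 1 and insertions have been removed; the names `BA5_SPIKE_MARKERS` and `missing_markers` are made up for this example.

```python
from __future__ import annotations

# 1-based genome positions and derived bases for the two markers,
# taken from the nucleotide changes named above (T22917G, T23018G).
BA5_SPIKE_MARKERS = {
    "S:L452R": (22917, "G"),
    "S:F486V": (23018, "G"),
}


def missing_markers(aligned_seq: str) -> list[str]:
    """Return the markers whose derived base is absent from the sequence.

    Assumption: aligned_seq is in reference coordinates (index 0 is
    position 1). A gap, an N, or a sequence too short to cover the site
    is reported as missing, so low-coverage genomes are flagged too.
    """
    missing = []
    for name, (pos, derived) in BA5_SPIKE_MARKERS.items():
        base = aligned_seq[pos - 1].upper() if len(aligned_seq) >= pos else "N"
        if base != derived:
            missing.append(name)
    return missing
```

A sequence called BA.5 for which `missing_markers` returns both names would be one of the suspicious cases discussed in this issue.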

donutbrew commented 2 years ago

Thanks Angie (I somehow missed this repo, so thanks also for redirecting)

To follow up @corneliusroemer here is the version output:

pangolin: 4.0.5
pangolin-data: 1.3
constellations: v0.1.7
scorpio: 0.3.17
pangolin-assignment: v1.3

I've run pangolin in several ways. My understanding was that running with --skip-scorpio was more or less the way forward, in terms of having a common language between pangolin users.

pangolin --skip-scorpio --outfile skip-scorpio_asn.csv seq.fasta
      1 B.1.1.529
      1 BA.2
    192 BA.5

pangolin --outfile default_asn.csv seq.fasta
      1 BA.1
    111 BA.2
     24 BA.3
     14 BA.5
     44 Unassigned

pangolin --use-assignment-cache --skip-scorpio --outfile use-cache_asn.csv seq.fasta
    194 BA.5

pangolin --use-assignment-cache --outfile use-cache-noskip_asn.csv seq.fasta
      1 BA.1
    111 BA.2
     24 BA.3
     14 BA.5
     44 Unassigned
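
To see which sequences move between lineages from one run to another (rather than just the totals), the two output files can be cross-tabulated. This is a small Python sketch under the assumption that each pangolin output CSV has a header row containing `taxon` and `lineage` columns; the helper names `lineage_by_taxon` and `cross_tab` are made up for this example.

```python
import csv
from collections import Counter


def lineage_by_taxon(lines):
    """Map sequence name to assigned lineage from pangolin CSV lines.

    Assumption: the header row contains 'taxon' and 'lineage' columns.
    """
    return {row["taxon"]: row["lineage"] for row in csv.DictReader(lines)}


def cross_tab(lines_a, lines_b):
    """Count (lineage in run A, lineage in run B) pairs.

    Only sequences present in both runs are counted.
    """
    run_a = lineage_by_taxon(lines_a)
    run_b = lineage_by_taxon(lines_b)
    return Counter((run_a[t], run_b[t]) for t in run_a if t in run_b)
```

For example, open skip-scorpio_asn.csv and default_asn.csv with `open(..., newline="")`, pass both handles to `cross_tab`, and print each pair with its count to see how many BA.5 calls become BA.2, BA.3, Unassigned and so on.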

aineniamh commented 2 years ago

Hi @donutbrew, we have scorpio in use to give exact SNP-threshold based assignments for VOCs specifically. I'd recommend not skipping it and am curious why you think it should be the default setting?

dbtara commented 2 years ago

@aineniamh that recommendation is my fault. An earlier version of scorpio was overwriting recombinants and a few other lineages, so --skip-scorpio was recommended. We are testing the upgraded version both ways to see how they perform.

donutbrew commented 2 years ago

Probably over-discussion and under-sleep. :)

Thanks for the explanations here. We'll test the new versions soon.

aineniamh commented 2 years ago

Ah that makes sense! The XE recombinant constellation files have been added into latest versions, but I understand now. Recombinant assignments are tricky: if scorpio does overwrite, the notes column will report what the original assignment was too. I know that's not ideal, but it at least gives a start for the moment!

corneliusroemer commented 2 years ago

I ran your sequences through Nextclade and they are almost all problematic - they all have far too many reversions. So I wouldn't really put too much weight on any lineage assignment. It's just a guess, hard to say anything definite.

So I guess, yes, Usher should maybe not have called most of these BA.5 but then this is a bit of a case of garbage in garbage out.