cov-lineages / pangoLEARN

Store of the trained model for pangolin to access.
GNU General Public License v3.0

5% of BA.1.1 carrying the R346K mutation are classified as BA.1 #80

Open wodanaz opened 2 years ago

wodanaz commented 2 years ago

Hi folks,

I posted this issue in pangolin and realized that it was the wrong place. Sorry about that.

In our genomic surveillance data, we have consistently found that about 5% of samples with high-quality sequencing appear to be misclassified as BA.1, even when using the most recent versions of pangolin and pangoLEARN. In addition, almost all of the samples misclassified as BA.1 have a phylogenetic backbone that corresponds to BA.1.1, and they have high call quality for the R346K mutation. Given the medical importance of that mutation (R346K), I hope this phenomenon can be reviewed.

Example of read depth and quality in one sample

image

and Phylogeny:

and an example of the tree placement of the samples assigned to BA.1 (these samples are confirmed to have the R346K mutation):

image

Thank you so much, and keep up the great work.

Warmly,

Alejandro Berrio, Duke University

corneliusroemer commented 2 years ago

It's a pangoLEARN issue, not due to wrong designations (which this repo [pango-designation, where it was originally posted] is for).

So I'll transfer it to pangoLEARN.

It's known that pangoLEARN can be wrong in ways that don't really make sense to humans - maybe the decision tree is overfitted.

The standard recommendation is to use Usher mode, which you can enable by appending --usher to your CLI run, like pangolin --usher input.fasta. Usher should have a much lower misclassification rate.
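For anyone scripting this, here is a minimal Python sketch of the same invocation. It only uses the `--usher` flag mentioned above; the helper names `build_pangolin_command` and `run_pangolin` are hypothetical, and the run is skipped if pangolin is not on your PATH.

```python
import shutil
import subprocess


def build_pangolin_command(fasta_path, use_usher=True):
    """Return the argument list for a pangolin run on fasta_path (hypothetical helper)."""
    command = ["pangolin"]
    if use_usher:
        # --usher is the flag named in this thread for pangolin v3.
        command.append("--usher")
    command.append(fasta_path)
    return command


def run_pangolin(fasta_path, use_usher=True):
    """Run pangolin if it is installed; return its exit code, or None if it is absent."""
    if shutil.which("pangolin") is None:
        return None
    completed = subprocess.run(build_pangolin_command(fasta_path, use_usher))
    return completed.returncode


# Show the command that would be executed for the example file name.
print(" ".join(build_pangolin_command("input.fasta")))
```

Calling `run_pangolin("input.fasta")` would then execute that command in the current environment.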

You could also try Nextclade's pango classifier; it should likewise have no trouble classifying these sequences as BA.1.1.X.

Soon, pangolin v4 will be released, which uses Usher mode by default, so you won't even have to append --usher anymore.

I hope this helps.

wodanaz commented 2 years ago

Fantastic, thank you!

aineniamh commented 2 years ago

I believe adding more representative sequences to the designations will resolve the pangoLEARN model error; that's how pangoLEARN issues are usually resolved. The decision tree is definitely overfit, and the more informative training data it gets, the better it does, which is why I suggested porting the issue over to designation. pangolin 4.0 has a new model, a random forest, which should be less overfit and give more interpretable confidence scores too. Hopefully the usher mode will resolve all of this anyway, as pangolin 4.0 has just been released.

FYI @corneliusroemer, just for future reference: it's not an issue for the pangolin repo, as it's not related to the pangolin software. I appreciate your great explanation above though!

@wodanaz if you still see issues with the assignments with the latest pangolin version, modes and models, please let me know and I'll get to the bottom of it! If there's something particularly tricksy going on, we might need to add a constellation definition.
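One way to check this after re-running is sketched below in Python: it takes a lineage report and a list of samples confirmed to carry R346K, then lists those still assigned to plain BA.1. It assumes the report is a CSV with `taxon` and `lineage` columns; the function name `find_suspect_assignments` and all sample IDs are hypothetical, made up for illustration.

```python
import csv
import io


def find_suspect_assignments(report_csv_text, r346k_samples):
    """Return (suspects, fraction) for R346K-confirmed samples assigned plain BA.1."""
    reader = csv.DictReader(io.StringIO(report_csv_text))
    lineage_by_taxon = {row["taxon"]: row["lineage"] for row in reader}
    # Only consider confirmed samples that actually appear in the report.
    checked = [s for s in r346k_samples if s in lineage_by_taxon]
    suspects = [s for s in checked if lineage_by_taxon[s] == "BA.1"]
    fraction = len(suspects) / len(checked) if checked else 0.0
    return suspects, fraction


# Made-up report and sample IDs, for illustration only.
example_report = """taxon,lineage
sample_A,BA.1.1
sample_B,BA.1
sample_C,BA.1.1
sample_D,BA.1
"""
confirmed = ["sample_A", "sample_B", "sample_C"]
suspects, fraction = find_suspect_assignments(example_report, confirmed)
print(suspects, round(fraction, 2))
```

Here sample_D is ignored because it is not in the confirmed R346K list, so only sample_B is flagged.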

corneliusroemer commented 2 years ago

@aineniamh you're right, I reworded slightly. It was originally in pango-designation but it belongs only here in pangoLEARN, not pangolin.