cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0
427 stars 107 forks source link

Probability values #142

Closed noranekonobokkusu closed 2 years ago

noranekonobokkusu commented 3 years ago

Hello,

I've just noticed all the reported class probabilities are equal to 1.0 in my run for 220k genomes from GISAID. In earlier releases of pangolin, I was getting some probabilities other than 1.0 for some samples so now I am concerned a bit. I've downloaded my dataset from GISAID in early January and masked a highly-homoplasic 11,083 position, but still, I was expecting to have in my dataset samples with a conflicting set of mutations, rare enough not to found any new lineage. Is it still an expected behavior to get all 1.0?

Thank you!

aineniamh commented 3 years ago

Hi @noranekonobokkusu, not expected to get all 1.0 values! @emilyscher has been working on this and can maybe explain a little where the values are coming from!

emilyscher commented 3 years ago

Hi @noranekonobokkusu, its possible, but not certain, that this is expected behaviour. The probability scores for decision trees aren't hugely useful. They represent the fraction of samples of a particular class which are in the same leaf. If the model only knows of one class that could possibly be the right fit for a sample, then the score will be 1.

Also, the model only looks at certain places in the genome, specifically those which have been found to have a SNP in any of the training data sequences. It's possible that some of your mutations are therefore not being looked at currently, but would hopefully be taken into account next time there is an updated model.

However, it is odd that you've seen sequences which used to have scores <1 which are now getting 1s. This could well be expected behaviour, but if you could send me an example sequence for which this has happened I can investigate further.

noranekonobokkusu commented 3 years ago

Hi @emilyscher and @aineniamh, thank you for the reply! I looked into samples that had <1.0 probabilities in earlier releases and it seems those were samples that got either reassigned to a more specific lineage (like EPI_ISL_482341 from B.1 to B.1.161) or got consolidated in the same lineage (like EPI_ISL_508878 in B.1 or EPI_ISL_428856 in B.6). The older results come from 2020-08-29 pangolearn version. The majority of samples still had 1.0 scores so it's indeed might be not as surprising.

Still, looking at results produced on my dataset by the current release (mine is 2021-01-06) I noticed a couple of things.

  1. some of the sequences that initially belonged to the A lineage are now classified as B (e.g. EPI_ISL_454974 or EPI_ISL_451365). I looked into decision tree rules and noticed that position 28144 is not used, while it determines GISAID clade S and (i thought so) PANGOLIN lineage A. Is it intended? I'm asking because these reclassified sequences group phylogenetically with lineage A sequences and an older classification was consistent with that.

  2. I am particularly interested in Russian sequences. They were previously mostly classified as B.1.1, and that made sense - we have relatively low coverage of the virus population that was introduced at various timepoints early in the pandemics and as it seems evolved unnoticed, and GISAID for a long time didn't have enough data from Russia for Russian samples to found Russian-specific B.1.1 descendant lineages. Now more than a thousand Russian samples in my dataset (a half of them available from GISAID already) are classified as B.1.1.74 although they are pretty diverse. I looked into the rules file again and found five definitions for B.1.1.74. Of them, I made a list of variants present in all five definitions and counted occurences of those variants in Russian B.1.1.74 sequences (I only considered "==" definitions). These are results:

    ['10452', "'G'"] 0
    ['1209', "'C'"] 0
    ['12502', "'G'"] 0
    ['15932', "'A'"] 1095
    ['15971', "'-'"] 0
    ['16888', "'-'"] 0
    ['18743', "'A'"] 840
    ['19169', "'A'"] 0
    ['19649', "'-'"] 0
    ['19892', "'G'"] 0
    ['21303', "'A'"] 1050
    ['2557', "'A'"] 1095
    ['27318', "'G'"] 0
    ['28882', "'A'"] 1095
    ['5054', "'A'"] 1093
    ['5699', "'A'"] 0
    ['8658', "'-'"] 0
    

    All of the variants present in Russian sequences are reference variants except for 28882 which defines B.1.1 (phylogenetically). Does it mean all those sequences are still closer to B.1.1.74 definitions than to any of B.1.1 definitions? Might it be that something happened with B.1.1 definitions? (there are 228 of them but even 28882=='A' is present in 13 definitions only)

Thanks a lot!

noranekonobokkusu commented 3 years ago

Hi @emilyscher and @aineniamh,

I just wanted to ask whether any changes are expected anytime soon in the way the lineages I've mentioned are classified. I'm very sorry for being persistent, I'd just like to understand how to proceed with these lineages in our project. I can in theory roll back to one of the previous releases but because the current results looked suspicious I thought it's something that might be fixed in newer releases.

Thanks!

aineniamh commented 3 years ago

Hi @noranekonobokkusu, we've recently had an update to pangolin and now have multiple inference engine options. The decision tree model is pretty fragile and can be sensistive to missing data both in the input training set and also in the query sequences. The new release now has scorpio outputs which are explicit SNP based calls for specific variants of interest, and we also now have an --usher mode that uses maximum parsimony to assign lineages. The values we're outputting now also reflect 'conflict' rather than saying probability which is misleading. We also output values that reflect the number of ambiguous sites used in the decision (Full details on the output columns here: https://cov-lineages.org/pangolin_docs/output.html).

Sorry for the delay, this version has been in the works for some time now and we were just trying to pull everything together. Updates still to come are for pangoLEARN @emilyscher has been working on a new model, a hierarchical logistic regression that should also be less fragile and provide a more interpretable probability score output.