cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0

Remove max_count/max_lineage 'voting' logic from usher_parsing #521

Closed AngieHinrichs closed 1 year ago

AngieHinrichs commented 1 year ago

Finally getting around to something I've been meaning to do since #492: removing the logic that overrides usher's tie-breaker with the plurality of lineage placements when a sequence has multiple placements in different lineages. For example, usher might find 3 equally parsimonious placements (EPPs), one in BA.5 and two in BA.5.2. Initially I thought that would mean the sequence is more likely to belong in BA.5.2. But with amplicon dropout problems increasing over time, sometimes it simply means that the sequence has Ns in positions that allow it to be placed in different parts of BA.5.2, even if it doesn't necessarily have the BA.5.2-defining mutation. The more uncertain the placement is, the more speculative the "voting" is, and the better usher's tie-breaker (which I think favors the branch with more descendants, usually the more basal branch) seems to do.
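To make the difference concrete, here is a minimal sketch of the two strategies using the BA.5 / BA.5.2 example above. The function names and the flat list of per-placement lineages are hypothetical and are not taken from pangolin's `usher_parsing` code; only the `max_count` / `max_lineage` names echo the PR title.

```python
from collections import Counter


def assign_by_plurality(placement_lineages):
    """Old behavior (sketch): pick the lineage with the most EPPs."""
    counts = Counter(placement_lineages)
    max_lineage, max_count = counts.most_common(1)[0]
    return max_lineage


def assign_by_usher_choice(usher_lineage, placement_lineages):
    """New behavior (sketch): keep usher's own tie-breaker result.

    The EPP lineages are accepted only so both functions can be called
    with the same data; they no longer influence the assignment.
    """
    return usher_lineage


# Hypothetical input: one EPP in BA.5, two in BA.5.2, and usher's
# tie-breaker assumed to have chosen the more basal BA.5.
placements = ["BA.5", "BA.5.2", "BA.5.2"]
usher_choice = "BA.5"

old_call = assign_by_plurality(placements)
new_call = assign_by_usher_choice(usher_choice, placements)
print(old_call, new_call)
```

With this input the plurality vote picks BA.5.2, while deferring to usher keeps BA.5, which is the kind of change this PR is expected to produce.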

I tested this on GISAID seqs with IDs in the range EPI_ISL_15340000-15349999 and it behaved as expected, leaving most assignments unchanged but no longer assigning the lineage with the most EPPs in several cases.
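A before/after check of this kind could be sketched roughly as below. This is not the script used for the test above; it assumes two pangolin CSV reports (one from each version) that both have `taxon` and `lineage` columns, and the sequence names and lineages in the sample data are made up for illustration.

```python
import csv
import io


def changed_assignments(old_report, new_report):
    """Return {taxon: (old_lineage, new_lineage)} where the lineage differs."""
    old = {row["taxon"]: row["lineage"] for row in csv.DictReader(old_report)}
    changed = {}
    for row in csv.DictReader(new_report):
        before = old.get(row["taxon"])
        # Only report sequences present in both reports with different calls.
        if before is not None and before != row["lineage"]:
            changed[row["taxon"]] = (before, row["lineage"])
    return changed


# Made-up rows standing in for real report files.
old_csv = io.StringIO("taxon,lineage\nseq1,BA.5.2\nseq2,BA.5.1\n")
new_csv = io.StringIO("taxon,lineage\nseq1,BA.5\nseq2,BA.5.1\n")

diff = changed_assignments(old_csv, new_csv)
print(diff)
```

For real reports, open file handles can be passed in place of the `io.StringIO` objects.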

@rmcolq feel free to review the changes or not depending on time / interest. I will merge it in a couple days if I don't hear otherwise.

After this is merged, may I tag a pre-release?

If the next pangolin-data release does not include the pangoLEARN *.joblib files then it will require pangolin v4.3, so I think it would be better to release pangolin v4.3 at least a day before the next pangolin-data release (which is still probably at least a week away). I don't anticipate any problems from using pangolin v4.3 with the current release of pangolin-data.