CDCgov / cfa-viral-lineage-model


Simple history visualization #32

Closed afmagee42 closed 3 months ago

afmagee42 commented 3 months ago

This PR adds a simple tool (currently filed under exploration) that visualizes lineages over time directly from the Nextstrain data, without any modeling.

It can make plots like the one below, where we focus on a particular time range (here 2022) and filter out lineages that never exceed some percentage threshold (here 10%).

[Figure: whole_history_0 5]
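As a rough, dependency-free sketch of the time-range and threshold filtering described above: the tool in this PR works on the real Nextstrain data, and every name below (lineage_shares, keep_prominent, the toy records) is hypothetical rather than taken from the repository. It also assumes the percentage is a lineage's share of all records on a given day, which may not match how the tool bins time.

```python
from collections import defaultdict
from datetime import date


def lineage_shares(records, start, end):
    """Return {lineage: {day: share}} for (day, lineage) records within [start, end]."""
    counts = defaultdict(lambda: defaultdict(int))  # day -> lineage -> count
    for day, lineage in records:
        if start <= day <= end:
            counts[day][lineage] += 1

    shares = defaultdict(dict)
    for day, by_lineage in counts.items():
        total = sum(by_lineage.values())
        for lineage, n in by_lineage.items():
            shares[lineage][day] = n / total
    return dict(shares)


def keep_prominent(shares, threshold):
    """Keep only lineages whose share exceeds `threshold` on at least one day."""
    return {lin: s for lin, s in shares.items() if max(s.values()) > threshold}


# Toy data: one day inside the 2022 window, one record outside it.
day = date(2022, 6, 1)
records = (
    [(day, "A")] * 15
    + [(day, "B")] * 4
    + [(day, "C")] * 1
    + [(date(2021, 6, 1), "D")]  # outside the window, ignored
)

shares = lineage_shares(records, date(2022, 1, 1), date(2022, 12, 31))
prominent = keep_prominent(shares, threshold=0.10)
print(sorted(prominent))
```

Each lineage that survives the filter would then be drawn as its own line rather than as a band in a stacked chart.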

I've eschewed stacked charts so it's a bit easier to see what happens to any particular lineage, because the point of this is for us to choose parts of the life cycle of a lineage to model.

Out-of-scope additions

In making this, I found that some sequences are assigned impossible clades, such as a sequence from 2020 being assigned to 24A. Among all data with valid dates and valid clades, this appears to happen <1% of the time.

I have therefore added linmod.data.with_bad_ns_assign(), a function that adds a column named impossible to a polars dataframe, indicating whether each lineage assignment is impossible.
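To illustrate the kind of check involved, here is a minimal plain-Python sketch. It is not the implementation of with_bad_ns_assign(), which operates on a polars dataframe; the helper names are hypothetical. It assumes that a year-letter clade name such as 24A encodes the year the clade was named, and that a sequence collected in an earlier calendar year therefore cannot belong to it. The real rule may be stricter or looser than this.

```python
import re
from datetime import date


def clade_year(clade):
    """Parse the year from a year-letter clade name like '24A'; None if the name does not fit."""
    match = re.match(r"(\d{2})[A-Z]", clade)
    return 2000 + int(match.group(1)) if match else None


def is_impossible(collection_date, clade):
    """Assumed rule: a sequence collected before the clade's named year is impossible."""
    year = clade_year(clade)
    return year is not None and collection_date.year < year


# Toy rows of (collection date, clade); not real data.
rows = [
    (date(2020, 5, 1), "24A"),          # collected years before the clade's year
    (date(2024, 3, 1), "24A"),          # plausible
    (date(2020, 5, 1), "20A"),          # plausible
    (date(2020, 5, 1), "recombinant"),  # no year prefix, so not flagged
]

flags = [is_impossible(d, c) for d, c in rows]
print(flags)
```

Keeping the result as a separate flag, rather than dropping rows immediately, leaves the decision about what to do with flagged sequences to the downstream filtering step.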

I have also plugged this into our filtering in linmod.data.main.