This PR adds a simple tool (currently filed in exploration) to visualize lineages over time from the NextStrain data without modeling.
It can make plots like the one below, restricted to a particular time range (here 2022) and excluding lineages that never rise above some percentage threshold (here 10%).
I've eschewed stacked charts so it's a bit easier to see what happens to any particular lineage, because the point of this is for us to choose parts of the life cycle of a lineage to model.
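For reviewers who want the filtering logic spelled out, here is a minimal sketch in plain Python. It is not the code in this PR: the function names, the `(date, lineage)` record shape, and the monthly binning are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def lineage_shares(records, start, end):
    """Share of each lineage per (year, month), for dates in [start, end].

    `records` is an iterable of (collection_date, lineage) pairs.
    Monthly bins are an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for day, lineage in records:
        if start <= day <= end:
            counts[(day.year, day.month)][lineage] += 1
    return {
        period: {lin: n / sum(c.values()) for lin, n in c.items()}
        for period, c in counts.items()
    }


def lineages_above(shares, min_peak=0.10):
    """Lineages whose share rises above `min_peak` in at least one period."""
    peak = defaultdict(float)
    for by_lineage in shares.values():
        for lin, share in by_lineage.items():
            peak[lin] = max(peak[lin], share)
    return {lin for lin, p in peak.items() if p > min_peak}


# Hypothetical toy data: "B" never exceeds 10%, "D" falls outside 2022.
records = (
    [(date(2022, 1, 5), "A")] * 19
    + [(date(2022, 1, 9), "B")]
    + [(date(2022, 2, 3), "A")] * 10
    + [(date(2022, 2, 7), "C")] * 10
    + [(date(2021, 12, 30), "D")] * 5
)
shares = lineage_shares(records, date(2022, 1, 1), date(2022, 12, 31))
keep = lineages_above(shares, min_peak=0.10)
```

Filtering on the peak share rather than the overall share keeps lineages that were briefly prominent, which matters if the goal is to pick out parts of a lineage's life cycle.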
Out-of-scope additions
In making this, I found that some sequences are assigned impossible clades, such as a sequence from 2020 being assigned to 24A. Among all data with valid dates and valid clades, this appears to happen <1% of the time.
I have thus added linmod.data.with_bad_ns_assign(), a function that adds a column, impossible, to a polars dataframe, flagging whether each lineage assignment is impossible.
I have also plugged this into our filtering in linmod.data.main.
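To make the idea of an "impossible" assignment concrete, here is a minimal per-sequence sketch in plain Python. It is not the body of `with_bad_ns_assign()`, which operates on a polars dataframe. The helper names are hypothetical, and the rule itself is an assumption: it reads the leading two digits of a clade name such as 24A as a year and flags any sequence collected in an earlier calendar year.

```python
import re
from datetime import date

# Clade names such as "24A" or "21J (Delta)" start with a two-digit year.
_CLADE_YEAR = re.compile(r"^(\d{2})[A-Z]")


def clade_year(clade):
    """Four-digit year encoded in a clade name, or None if it cannot be parsed."""
    match = _CLADE_YEAR.match(clade or "")
    return 2000 + int(match.group(1)) if match else None


def is_impossible(collected, clade):
    """True if the collection date falls in a year before the clade's year.

    Missing dates or unparseable clades are treated as not impossible,
    since they are handled by separate validity filters (an assumption).
    """
    year = clade_year(clade)
    if collected is None or year is None:
        return False
    return collected.year < year


flag_2020 = is_impossible(date(2020, 6, 1), "24A")   # the example from this PR
flag_2024 = is_impossible(date(2024, 3, 1), "24A")
```

A strict calendar-year cutoff may be too aggressive in practice, because a clade's earliest sequences can predate the year in its name, so a real check might allow some slack.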