CDCgov / cfa-viral-lineage-model

Apache License 2.0

YAML-run pipeline, part 2 (data preprocessing) #28

Closed by thanasibakis 3 weeks ago

thanasibakis commented 4 weeks ago

This resolves #26.

Until now, we've been missing the ability to easily configure how data is preprocessed.

The linmod.data script now accepts a single optional command-line argument: the path to a YAML file configuring its behavior. Without it, the default behavior (defined in the dictionary linmod.data.DEFAULT_CONFIG) is used. The YAML file only needs to define the keys it overrides; any missing keys are populated with their default values.
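For readers who want to see the override behavior concretely, here is a minimal sketch of one way to layer a YAML file over a defaults dictionary. The keys in DEFAULT_CONFIG below, the helper names merge_config and load_config, and the choice to merge nested dictionaries recursively are all assumptions for illustration; the actual contents of linmod.data.DEFAULT_CONFIG and the script's merge logic may differ.

```python
import copy

# Hypothetical stand-in for linmod.data.DEFAULT_CONFIG; the real keys may differ.
DEFAULT_CONFIG = {
    "data": {"forecast_date": None, "horizon": {"lower": -30, "upper": 14}},
    "output": {"path": "data.csv"},
}


def merge_config(defaults, overrides):
    """Return a copy of `defaults` with values from `overrides` layered on top.

    Nested dictionaries are merged key by key (an assumption); any other
    value in `overrides` replaces the default outright.
    """
    merged = copy.deepcopy(defaults)
    # An empty YAML file parses to None, so treat None as "no overrides".
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(argv):
    """Use the defaults when no path is given; otherwise overlay the YAML file."""
    if len(argv) < 2:
        return copy.deepcopy(DEFAULT_CONFIG)
    import yaml  # PyYAML, imported lazily so the no-argument path needs no extra dependency

    with open(argv[1]) as f:
        return merge_config(DEFAULT_CONFIG, yaml.safe_load(f))
```

Under these assumptions, a file that sets only the upper horizon bound would leave every other key at its default, and calling load_config(sys.argv) with no extra argument would return an untouched copy of the defaults.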

An example is given in present-day-forecasting/config.yaml. As described in the README, this is run as python3 linmod.data config.yaml.

afmagee42 commented 4 weeks ago

While we're working on configurability, we also want to be able to configure an analysis to filter not just to data relevant to the forecast date (date <= forecast_date) but also data available before the forecast date (date_submitted <= forecast_date).

afmagee42 commented 4 weeks ago

We should also find a way to keep the forecast date around as a date, so that we can choose to plot things on a non-arbitrary-time-axis (that is, against something other than $-30 \leq t \leq 14$).
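As a rough illustration of what keeping the forecast date buys us, the sketch below maps the relative time index $t$ onto calendar dates. The function name calendar_axis, the example forecast date, and the assumption that $t$ counts whole days are hypothetical and not taken from the codebase.

```python
from datetime import date, timedelta


def calendar_axis(forecast_date, lower=-30, upper=14):
    """Map each relative day t in [lower, upper] to a calendar date.

    Assumes t is measured in whole days relative to the forecast date.
    """
    return [forecast_date + timedelta(days=t) for t in range(lower, upper + 1)]


# Hypothetical forecast date, used only for illustration.
axis = calendar_axis(date(2024, 1, 15))
```

The resulting list can be passed to a plotting library in place of the bare integer index, so the x-axis shows dates rather than offsets.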

thanasibakis commented 4 weeks ago

> We should also find a way to keep the forecast date around as a date, so that we can choose to plot things on a non-arbitrary-time-axis (that is, against something other than $-30 \leq t \leq 14$).

Done :)

thanasibakis commented 4 weeks ago

> While we're working on configurability, we also want to be able to configure an analysis to filter not just to data relevant to the forecast date (date <= forecast_date) but also data available before the forecast date (date_submitted <= forecast_date).

Done. And now that we're on this topic, I've updated the preprocessing script to give us two datasets for a given horizon [forecast_date - L, forecast_date + H]:

- An evaluation dataset with all sequences collected and reported within this horizon
- A modeling dataset with only sequences collected and reported within the subinterval [forecast_date - L, forecast_date]
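The split described above can be sketched with plain Python records. This is one reading of the description, not the preprocessing script itself: the function name split_datasets is hypothetical, L and H are assumed to be whole numbers of days, and "collected and reported within" is taken to mean that both date and date_submitted fall inside the interval.

```python
from datetime import date, timedelta


def split_datasets(records, forecast_date, L, H):
    """Return (evaluation, modeling) lists for the horizon around forecast_date.

    Each record is a dict with `date` (collection) and `date_submitted`
    (reporting) fields. L and H are assumed to be day counts.
    """
    start = forecast_date - timedelta(days=L)
    end = forecast_date + timedelta(days=H)

    def within(rec, lo, hi):
        # Both the collection and the reporting date must lie in [lo, hi].
        return lo <= rec["date"] <= hi and lo <= rec["date_submitted"] <= hi

    evaluation = [r for r in records if within(r, start, end)]
    modeling = [r for r in records if within(r, start, forecast_date)]
    return evaluation, modeling
```

Under this reading, a sequence collected before the forecast date but submitted after it lands in the evaluation dataset only, which is the point of filtering on date_submitted as well as date.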