dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Store batch-partitioned feature and label history #857

Open ecsalomon opened 3 years ago

ecsalomon commented 3 years ago

Our decision not to persist features limits the flexibility and reproducibility of the system. Triage is designed for batch processing, so we could follow functional data engineering principles and store batch-partitioned feature and label data in partitioned Postgres tables, Redshift, or HDFS. Matrices could then be constructed on the fly at evaluation time, without rebuilding features and labels, which would make it much easier to re-test models on different label time periods. It would also make Rayid's preferred solution for #378 much easier to implement.
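A minimal sketch of the "construct matrices on the fly" idea, assuming features and labels are stored in long format and keyed by batch. SQLite stands in for partitioned Postgres here, and every table, column, and function name (`features`, `labels`, `label_timespan`, `build_matrix`) is hypothetical, not triage's actual schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical long-format stores; in Postgres these would be partitioned by batch_id.
conn.executescript("""
CREATE TABLE features (
    batch_id INTEGER, entity_id INTEGER, as_of_date TEXT,
    feature_name TEXT, value REAL
);
CREATE TABLE labels (
    batch_id INTEGER, entity_id INTEGER, as_of_date TEXT,
    label_timespan TEXT, label INTEGER
);
""")
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?)", [
    (1, 101, "2020-01-01", "events_1y_count", 3.0),
    (1, 101, "2020-01-01", "events_1y_max", 7.0),
    (1, 102, "2020-01-01", "events_1y_count", 0.0),
    (1, 102, "2020-01-01", "events_1y_max", 0.0),
])
conn.executemany("INSERT INTO labels VALUES (?, ?, ?, ?, ?)", [
    (1, 101, "2020-01-01", "6 months", 1),
    (1, 101, "2020-01-01", "1 year", 1),
    (1, 102, "2020-01-01", "6 months", 0),
    (1, 102, "2020-01-01", "1 year", 1),
])


def build_matrix(conn, batch_id, as_of_date, label_timespan):
    """Join stored features to the requested labels; return {entity_id: row dict}."""
    rows = conn.execute(
        """
        SELECT f.entity_id, f.feature_name, f.value, l.label
        FROM features f
        JOIN labels l
          ON l.batch_id = f.batch_id
         AND l.entity_id = f.entity_id
         AND l.as_of_date = f.as_of_date
        WHERE f.batch_id = ? AND f.as_of_date = ? AND l.label_timespan = ?
        ORDER BY f.entity_id, f.feature_name
        """,
        (batch_id, as_of_date, label_timespan),
    ).fetchall()
    matrix = {}
    for entity_id, feature_name, value, label in rows:
        row = matrix.setdefault(entity_id, {"label": label})
        row[feature_name] = value
    return matrix


# Same stored features, two different label periods, no feature rebuild.
six_month = build_matrix(conn, 1, "2020-01-01", "6 months")
one_year = build_matrix(conn, 1, "2020-01-01", "1 year")
```

The point of the sketch is only that switching the label time period becomes a change to a query parameter rather than a rerun of feature generation.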

Connecting this to #368: if we versioned features on a hash of the query logic, aggregation function, aggregation time period, imputation method, etc., we could track how changes in feature definitions between experiments shift the distributions of features, and monitor how feature distributions for the same feature definition change over time (throwing warnings or errors if, e.g., the variance of a feature dropped dramatically between batches). Currently, changes to from_obj logic are hidden because they affect the experiment hash but not the feature names.
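As a rough illustration of both ideas, here is a sketch that hashes a feature definition into a version string and flags a large variance drop between two batches. The dictionary keys, the 50% threshold, and the function names `feature_version` and `variance_dropped` are all assumptions made for the example, not existing triage code.

```python
import hashlib
import json
import statistics
import warnings


def feature_version(definition: dict) -> str:
    """Return a short, stable hash of a feature definition."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def variance_dropped(previous, current, max_drop=0.5):
    """Warn and return True if variance fell by more than max_drop between batches."""
    prev_var = statistics.pvariance(previous)
    curr_var = statistics.pvariance(current)
    if prev_var == 0:
        return False  # nothing to drop from
    dropped = curr_var < (1 - max_drop) * prev_var
    if dropped:
        warnings.warn(
            f"feature variance fell from {prev_var:.3f} to {curr_var:.3f}"
        )
    return dropped


# Hypothetical definition; the keys mirror the components named above.
definition = {
    "from_obj": "SELECT entity_id, event_date FROM events",
    "aggregate": "count",
    "interval": "1 year",
    "imputation": "zero",
}
# A from_obj change produces a new version even though the feature name is unchanged.
changed = dict(
    definition,
    from_obj="SELECT entity_id, event_date FROM events WHERE valid",
)
```

Storing the version alongside each batch of feature values would let `variance_dropped` be applied only to batches that share a definition, so a shift caused by a definition change is not mistaken for drift in the source data.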

There are some complications to this approach based on how the group and triage typically operate. Data are received and processed in batches from partners, but the definition of a batch in triage is more closely tied to the experiment and experiment run. Storing every experiment or experiment run as a new batch is likely overly redundant. If you change the label definition, do you really need to create a new batch for all of the features? No, but if you rerun the same experiment on new source data, you do. We could include the hash of the experiment components (e.g., label definition) in the batch definition, but triage has no good way of knowing which batch the source data are at, so it would have no good basis for deciding when to create a new batch for the same configuration.

A couple of alternatives for this:

In either case, batch_id is added as metadata to experiment_runs, and a batch_metadata table is introduced, potentially subsuming some of the concepts from experiment_runs and/or experiments.
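A sketch of how that relationship might look, using SQLite in place of Postgres. Only the names `batch_id`, `batch_metadata`, and `experiment_runs` come from the proposal; every other column (`source_data_received`, `run_id`, `experiment_hash`) is a placeholder assumption, and the real experiment_runs table has more fields than shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batch_metadata (
    batch_id INTEGER PRIMARY KEY,
    source_data_received TEXT NOT NULL  -- hypothetical: when the partner data arrived
);
CREATE TABLE experiment_runs (
    run_id INTEGER PRIMARY KEY,          -- hypothetical
    experiment_hash TEXT NOT NULL,       -- hypothetical
    batch_id INTEGER NOT NULL REFERENCES batch_metadata (batch_id)
);
""")

with conn:
    conn.execute("INSERT INTO batch_metadata VALUES (1, '2021-08-01')")
    # Two experiments (e.g., different label definitions) share one batch of source data.
    conn.executemany(
        "INSERT INTO experiment_runs VALUES (?, ?, ?)",
        [(1, "abc123", 1), (2, "def456", 1)],
    )

runs_on_batch = conn.execute(
    "SELECT run_id, experiment_hash FROM experiment_runs "
    "WHERE batch_id = ? ORDER BY run_id",
    (1,),
).fetchall()
```

Here a changed label definition yields a new experiment hash but reuses the same batch, which is the redundancy the proposal is trying to avoid.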

ecsalomon commented 3 years ago

What happens to the replace flag under this paradigm? replace indicates that there was an upstream error in the batch process (e.g., an error in cleaning, or PII leakage) and the entire batch (features, labels) and all dependencies (models, evaluations) should be replaced. This is the only time that data should be dropped/updated.
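One way to picture those semantics is a cascade from the batch record to everything derived from it. The sketch below uses SQLite foreign keys with `ON DELETE CASCADE`; the table layout and the `replace_batch` helper are hypothetical and only illustrate the "drop the batch and all its dependencies" behavior described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces cascades when enabled
conn.executescript("""
CREATE TABLE batch_metadata (batch_id INTEGER PRIMARY KEY);
CREATE TABLE features (
    batch_id INTEGER NOT NULL
        REFERENCES batch_metadata (batch_id) ON DELETE CASCADE,
    entity_id INTEGER NOT NULL,
    feature_name TEXT NOT NULL,
    value REAL
);
CREATE TABLE models (
    model_id INTEGER PRIMARY KEY,
    batch_id INTEGER NOT NULL
        REFERENCES batch_metadata (batch_id) ON DELETE CASCADE
);
CREATE TABLE evaluations (
    model_id INTEGER NOT NULL
        REFERENCES models (model_id) ON DELETE CASCADE,
    metric TEXT NOT NULL,
    value REAL
);
""")

with conn:
    conn.executemany("INSERT INTO batch_metadata VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?, ?)",
        [(1, 101, "events_1y_count", 3.0), (2, 101, "events_1y_count", 4.0)],
    )
    conn.executemany("INSERT INTO models VALUES (?, ?)", [(10, 1), (20, 2)])
    conn.executemany(
        "INSERT INTO evaluations VALUES (?, ?, ?)",
        [(10, "precision@100", 0.4), (20, "precision@100", 0.5)],
    )


def replace_batch(conn, batch_id):
    """Drop a bad batch and, via cascades, its features, models, and evaluations."""
    with conn:  # one transaction: either everything goes or nothing does
        conn.execute("DELETE FROM batch_metadata WHERE batch_id = ?", (batch_id,))
    # The corrected batch would be reloaded from the fixed upstream data afterwards.


replace_batch(conn, 1)
remaining_feature_batches = [
    row[0] for row in conn.execute("SELECT DISTINCT batch_id FROM features")
]
remaining_models = [
    row[0] for row in conn.execute("SELECT model_id FROM models ORDER BY model_id")
]
```

Labels would hang off the batch in the same way as features; they are omitted to keep the example short.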

hunterowens commented 3 years ago

Chiming in from beyond the DSSG alum past, but one thing I have done to solve this pattern in my work is rely on Ibis, which is basically: what if SQLAlchemy, but pandas/data focused? It supports all manner of backends (including in-memory pandas DataFrames) and scales to Redshift/Postgres/BigQuery etc.

