pmccaffrey6 opened this issue 1 month ago
Thanks for writing this up! While not fully advertised just yet, there is some naive support for binning that comes close to your described example. Adding the block below to the data_usage configuration will create a new column, Binned Age, which will be available in cohort selection (without a display_name it would instead overwrite/transform Age), with the inner bin edges as listed:
```yaml
cohorts:
  - source: Age
    display_name: Binned Age
    splits:
      - 18
      - 35
      - 50
      - 65
```
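For reference, four inner edges partition Age into five bins. A minimal pure-Python sketch of that partitioning logic (hypothetical; the edge inclusivity and labels here are assumptions, not seismometer's actual implementation):

```python
import bisect

# Inner bin edges mirroring the splits in the cohorts config above;
# values below 18 and at/above 65 fall into open-ended outer bins.
SPLITS = [18, 35, 50, 65]

def bin_label(age, splits=SPLITS):
    """Return a readable label for the bin containing `age`.

    Uses bisect_right, so each inner edge starts a new bin
    (e.g. 18 lands in "18-35"); this inclusivity is an assumption.
    """
    i = bisect.bisect_right(splits, age)
    if i == 0:
        return f"<{splits[0]}"
    if i == len(splits):
        return f">={splits[-1]}"
    return f"{splits[i - 1]}-{splits[i]}"
```

For example, `bin_label(42)` yields `"35-50"`.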
As mentioned, what is there is pretty minimal, so let's keep this conversation in design mode for a bit (I don't want wasted effort on prototyping) while we pinpoint the deficiencies.
There is a surprising amount to balance, such as what fits in this package vs preprocessing ahead of data load.
We are still figuring out how to effectively communicate our plans, but we do have some ideas for improving how dataframe filtering is passed around, which may affect the implementation here as well.
Thanks a lot, @diehlbw, this is very helpful. Totally agree on the magnitude of balancing between what's in the package vs preprocessing. I would be happy to remain in a design mode on this.
I think a useful rule for deciding whether something lives in the package vs in preprocessing could be whether it would reasonably be iterated on and tweaked during evaluation itself. With age as an example, I could definitely see iterating over binning schemes as part of an evaluation task to profile performance, which feels more active than preprocessing.
Problem Summary
Currently, it appears that variables like age are expected to be cut into ranges before being processed by seismometer. However, how age is cut into ranges could itself be a tunable knob. That seems more fitting to happen within seismometer, so that direct fact tables can be supplied for predictions and events and seismometer can transform variables like age into ranges itself. These ranges could be configured in a yaml file as a data preprocessing step.
Impact
It seems best to separate input data from data processing. If input data is expected to be already transformed to some extent (such as having age broken up into ranges), that adds complexity when aggregating data from multiple sources or sites. Also, some model testing workflows would reasonably try cutting age into different sets of ranges, so this abstraction seems best suited to seismometer itself. This would make for a cleaner separation between input data and model testing (which would especially help in situations like multi-site data aggregation), and it would allow seismometer's analysis to include things like testing different age breakups.
Possible Solution
seismometer could include a small preprocessing function driven by a preprocessing yaml. This could cover simple operations such as creating new derived columns, for example transforming raw age values into age ranges.
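One possible shape for such a configuration (purely illustrative; keys like `derived_columns` are hypothetical, not an existing seismometer schema):

```yaml
preprocess:
  derived_columns:
    - source: Age
      output: AgeRange
      method: cut
      bins: [0, 18, 35, 50, 65, 120]
      labels: ["0-17", "18-34", "35-49", "50-64", "65+"]
```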
Steps to Reproduce
Load a predictions table with more than 25 distinct age values.
Suggested fix
See possible solution above. I will try to work on a PR for this. I think it could be a function like `sm.preprocess(<parquet_file>)`, into which either a predictions or an events parquet is passed. This function would be controlled by its own `preprocess.yaml`, which would specify the column-to-column transformations. In the case of numerical binning, this could closely wrap pandas' `pd.cut`.
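A minimal sketch of what such a function might look like, assuming a pandas-based implementation (the function name, config schema, and `right=False` edge handling are all illustrative assumptions, not an existing seismometer API):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, spec: dict) -> pd.DataFrame:
    """Apply column-to-column transformations described by a config dict.

    `spec` mirrors a hypothetical preprocess.yaml: each entry names a
    source column, an output column, and pd.cut-style bins/labels.
    """
    out = df.copy()
    for rule in spec.get("derived_columns", []):
        # Thin wrapper around pd.cut for numerical binning.
        out[rule["output"]] = pd.cut(
            out[rule["source"]],
            bins=rule["bins"],
            labels=rule.get("labels"),
            right=False,  # left-inclusive bins; an assumption, not settled design
        )
    return out

# Example: bin raw ages into labeled ranges.
df = pd.DataFrame({"Age": [12, 30, 47, 80]})
spec = {
    "derived_columns": [
        {
            "source": "Age",
            "output": "AgeRange",
            "bins": [0, 18, 35, 50, 65, 120],
            "labels": ["0-17", "18-34", "35-49", "50-64", "65+"],
        }
    ]
}
result = preprocess(df, spec)
```

Keeping the rules in a plain dict means the yaml layer stays a thin `yaml.safe_load` away, and the same spec can be iterated on during evaluation without touching the input parquet files.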