pmccaffrey6 opened this issue 1 month ago
Thanks for writing this up! While not fully advertised just yet, there is some naive support for binning that comes close to your described example. Adding the block below to the data_usage configuration will create a new column, Binned Age, which will be available in cohort selection (without a display_name it would instead overwrite/transform Age), with the inner bin edges as listed:
```yaml
cohorts:
  - source: Age
    display_name: Binned Age
    splits:
      - 18
      - 35
      - 50
      - 65
```
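For reference, four inner edges partition Age into five bins. A minimal pure-Python sketch of that partitioning logic (hypothetical; the edge inclusivity and labels here are assumptions, not seismometer's actual implementation):

```python
import bisect

# Inner bin edges mirroring the splits in the cohorts config above;
# values below 18 and at/above 65 fall into open-ended outer bins.
SPLITS = [18, 35, 50, 65]

def bin_label(age, splits=SPLITS):
    """Return a readable label for the bin containing `age`.

    Uses bisect_right, so each inner edge starts a new bin
    (e.g. 18 lands in "18-35"); this inclusivity is an assumption.
    """
    i = bisect.bisect_right(splits, age)
    if i == 0:
        return f"<{splits[0]}"
    if i == len(splits):
        return f">={splits[-1]}"
    return f"{splits[i - 1]}-{splits[i]}"
```

For example, `bin_label(42)` yields `"35-50"`.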
As mentioned, what is there is pretty minimal, so let's keep this conversation in design mode for a bit (I don't want wasted effort on prototyping) while we pinpoint the deficiencies.
There is a surprising amount to balance, such as what fits in this package vs preprocessing ahead of data load.
We are still figuring out how to effectively communicate our plans, but we do have some ideas for improving how dataframe filtering is passed around, which may affect the implementation here as well.
Thanks a lot, @diehlbw, this is very helpful. Totally agree on the magnitude of balancing between what's in the package vs preprocessing. I would be happy to remain in a design mode on this.
I think a useful rule for deciding whether something lives in the package vs in preprocessing could be whether it would reasonably be iterated on and tweaked during evaluation itself. With age as an example, I could definitely see iterating over binning schemes as part of an evaluation task to profile performance, which feels more active than preprocessing.
Problem Summary
Currently, it appears that variables like age are expected to be cut into ranges before being processed by seismometer. However, how age is cut into ranges could itself be a tunable knob. That seems more fitting to happen within seismometer, so that direct fact tables can be supplied for predictions and events and seismometer can transform variables like age into ranges itself. These ranges could be configured in a yaml file as a data preprocessing step.
Impact
It seems best to separate input data from data processing. If input data is expected to be already transformed to some extent (such as having age broken up into ranges), that adds complexity when aggregating data from multiple sources or sites. Also, some model testing workflows would reasonably try cutting age into different sets of ranges, so this abstraction seems best suited to seismometer itself. This would make for a cleaner separation between input data and model testing (which would especially help in situations like multi-site data aggregation), and it would allow seismometer's analysis to include things like testing different age breakups.
Possible Solution
seismometer could include a small preprocessing function driven by a preprocessing yaml. This could cover simple operations such as creating new derived columns, for example transforming raw age values into age ranges.
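One possible shape for such a configuration (purely illustrative; keys like `derived_columns` are hypothetical, not an existing seismometer schema):

```yaml
preprocess:
  derived_columns:
    - source: Age
      output: AgeRange
      method: cut
      bins: [0, 18, 35, 50, 65, 120]
      labels: ["0-17", "18-34", "35-49", "50-64", "65+"]
```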
Steps to Reproduce
Load a predictions table with more than 25 distinct age values.
Suggested fix
See possible solution above. I will try to work on a PR for this. I think it could be a function like `sm.preprocess(<parquet_file>)`, into which either a predictions or an events parquet is passed. This function would be controlled by its own `preprocess.yaml`, which would specify the column-to-column transformations. In the case of numerical binning, this could closely wrap pandas' `pd.cut`.
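A minimal sketch of what such a function might look like, assuming a pandas-based implementation (the function name, config schema, and `right=False` edge handling are all illustrative assumptions, not an existing seismometer API):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, spec: dict) -> pd.DataFrame:
    """Apply column-to-column transformations described by a config dict.

    `spec` mirrors a hypothetical preprocess.yaml: each entry names a
    source column, an output column, and pd.cut-style bins/labels.
    """
    out = df.copy()
    for rule in spec.get("derived_columns", []):
        # Thin wrapper around pd.cut for numerical binning.
        out[rule["output"]] = pd.cut(
            out[rule["source"]],
            bins=rule["bins"],
            labels=rule.get("labels"),
            right=False,  # left-inclusive bins; an assumption, not settled design
        )
    return out

# Example: bin raw ages into labeled ranges.
df = pd.DataFrame({"Age": [12, 30, 47, 80]})
spec = {
    "derived_columns": [
        {
            "source": "Age",
            "output": "AgeRange",
            "bins": [0, 18, 35, 50, 65, 120],
            "labels": ["0-17", "18-34", "35-49", "50-64", "65+"],
        }
    ]
}
result = preprocess(df, spec)
```

Keeping the rules in a plain dict means the yaml layer stays a thin `yaml.safe_load` away, and the same spec can be iterated on during evaluation without touching the input parquet files.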