Additional challenges related to feature engineering code

A common ML workflow I've observed includes transforming raw data into model-ready data ("feature engineering"). This covers both transformations/aggregations (e.g., the average of the last 5 values, or converting a timestamp to a day of the week) and pre-processing required by a specific ML algorithm (e.g., scaling numeric values to the 0 to 1 range).
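To make the three examples above concrete, here is a minimal sketch in plain Python. The function names (`avg_last_n`, `day_of_week`, `min_max_scale`) and the choice of UTC are my own assumptions for illustration, not part of any particular library or pipeline.

```python
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def avg_last_n(values, n=5):
    """Aggregation: average of the last n values (fewer if history is short)."""
    if not values:
        raise ValueError("need at least one value")
    window = values[-n:]
    return sum(window) / len(window)


def day_of_week(unix_seconds):
    """Transformation: Unix timestamp (assumed UTC) to a weekday name."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return WEEKDAYS[dt.weekday()]


def min_max_scale(value, lo, hi):
    """Pre-processing: scale a value into [0, 1] given the observed min and max."""
    if hi == lo:
        raise ValueError("min and max must differ")
    return (value - lo) / (hi - lo)


avg = avg_last_n([1, 2, 3, 4, 5, 6, 7])   # averages 3, 4, 5, 6, 7
weekday = day_of_week(0)                  # 1970-01-01 00:00:00 UTC
scaled = min_max_scale(30.0, 10.0, 50.0)
```

Note that `min_max_scale` needs `lo` and `hi` from the training data, which is one reason this kind of step tends to travel with the model, while the first two depend only on the raw inputs.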

While pre-processing can be (and often is) done within the model's graph (e.g., as part of a scikit-learn Pipeline object), transformations/aggregations almost always happen outside of the model itself.
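As a rough illustration of pre-processing that lives inside the model artifact, here is a small scikit-learn sketch. The toy data, the step names (`"scale"`, `"model"`), and the choice of `LogisticRegression` are assumptions made up for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy training data (made up): one numeric feature and a binary label.
X_train = [[10.0], [20.0], [30.0], [40.0]]
y_train = [0, 0, 1, 1]

# The scaler is fitted together with the model and is saved and deployed with it,
# so the same scaling is applied at training time and at inference time.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# At inference the caller passes the raw value; scaling happens inside the pipeline.
raw_request = [[35.0]]
prediction = pipeline.predict(raw_request)
scaled_value = pipeline.named_steps["scale"].transform(raw_request)[0][0]
```

An aggregation such as "average of the last 5 values" has no natural place in this object, because it needs history that a single inference request usually does not carry.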

Often, the nature of the data used for model training (offline, processed using an analytical database or a big data tool like MapReduce) means that the feature engineering code differs between model training and model inference (particularly if inference uses real-time data sent directly from the application to the model server's API). Yet the logic must be identical in both places for the model to work, which requires careful attention from data scientists and ML engineers.
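One way a CI step could guard against the two implementations drifting apart is a parity check that runs both on the same inputs. The sketch below is only an assumption about how that might look: `offline_avg_last_5` stands in for the batch job, `online_avg_last_5` stands in for the request-time code, and `check_parity` is a hypothetical helper.

```python
def offline_avg_last_5(amounts):
    """Batch-style stand-in: for every row in a time-ordered column,
    average the current value and up to 4 preceding values."""
    features = []
    for i in range(len(amounts)):
        window = amounts[max(0, i - 4): i + 1]
        features.append(sum(window) / len(window))
    return features


def online_avg_last_5(recent_amounts):
    """Request-time stand-in: the application sends recent values, newest last."""
    window = recent_amounts[-5:]
    return sum(window) / len(window)


def check_parity(amounts, tol=1e-9):
    """Replay the history through the online path and compare each result
    with the corresponding offline feature value."""
    offline = offline_avg_last_5(amounts)
    for i, expected in enumerate(offline):
        actual = online_avg_last_5(amounts[: i + 1])
        if abs(actual - expected) > tol:
            return False
    return True


history = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
offline_features = offline_avg_last_5(history)
parity_ok = check_parity(history)
```

A check like this only covers the cases in the sample data, so it reduces the risk of training/serving skew rather than removing it.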

This creates significant complexity (for CI/CD processes and beyond). To start the conversation, I tried to outline a few of the challenges here.

Note 1: I debated creating separate rows in the table for "feature engineering" but instead tried to fit it into the existing rows. Very open to feedback here!

Note 2: I believe the best practice is to have a single implementation of the feature engineering logic, but the tooling/infrastructure we have today isn't mature enough to allow this in all cases.

cdfoundation / sig-mlops

Additional challenges related to feature engineering code #57