Create Cross Validation Framework

TrentonBush commented 2 years ago

The core of model building is evaluation - does the new model improve on the old one? Does it generalize outside of the training data? Answering those questions requires an appropriate cross validation framework that emulates the real application.

In our case, fuel price data is redacted for about 1/3 of records, and we want to impute it. The data represent fuel purchases at individual power plants over time.

Outline of Work

[ ] examine and characterize how fuel_price_per_mmbtu goes missing - is it per plant, per year, correlate with other columns, etc
[ ] identify key requirements of cross validation (CV)
[ ] design and implement the CV framework, ideally using sklearn built-ins

Considerations

The primary goal here is to avoid using predictive information that may link train and test sets but will not exist between observed data and imputed data.

For example, about half of the price data are part of long term contracts, which means there is likely an informational link between records part of the same contract. if we take a random row-wise subset of our observations, training set records belonging to that contract will be extra informative about test set records belonging to that same contract. Our model will perform very well according to our cross validation.

But in our actual application, whole plants are redacted, so we will not have access to any values that share contracts. The model will not be able to take advantage of that information and will operate with a) reduced performance, and b) unknown performance, because we did not evaluate it under realistic conditions.

zaneselvans commented 2 years ago

What do we need to define for the cross validation? Can we create a custom method for subsetting the records and just plug that into existing sklearn cross validation infrastructure?

Can we get something naive up and running first just with random subsets before we try and refine things? Is this kind of setup crazy?

params = {
    "hist_gbr__max_depth": [3, 8],
    "hist_gbr__max_leaf_nodes": [15, 31],
    "hist_gbr__learning_rate": [0.1, 1],
}
search = GridSearchCV(pipe, params)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(search, frc_data, frc_target, cv=cv)

Looking at how the missingness breaks down by fuels...

natural_gas:

55% (MMBTU)
34% (deliveries)

coal

26% (MMBTU)
29% (deliveries)

petroleum

31% (MMBTU)
41% (deliveries)

zaneselvans commented 2 years ago

Blurgh. Okay, I spent some time re-familiarizing myself with sklearn yesterday evening and finally managed to get an extremely basic model running in this notebook

I couldn't figure out how to pass the sample weights in to the cross validation though. And also I don't know how to evaluate the "test scores." Is it an error metric? Is it supposed to be zero? It does seem consistently indistinguishable from zero.

zaneselvans commented 2 years ago

Suppsedly the HistGBR model gracefully works with NA values, but when I leave NA values in the categorical (string) columns to be encoded, the OrdinalEncoder fails, complaining that

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['NAType', 'str']

Even though OrdinalEncoder can also supposedly handle NA values and retain them to be passed through to the HistGBR model.

zaneselvans commented 2 years ago

It looks like the GroupKFold iterator is the kind of thing that we would need to avoid (for example) learning about prices for a particular (plant_id_eia, fuel_group_eiaepm) combination in the training phase, in a way that isn't applicable with real data (if whole swathes of plant-fuel tend to get redacted. Or maybe GroupShuffleSplit where the groups are defined by plant_id_eia so all train-test is happening across plant boundaries?

TrentonBush commented 2 years ago

Do you have to use OrdinalEncoder or will it take pandas categorical dtypes?

TrentonBush commented 2 years ago

Ya I think there will be a built in that works for us. Just have to see what we need first!

TrentonBush commented 2 years ago

I don't know how to evaluate the "test scores." Is it an error metric?

Yes, there are two things here. 1) what objective function is the GBDT optimizing, and 2) what metric(s) are we using to evaluate the success of the model. The objective and metric are usually the same, but can be different under some circumstances (if you have some complex custom metric that you can't convert to a twice-differentiable objective function, it can hurt performance, for example).

For our purposes I think l1 aka mae (mean absolute error) is a good choice. Minimizing l1 error produces an estimate of the median, compared to l2 aka mse (mean squared error) which produces an estimate of the mean. Because our dataset seems to have some wild outliers, a median is probably a less biased central estimate (unless we can get rid of those outliers).

There are a bunch of other possible objective funcs and metrics out there we can play with, but changing the objective function is changing the purpose of the model. That is a fundamentally different thing compared with changing model "hyperparameters" such as max_depth etc, which change the tactics/implementation of the model.

I couldn't figure out how to pass the sample weights in to the cross validation though.

I think there are two param dicts here, one with lists of hyperparameters to grid search, and one of constant values to pass to the model. If putting sample weights in the second one doesn't work, I'm not sure how to do it. Would have to do some digging

TrentonBush commented 2 years ago

As far as interpreting the error metrics, lower is better, with one caveat. Error on the training set is often smaller than error on the test set. This is called 'overfitting', because the model has essentially memorized fine details of the training data that don't generalize to new data points in the test set. So we want to keep training the model basically until test error stops decreasing or starts increasing again (this is what "early stopping" does in an automated way).

When optimizing hyperparameters, you would usually choose the params that have produced the smallest error on the validation set. An exception might be if the model is unstable in some way, like if a multi-part CV like k-fold shows that though the average error is lowest, there is high variance between folds. This indicates that your CV is probably not set up properly and the folds contain dissimilar data.

zaneselvans commented 2 years ago

Do you have to use OrdinalEncoder or will it take pandas categorical dtypes?

It would be great if it could just take a native categorical type! And they're stored as integers under the hood anyway I think. But all the examples I've seen thus far are still encoding categorical columns, and they seem to have a strong preference for the OrdinalEncoder with this model, since it can happily deal with having all the categories in a single column.

Edit: Indeed, there is native CategoricalDtype support.

zaneselvans commented 2 years ago

Hmm, even using the "native" categorical support you still have to run it through the OrdinalEncoder, but you can have it select the columns that it's applied to based on the dtype of the columns.

Weirdly, it seems like you then have to pass the integer indices of the categorical columns to the model. Is there really no way to just give it the column names and have it pick out the right columns regardless of what order they're showing up in?

TrentonBush commented 2 years ago

Huh, apparently? The LightGBM model was much more user friendly in that regard. It just asked that categorical values had pd.Category dtypes

zaneselvans commented 2 years ago

I probably just don't understand how to use it correctly.

zaneselvans commented 2 years ago

Superceded by #1767

catalyst-cooperative / pudl

Create Cross Validation Framework #1709

Outline of Work

Considerations