alegonz / baikal

A graph-based functional API for building complex scikit-learn pipelines.
https://baikal.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Training model stacks #13

Closed ablaom closed 4 years ago

ablaom commented 4 years ago

This example uses baikal to build a stack. Does training this stack follow the "standard" protocol, as described, for example, in https://www.kdnuggets.com/2017/02/stacking-models-imropved-predictions.html ? In particular, is there a division of the input data into folds, with each bottom-level model M_i predicting a "feature" column for the arbitrating model using the complement of the data used to train M_i? If not, could you say how the stack is actually trained?

**Edit:** That is, the base models deliver "out-of-sample" predictions to the adjudicating model.

ablaom commented 4 years ago

And in any case, can I build my own stack "by hand" using baikal in a way that follows the standard protocol?

alegonz commented 4 years ago

@ablaom

This example uses baikal to build a stack. Does training this stack follow the "standard" protocol ... ?

The example is a naive stack. Each model fits on the full dataset and passes its in-sample predictions to the next level, so the final classifier will be prone to giving more weight to an overfit sub-classifier.
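For reference, the naive stack looks roughly like this (a sketch based on the API described in the README; the particular sub-classifiers are placeholders and may differ from the actual example):

```python
import sklearn.ensemble
import sklearn.linear_model
from baikal import Input, Model, make_step
from baikal.steps import Concatenate

# Wrap scikit-learn estimators as baikal steps.
LogisticRegression = make_step(sklearn.linear_model.LogisticRegression)
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier)

x = Input()
y_t = Input()

# Level-0 classifiers, each (naively) trained on the full dataset.
y_p1 = LogisticRegression()(x, y_t)
y_p2 = RandomForestClassifier()(x, y_t)

# Their predictions become the features of the final classifier.
ensemble_features = Concatenate()([y_p1, y_p2])
y_p = LogisticRegression()(ensemble_features, y_t)

model = Model(x, y_p, y_t)
model.fit(X_train, y_train)  # every level sees the same full training data
```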

And in any case, can I build my own stack "by hand" using baikal in a way that follows the standard protocol?

Not yet. At least not in a terse, idiomatic way.

At present, you could do something like the following:

  1. Build the pipeline as in the naive example.
  2. Define a model up to the Concatenate step preceding the final classifier.
    model1 = Model(x, ensemble_features, y_t)
  3. Generate predictions following the protocol.
    # call model1.fit and model1.predict k-times manually on
    # folds of X_train/y_train, and compute the cross-validated 
    # predictions cv_features
    # (or perhaps use sklearn.model_selection.cross_val_predict)
  4. Train the model using the full data (to fit the parameters used at inference time).
    model1.fit(X_train, y_train)
  5. Define a model from the output of concatenate up to the output of the final classifier.
    model2 = Model(ensemble_features, y_p, y_t)
  6. Train this model on cv_features.
    model2.fit(cv_features, y_train)
  7. Define the full model (the classifiers are already fitted from steps 4 and 6).
    model = Model(x, y_p, y_t)
  8. Compute out-of-sample predictions.
    model.predict(X_test)
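Concretely, steps 2–8 might look like this (a rough sketch; `X_train`, `y_train`, and `X_test` are NumPy arrays, and `x`, `ensemble_features`, `y_p`, `y_t` come from the naive example above):

```python
import numpy as np
from sklearn.model_selection import KFold

# 2. Model from the inputs up to the concatenated level-0 predictions.
model1 = Model(x, ensemble_features, y_t)

# 3. Out-of-fold predictions: each row of cv_features is predicted by a
#    model1 that never saw that row during fitting.
preds, indices = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model1.fit(X_train[train_idx], y_train[train_idx])
    preds.append(model1.predict(X_train[test_idx]))
    indices.append(test_idx)
order = np.argsort(np.concatenate(indices))
cv_features = np.concatenate(preds)[order]  # restore the original row order

# 4. Refit the level-0 models on the full data for inference time.
model1.fit(X_train, y_train)

# 5./6. Final classifier, trained on the out-of-fold features.
model2 = Model(ensemble_features, y_p, y_t)
model2.fit(cv_features, y_train)

# 7./8. Full model; every step is already fitted.
model = Model(x, y_p, y_t)
y_test_pred = model.predict(X_test)
```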

This is cumbersome, though.

I recognize this is an important protocol that I failed to consider when designing the API. The problem is that currently Model.fit runs each step's fit and predict methods separately, making it impossible to control them jointly. To make this kind of training protocol possible, I'm thinking of a fit_predict API that gives you more control over the computation at fit time (*1). The idea is that you would define the method in the appropriate steps (perhaps with a mixin) like this:

from sklearn.model_selection import cross_val_predict

def fit_predict(self, X, y, **fit_params):
    # 1) Train the step as usual, using the full data.
    # This fits the parameters that will be used at inference time.
    super().fit(X, y, **fit_params)

    # 2) Compute cross-validated predictions. These will be passed
    # to the classifier in the next level to be used as features.
    y_p_cv = cross_val_predict(self, X, y, cv=self.cv)
    return y_p_cv

And Model.fit will give precedence to this method when fitting the step. This should allow defining the stacked model once and fitting it with a single call to model.fit.
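With that in place, the naive example above would (hypothetically) train as a proper stack with no change to the fitting code:

```python
model = Model(x, y_p, y_t)
model.fit(X_train, y_train)  # Model.fit would call fit_predict on steps that define it
```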

*1 fit_predict would be the analogue of fit_transform (which is part of the sklearn API) for classifiers/regressors. Support for fit_transform is also a TODO.

ablaom commented 4 years ago

Many thanks indeed for this comprehensive answer!