alegonz / baikal

A graph-based functional API for building complex scikit-learn pipelines.
https://baikal.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Next development steps and backwards-incompatible changes #16

Open alegonz opened 4 years ago

alegonz commented 4 years ago

I don't know how many people are using this library, but from now on I'll make an effort to post in this thread, in advance, any new features and changes that I plan to make to the API.

Please be aware that baikal is still a young project and it might be subject to backwards-incompatible changes. The major version (following semver) is still zero, meaning that changes might happen at any time. Currently there is no deprecation policy. I don't think there is a significant user base yet, so development will be rather liberal about introducing backwards-incompatible changes if they are required to make the API easier to use, less error-prone, or able to handle important use cases. That said, I'll make an effort to keep backwards-incompatible changes to a minimum.

If you are using baikal (thank you!), I'd suggest subscribing to this thread so you are notified of upcoming changes.

Comments and discussions are of course welcome in this thread :)

(This thread was inspired by the one used by the trio project)

alegonz commented 4 years ago

New features and changes planned for 0.3.0

1) Specify function and trainable arguments when calling the step on inputs, and rename function to compute_func.

This will be a backwards-incompatible change, necessary for the other two changes described below.

The idea is that instead of doing this:

step = LogisticRegression(function="predict_proba", trainable=True)(x, y_t)

you would do

step = LogisticRegression()(x, y_t, compute_func="predict_proba", trainable=True)

so that it becomes possible to call the same step (a shared step) with different behaviors on different inputs. For example, learned target transformations would be expressed as:

x = Input()
y_t = Input()
scaler = StandardScaler()
y_t_transformed = scaler(y_t, compute_func="transform", trainable=True)
y_p_transformed = LinearRegression()(x, y_t_transformed)
y_p = scaler(y_p_transformed, compute_func="inverse_transform", trainable=False)  # reuse parameters fitted above
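
For completeness, here is a minimal sketch of how the graph above could be turned into a model and trained. It assumes the existing Model(inputs, outputs, targets) API, that StandardScaler and LinearRegression above are the sklearn classes wrapped as baikal steps, and placeholder arrays X_train, y_train and X_test:

model = Model(x, y_p, y_t)
model.fit(X_train, y_train)      # fits the scaler on the targets, then the regression on the scaled targets
y_pred = model.predict(X_test)   # predictions are mapped back to the original target scale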

Both compute_func and trainable would be keyword-only arguments. This is to make client code more readable and to allow baikal to change the order in the future without breaking existing code.

The renaming of function to compute_func is to be consistent with the future fit_compute_func argument described below.

2) Make steps shareable.

(See Issue #11 for the original discussion.)

The idea is that steps could be called an arbitrary number of times on different inputs with different behaviors at each call (e.g. trainable + transform function in the first call, non-trainable + inverse transform function in the second call).

The motivation is to allow reusing steps and their learned parameters on different inputs (similar to what Keras does with shared layers). Having shared steps is particularly important for reusing learned transformations on targets, as in the example above. It would also allow reusing steps like Lambda to apply the same computation (e.g. casting data types, dropping dimensions) to several inputs; a rough sketch is shown below. Currently, calling a step with new inputs overrides the connectivity of the first call, so this is not possible yet. One could perhaps work around this limitation by having a step hold pointers to the parameters of an earlier step, but that might end up being unwieldy.
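
As a rough sketch of the Lambda use case under the proposed semantics (the exact Lambda signature and import path are assumed here, and x1/x2 are just placeholder inputs):

from baikal import Input
from baikal.steps import Lambda  # import path assumed

x1 = Input()
x2 = Input()

# One step instance applied to two different inputs. Under the current API the
# second call would override the connectivity of the first; under the proposed
# API each call creates its own outputs while sharing the same step.
to_float = Lambda(lambda X: X.astype("float32"))
z1 = to_float(x1)
z2 = to_float(x2)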

3) Add API support for fit_transform and fit_predict.

(See Issue #13 for the original discussion.)

The motivation is three-fold:

  1. Make custom fitting protocols possible, such as the common stacking protocol that uses out-of-fold (OOF) predictions in the first level. (The current stacked classifier example is a naive one that does not use OOF predictions, so the second-level classifier is prone to favor an overfitted classifier from the first level.)
  2. Allow the use of transductive estimators (e.g. sklearn.manifold.TSNE, sklearn.cluster.AgglomerativeClustering).
  3. Leverage estimators whose fit_transform is more efficient than calling fit and transform separately.

Currently the above is not possible because Model.fit runs each step's fit and predict/transform methods separately, making it impossible to control them jointly. To make this kind of training protocol possible, I plan to add a fit_compute API that gives you more control over the computation at fit time (*1). The idea is that, for example, in the case of a stacked classifier, you would define the method in the first-level steps like this:

def fit_compute(self, X, y, **fit_params):
    # 1) Train the step as usual, using the full data.
    # This fits the parameters that will be used at inference time.
    super().fit(X, y, **fit_params)

    # 2) Compute cross-validated predictions. These will be passed
    # to the classifier in the next level to be used as features.
    y_p_cv = cross_val_predict(self, X, y, cv=self.cv)
    return y_p_cv

and Model.fit will give precedence to this method when fitting the step. This should allow defining the stacked model once and fitting it with a single call to model.fit, without having to build and train the first and second stages separately.
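
To make the intent concrete, here is a rough sketch of how such a first-level step and the single-call fit might look. It assumes the proposed fit_compute protocol and call-time compute_func argument; the OOF step class, its cv attribute, the make_step wrapping, and the placeholder arrays X_train and y_train are illustrative only:

from sklearn.linear_model import LogisticRegression as SKLogisticRegression
from sklearn.model_selection import cross_val_predict
from baikal import Input, Model, make_step

LogisticRegressionStep = make_step(SKLogisticRegression)

class LogisticRegressionOOF(LogisticRegressionStep):
    # First-level step: fit on the full data, but return out-of-fold
    # predictions at fit time so the next level does not see overfitted features.
    cv = 5

    def fit_compute(self, X, y, **fit_params):
        super().fit(X, y, **fit_params)
        return cross_val_predict(self, X, y, cv=self.cv, method="predict_proba")

x = Input()
y_t = Input()
y_p1 = LogisticRegressionOOF()(x, y_t, compute_func="predict_proba")
y_p = LogisticRegressionStep()(y_p1, y_t)  # second-level classifier
model = Model(x, y_p, y_t)
model.fit(X_train, y_train)  # one call trains both levels using the OOF protocol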

Analogously to compute_func, a fit_compute_func argument will also be added to Step.__call__ so client code can specify arbitrary methods.
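
For example, something along these lines (the method name fit_predict_oof is hypothetical):

y_p = LogisticRegression()(x, y_t, fit_compute_func="fit_predict_oof", compute_func="predict_proba")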

fit_transform (transformers) and fit_predict (classifiers/regressors) are special cases of fit_compute and will be detected and used by Model.fit if the step implements either.

New features and changes planned for 0.5.0 (previously 0.4.0) and later