ENH: Simplified API for Data Conversions

adriangb / scikeras

Scikit-Learn API wrapper for Keras.

https://www.adriangb.com/scikeras/

MIT License

242 stars 50 forks source link

ENH: Simplified API for Data Conversions #83

Closed adriangb closed 4 years ago

adriangb commented 4 years ago

SciKeras manipulates input data (X and/or y) to make it agree with what Keras expects and to allow users to implement multi input/output models.

I'm opening this issue to centralize discussion surrounding the data transformation API. As discussed in #78 and #79, this API is convoluted and could use improvement. In principle, this is because we do not know anything about the loss function until the Model is compiled, but determining what loss to use may require information about the input data. We need to break this cyclical dependency if we want a clear API. I can think of several ways to do this:

interpret loss=None as "please select an appropriate loss for me given this input data". Data reshaping to match the loss function (ex: one-hot encoding for categorical_crossentropy) will only happen when SciKeras is allowed to determine the loss function.
KerasClassifier/Regressor (but not BaseWrapper) only work with the new model compiling API (#66). This allows the user to select the loss function but we can use information from that to manipulate the input data since we have it before calling model_build_fn.

Both of these initiatives could help address https://github.com/adriangb/scikeras/pull/66#pullrequestreview-495121341.

Neither of these options will likely work for advanced cases (like custom losses or multi input/output models) but in those cases, the users should be advanced enough to handle customizing the data pre/post processing themselves.

Separately, I feel that we should combine {pre,post}process_{X,y}, _check_output_model_compatibility and _utils.LabelDimensionTransformer into a single transformer. We can make these transformers __init__ params for the wrappers (which would allow on-the-fly overriding) or class attributes (which would be similar to the current subclassing interface).

```python3 class DataTransformerBase(ABC, TransformerMixin, BaseEstimator): def get_meta_params(self) -> Dict[str, Any]: # allows retrieval of `n_classes_` and such return {...} def fit(self, x: ndarray) -> None: return def transform(self, x: ndarray) -> Union[ndarray, Iterable[ndarray], Dict[ndarray]]: return x def inverse_transform(self, x: Union[ndarray, Iterable[ndarray], Dict[ndarray]]) -> ndarray: return np.column_stack(x) class KerasClassifierFeatureTransformer(DataTransformerBase): def __init__(self, loss: Union[Dict[str, Loss], Union[Loss, Iterable[Loss]]]): self.loss = loss def fit(self, x: ndarray) -> None: ... def transform(self, x: ndarray) -> Union[ndarray, Iterable[ndarray], Dict[ndarray]]: ... def inverse_transform(self, x: Union[ndarray, Iterable[ndarray], Dict[ndarray]]) -> ndarray: ... KerasClassifier( model=get_model, loss=None, feature_transformer=KerasClassifierFeatureTransformer ) ```

I'm open to any input on these intiatives.

stsievert commented 4 years ago

I thought about this more, and think the API on master is decent for the user:

Data processing functions ({pre,post}process_{X,y}) are public methods of BaseWrapper (allows the user to customize them for their specific problem).
BaseWrapper: no processing (which enables users to handle their own data and do their own processing).
KerasClassifier and KerasRegressor have special handling for their particular problems (like binary/multiclass/etc, which enables Scikit-learn compatibility).

Some internal refactoring to make the logic simpler might be warranted. This might involve making Keras-specific data processing transformers public, as discussed below.

I feel that we should combine {pre,post}process_{X,y}, _check_output_model_compatibility and _utils.LabelDimensionTransformer into a single transformer.

How far could that go? Wouldn't the transformer depend on the loss function? I'm imagining a refactoring like this:

class KerasClassifier(BaseWrapper):
    def preprocess_y(self, y):
        ests = {"binary": BinaryClassifierTransformer, "multiclass": MultiClassifierTransfomer, ...}
        est = ests[target_type]()
        return est.fit_transform(y), est.meta_params()

If implemented, I think these transformers should be public in case users want to use them.

adriangb commented 4 years ago

How far could that go?

The hope would be to encapsulate all data conversions into a single transformer. {pre,post}process_{X,y} conceptually map 1:1 with transform and inverse_transform and in fact are largely based on calling a series of transformers. They also throw values back to the fit/main loop just so that they can be saved and then passed to the counterpart, while if the transformers were classes instead of functions, they'd be able to store that state internally.

Wouldn't the transformer depend on the loss function?

Yes, but I feel that this problem (i.e. the circular dependency) is the same for the current function-based implementation and a transformer-based implementation. I think that regardless of which solution we end up with, we need to figure out how to solve this problem.

I envision something like this:

transformers.py (or something like that; see bottom of https://github.com/adriangb/scikeras/issues/83#issue-708598300 for more detail)

class KerasClassifierFeatureTransformer(DataTransformerBase):

    def __init__(self, loss):
        self.loss = loss   # compiled loss function/functions

    def fit(self, x: ndarray) -> None:
        ...

    def transform(self, x: ndarray) -> Union[ndarray, Iterable[ndarray], Dict[ndarray]]:
        ...

    def inverse_transform(self, x: Union[ndarray, Iterable[ndarray], Dict[ndarray]]) -> ndarray:
        ...

in wrappers.py:

class KerasClassifier(BaseWrapper):
    def fit(X, y, ...):
        ...
        X, y = self._validate_data(X=X, y=y, reset=reset)  # line ~ 607 in master
        self.data_transformer_ = self.data_transformer()  # data_transformer is a class attribute? and __init__ param?
        X = self.data_transformer_.fit_transform(X)
        self.loss_ = ...  # compile losses here? only if user gave loss=None? otherwise don't reshape data?
        self.target_transformer_ = self.target_transformer(loss=self.loss_)
        y = self.target_transformer_.fit_transform(y)
        ...

    def predict(X):
       X = self.data_transformer_.fit_transform(X)
       y = self.model_.predict(X)
       return self.target_transformer_.inverse_transform(y)

stsievert commented 4 years ago

Got it. I see what you're saying now. Why aren't Scikit-learn pipelines a suitable replacement for {pre,post}proceses_X?

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression

X, y = make_regression()
X[:, 0] += 400

est = Pipeline([
    ("normalize", StandardScaler()),
    ("keras", MLPRegressor()),
])
est.fit(X, y)

This would also work with post-processing to, as long as the user defined transform or fit_transform. Here's an example with images:

Autoencoder example

``` python from sklearn.preprocessing import MinMaxScaler, FunctionTransformer from sklearn.datasets import load_digits from sklearn.decomposition import PCA X, y = load_digits(return_X_y=True) normalize = MinMaxScaler() # normalize all pixels to (0, 1) est = Pipeline([ ("normalize", normalize), ("autoencoder_standin", PCA()), ("unnormalize", FunctionTransformer(normalize.inverse_transform)), ]) est.fit(X, y) ``` I'm not integrating SciKeras in this example, or showing the [Keras autoencoder implementation][2] – autoencoders only require `transform = predict` in their class implementation.

compile losses here? only if user gave loss=None? otherwise don't reshape data?

In my experience, users with their own model know what loss function they're minimizing (but to be fair I study optimization). From that, I think loss=None should raise a ValueError (and that's not to say KerasClassifier/Regressor can't have a sensible default values for loss).

adriangb commented 4 years ago

Why aren't Scikit-learn pipelines a suitable replacement for {pre,post}proceses_X?

I don't fully understand how your examples would replace {pre,post}proceses_X, unless we had KerasClassifierTargetEncoder (say that's the transformer that absorbs what {pre,post}process_y does) like this:

est = Pipeline([
    ("normalize", KerasClassifierTargetEncoder()),
    ("keras", MLPRegressor()),
])
est.fit(X, y)

If I understood correctly, I'd say the main issue with that is that the wrappers need access to meta parameters / context of the transformer, eg. to get the number of classes (which would normally be needed to build the output of the model).

But more generally, the output from {pre,post}proceses_y can be a list of arrays, or even lists of arrays of different length, depending on the model's output[s].* If we pass some funky data like this into fit it will end up in _validate_data which will either raise an error or coerce the data back into a single numpy array. So clearly this has to be done after that step.

users with their own model know what loss function they're minimizing

I generally agree. I guess we could just look at the value of loss!=None as follows: (1) loss is a known loss function (namely categorical_crossentropy, binary_crossentropy or sparse_categorical_crossentropy and their method/class counterparts): transform data to match (2) loss is not a known loss function: do the rest of the data conversion, but do not adjust for the loss function. This includes loss=None

loss=None should raise a ValueError

I think that if anything, we can check self.model_.loss, that way if users compile a model within build_fn they won't get an error, and we don't break any APIs.

* Keras supports a lot more data formats and target formats than ScikitLearn. The default implementation in {pre,post}proceses_{X,y} is converts common ScikitLearn task-types to what a basic Keras model implementing them would expect, thus easing the learning curve for ScikitLearn users. But eventually these users may want to try more advanced stuff, or even stuff that ScikitLearn doesn't really support. Then they can override these methods and output whatever format their model needs. Thus the output from {pre,post}proceses_{X,y} should be expected to be arbitrary data and it should be fed into the model untouched (except for that annoying windows dtype bug we did a workaround for).

stsievert commented 4 years ago

I don't fully understand how your examples would replace {pre,post}proceses_X, unless we had KerasClassifierTargetEncoder (say that's the transformer that absorbs what {pre,post}process_y does) like this:

If I understood correctly, I'd say the main issue with that is that the wrappers need access to meta parameters / context of the transformer, eg. to get the number of classes (which would normally be needed to build the output of the model).

I'm not saying anything about processing y for now; I'm only concerned with the processing of X.* I think {pre,post}process_X can be replaced with Scikit-learn pipelines if the population of meta is handled elsewhere. Here's the implementation I'm thinking of:

class BaseWrapper:
    def array_meta(self, X, y):
        return {"X.dtype": X.dtype, ...}

    def _build_keras_model(self, X, y):
        meta = self.array_meta(X, y)
        ...
        model = final_build_fn(..., meta=meta)
        ...
        return model

    def preprocess_y(self, y):
        ...  # still required

    def postprocess_y(self, y):
        ...  # still required

    preprocess_X = postprocess_X = None  # not required

Why isn't that valid? It'd still allow the user to customize metadata via the same method (subclassing). If https://github.com/adriangb/scikeras/issues/83#issuecomment-701137200 is incorporated into the documentation, it'd encourage the use of Scikit-learn transforms.

I think that if anything, we can check self.model_.loss, that way if users compile a model within build_fn they won't get an error, and we don't break any APIs

:+1:

^{*I specifically chose KerasRegressor in https://github.com/adriangb/scikeras/issues/83#issuecomment-701137200 towards that purpose but didn't clearly express that fact.}

adriangb commented 4 years ago

I see. But what problem does that solve?

{pre,post}process_X is not meant to be a replacement for Sklearn pipelines/transformers. Outputs are explicitly inputs to Keras Model's fit and predict, which are different from valid Sklearn Xs. You can't have a transformer in your pipeline that outputs a list of arrays.

That said, I do see how it could help to clarify in the docs that for creating a self-contained model users should edit {pre,post}process_X but for general transformations of their data, they should use Sklearn pipelines like you show above. We don't want users thinking that {pre,post}process_X are a good place to implement general data preprocessing.

Edit: accidentally closed, didn't mean to.

adriangb commented 4 years ago

Going back to the discussion of how to break the dependency between the target encoding and loss function, I'm thinking the way to go is to use the loss parameter as follows:

(1) loss={known_loss_fn}. KerasClassifier uses this to encode the input. The values in meta (y_shape_, etc.) correspond to this final encoding. We check that the actual loss function (model_.loss) matches the user-inputted loss function and warn if it does not (no error, that would break the old API and prevent porting a lot of code). (2) loss!={known_loss_fn}. Do not transform the input to match the loss function. If model_.loss turns out to be a known loss function, show a warning (no error, that would break the old API and prevent porting a lot of code).

Do you think this sounds reasonable @stsievert ?

stsievert commented 4 years ago

To be clear, this discussion if for KerasClassifier and BaseWrapper/KerasRegressor aren't relevant, correct? And the relevant code is in the _check_output_model_compatibility and preprocess_y KerasClassifier methods?

Why is it necessary to make the processing code specific to each loss function? Why does some of that happen in _check_output_model_compatibility instead of in preprocess_y? Here's the code I'm thinking of:

https://github.com/adriangb/scikeras/blob/b8f1b63febfb8dfb3c29a3bd6055fc430c8f9f3b/scikeras/wrappers.py#L1051

That's the only line of code I'm seeing that ties the encoding to the loss function. Is that correct?

If I were developing SciKeras, I'd first try to make KerasClassifier a class that a user could feasibly write. I'd probably make some public utilities to make doing basic data manipulation data easier. I'm be inclined to write some code like this:

import scikeras.utils as utils

class KerasClassifier(BaseWrapper):
    def array_metadata(self, X, y):
        n_classes = len(self.encoder_.classes_)
        ...
        return {"X.dtype": X.dtype, ..., "n_classes": n_classes}

    def get_encoders(self, X, y):
        target_type = ...
        transformers = {
            "binary": utils.KerasBinary,
            "multiclass": utils.KerasMultiClass,
            "multilabel-indicator": utils.KerasMultiIndicator,
            "multiclass-multioutput": utils.KerasMultiClassMultiOut,
        }
        return utils.IdentityTransform(), transformers[target_type]()

    def _preprocess_y(self, y):
        self.encoder_ = self.encoders(X, y)
        return self.encoder_.fit_transform(y)

    def _postprocess_y(self, y):
        return self.encoder_.inverse_transform(y)

I think all the logic in KerasClassifier.preprocess_y could be hidden in utils.Keras*. Then if a user wants to use other parts of KerasClassifier (i.e, the score or predict_proba methods) they only have to write this class:

class CustomClassifier(KerasClassifier):
    def get_encoders(self, X, y):
        encoders = super().get_encoder(X, y)
        return encoders[0], CustomEncoder()

(edited to include get_encoders)

adriangb commented 4 years ago

Why is it necessary to make the processing code specific to each loss function?

The practical reason is that that is what the original wrappers did: https://github.com/tensorflow/tensorflow/blob/22d6b01650422e88de48f9baf7a2eb16e6f894a4/tensorflow/python/keras/wrappers/scikit_learn.py#L159. I believe that the intention was to make it easy for users since in sklearn world targets aren't usually one-hot-encoded and they might be confused for multilabel-indicator. But perhaps this is an API that we should break? It doesn't seem that bad to do a single if check in preprocess_y (which is what the implementation would look like if we use the __init__ loss param instead of model_.loss) (more below)

Why does some of that happen in _check_output_model_compatibility instead of in preprocess_y?

Because previously there was no loss parameter, so the only reliable way to to check the loss function was model_.loss, which you don't have until after preprocess_y.