adriangb / scikeras

Scikit-Learn API wrapper for Keras.
https://www.adriangb.com/scikeras/
MIT License
241 stars 47 forks source link

[feature request] add type hints #201

Open stsievert opened 3 years ago

stsievert commented 3 years ago

A recent example has confusion around the input arguments. This is the code in question:

class Autoencoder(...):
    def score(self, X) -> float:
        reconstructed = self.predict(X)
        return binary_accuracy(X, reconstructed).numpy().mean()

What type is X and reconstructed? The type of the intermediate variable (output of binary_accuracy) can't be a NumPy array because calling .numpy on a NumPy array raises an exception (ValueError). If that's the case, does binary_accuracy accept NumPy inputs and always produce TF Array outputs? Or does self.predict perform that task?

Apparently X and reconstructed are NumPy arrays: https://github.com/adriangb/scikeras/pull/200#discussion_r579582072. That's not obvious to me.

Type hints would greatly help clear this confusion. It'd be useful to encode the information about how the input/output types relate (same type returned?), and what types are allowed as inputs.

stsievert commented 3 years ago

I think this will help with documentation too – I think Sphinx can pull out type information

adriangb commented 3 years ago

Totally agreed, I am a big fan of type-hints. That said, I primarily focus on type hints for functions I am writing, and rely on my IDE to give me information on e.g. the return type of self.predict or binary_accuracy. I don't foresee adding type hints within methods.

does binary_accuracy accept NumPy inputs and always produce TF Array outputs?

I believe it accepts mixed Numpy/Tensor inputs and always outputs a Tensor. At least in this case that is what is happening.

X is not so easy to type because it can be an array, a list, etc. Numpy has an ArrayLike type, but its only available in numpy>=1.20.0 which requires Python 3.7. Both of our main dependencies (Scikit-Learn and TensorFlow) still support 3.6, so I'm holding off on that until we're forced to or decide to move to 3.7.

Other than X's and `y's, most of SciKeras is typed, although very briskly and a lot of improvement is possible. If you can point out specific areas that need attention, I am happy to make the PRs.

Regarding Sphinx, it can indeed pull type hints. I tried switching it over, but the theme I am using right now did not play nicely with it. I was planning on doing a huge-diff PR to strip the type hints from the docstrings and switch Sphinx over to another theme, but I've been holding off to avoid having a ton of merge conflicts with open PRs...

stsievert commented 3 years ago

I think it'd be more useful to use typing.TypeVar rather than ArrayLike. Yes, ArrayLike would be nice, but my main concern is "what's the output type given the input type?" I'd be satisfied with this example:

from typing import TypeVar
import numpy as np
import tensorflow as tf

Tensor = TypeVar("Tensor", np.ndarray, tf.Tensor)

def predict(X: Tensor) -> Tensor:
    return foo(X)

This says "the function predict takes one input (either a NumPy ndarray or a TF Tensor) and returns an object of the same type." Yes, it doesn't have a full definition of what's allowed in NumPy arrays, but I find it sufficient.

adriangb commented 3 years ago

I think this does not necessarily reflect what is happening. predict in SciKeras (and most of Scikit-Learn) accepts any X that can be converted to a Numpy array by numpy.asarray. The return value depends on the model, but is usually a Numpy array, independently of the input. In other words, I believe the signature should be:

from typing import Union

XType = Union[List, pd.DataFrame, np.ndarray]  # etc, many other things probably, aka `ArrayLike`

def predict(X: XType) -> np.ndarray:
    ...

Thus the problem is not that the return type depends on the input type, rather that the input type can be many things, and those things are ultimately defined by Numpy, not SciKeras. To properly define the type, we need to rely on Numpy. I believe that under the hood Numpy uses Protocol, which makes sense because really it is a protocol/interface, not a Union of many types.

Even more accurate would be to say "the return type of predict is the return type of target_transformer.inverse_transform, but I think this is beyond Python's ability to express types.

stsievert commented 3 years ago

By default, does {KerasClassifier, KerasRegressor}.predict return a NumPy array or a TF Tensor? Is this different than the behavior of TF's wrappers in tensorflow.keras.wrappers.scikit_learn?

Does BaseWrapper.predict return a NumPy array or TF Tensor?


I'm mostly concerned with type hints to help developers (like me). I bet for >95% of users, saying a very simple alias of ArrayLike suffices. I think this would suffice: ArrayLike = Union[np.ndarray, pd.DataFrame, List[List[Number]])

adriangb commented 3 years ago

By default, does {KerasClassifier, KerasRegressor}.predict return a NumPy array or a TF Tensor?

Does BaseWrapper.predict return a NumPy array or TF Tensor?

By default, they all return numpy arrays. In fact, tf.keras.Model.predict itself seems to always return numpy arrays (docs).

I think this would suffice: ArrayLike = Union[np.ndarray, pd.DataFrame, List[List[Number]])

I agree that something like this is probably better than nothing. I will make a PR to add something along these lines.

adriangb commented 3 years ago

I realize I now opened #211 which is a duplicate of this. They both have useful information, so I won't close either, but they are duplicated.