adriangb / scikeras

Scikit-Learn API wrapper for Keras.
https://www.adriangb.com/scikeras/
MIT License

Can't pass strings as inputs #265

Closed PurpleBooth closed 2 years ago

PurpleBooth commented 2 years ago

I have a net that does some string processing as its first step. That works with the deprecated wrapper, but not with this (very nice looking) library.

model = Sequential(
  [
      layers.Input(shape=(1,), dtype=tf.string),
      text_vectorizer,
      layers.Embedding(max_features + 1, 50 * multiplier),
      layers.Dropout(.1),
      layers.GlobalAveragePooling1D(),
      layers.Dropout(.1),
      layers.Dense(20, kernel_initializer=initializers.random_uniform, activation=activations.swish),
  ]
)
Traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/sklearn/utils/validation.py:787, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    786 try:
--> 787     array = array.astype(np.float64)
    788 except ValueError as e:

ValueError: could not convert string to float: 'two fires american indians civil war laurence hauptman reveals several hundred thousand indians affected civil war twenty thousand indians enlisted sides attempt gain legitimacy autonomy simply land'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Input In [28], in 
      9 classifier = KerasClassifier(X=feature.values.astype(np.str_), y=label.values, model=make_model, batch_size=-1, validation_split=.2, verbose=1, sample_weight=1, )
     10 grid = GridSearchCV(
     11     estimator=classifier,
     12     param_grid={},
     13     verbose=1,
     14 )
---> 15 grid_result = grid.fit(feature.values.astype(np.str_), label.values, callbacks=[callback], verbose=1)
     16 history = grid_result.history_
     18 grid_result

File ~/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/sklearn/model_selection/_search.py:926, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    924 refit_start_time = time.time()
    925 if y is not None:
--> 926     self.best_estimator_.fit(X, y, **fit_params)
    927 else:
    928     self.best_estimator_.fit(X, **fit_params)

File ~/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/scikeras/wrappers.py:523, in BaseWrapper.fit(self, X, y, sample_weight, warm_start, **kwargs)
    520 else:
    521     # No warm start requested
    522     reset = True
--> 523 X, y = self._validate_data(X=X, y=y, reset=reset)
    525 if sample_weight is not None:
    526     sample_weight = _check_sample_weight(
    527         sample_weight, X, dtype=["float64", "int"]
    528     )

File ~/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/scikeras/wrappers.py:383, in BaseWrapper._validate_data(self, X, y, reset)
    363 """Validate input data and set or check the `n_features_in_` attribute.
    364 Parameters
    365 ----------
   (...)
    380     The validated input. A tuple is returned if `y` is not None.
    381 """
    382 if y is not None:
--> 383     X, y = check_X_y(
    384         X,
    385         y,
    386         allow_nd=True,  # allow X to have more than 2 dimensions
    387         multi_output=True,  # allow y to be 2D
    388     )
    389 X = check_array(X, allow_nd=True, dtype=["float64", "int"])
    391 n_features = X.shape[1]

File ~/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/sklearn/utils/validation.py:964, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    961 if y is None:
    962     raise ValueError("y cannot be None")
--> 964 X = check_array(
    965     X,
    966     accept_sparse=accept_sparse,
    967     accept_large_sparse=accept_large_sparse,
    968     dtype=dtype,
    969     order=order,
    970     copy=copy,
    971     force_all_finite=force_all_finite,
    972     ensure_2d=ensure_2d,
    973     allow_nd=allow_nd,
    974     ensure_min_samples=ensure_min_samples,
    975     ensure_min_features=ensure_min_features,
    976     estimator=estimator,
    977 )
    979 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
    981 check_consistent_length(X, y)

File ~/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/sklearn/utils/validation.py:789, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    787         array = array.astype(np.float64)
    788     except ValueError as e:
--> 789         raise ValueError(
    790             "Unable to convert array of bytes/strings "
    791             "into decimal numbers with dtype='numeric'"
    792         ) from e
    793 if not allow_nd and array.ndim >= 3:
    794     raise ValueError(
    795         "Found array with dim %d. %s expected <= 2."
    796         % (array.ndim, estimator_name)
    797     )

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

There's also a bunch of related warnings

/Users/billie/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/sklearn/utils/validation.py:964: FutureWarning: Arrays of bytes/strings is being converted to decimal numbers if dtype='numeric'. This behavior is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26). Please convert your data to numeric values explicitly instead.
  X = check_array(
/Users/billie/Library/Caches/pypoetry/virtualenvs/tensorflow-part-2-JIKxCiSF-py3.10/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning: 
5 fits failed out of a total of 5.

Version

scikeras 0.6.0 Scikit-Learn API wrapper for Keras.

adriangb commented 2 years ago

We do some preprocessing that the old wrappers don't do, namely calling sklearn's check_X_y and check_array validation functions, which mirrors what other sklearn estimators do internally and increases compatibility. It looks like one of these doesn't like your string input, but I'm not sure exactly where things are going wrong, since there are a lot of layers involved here (sklearn, scikeras, Keras, tensorflow).
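The rejection can be reproduced without Keras at all. The following is a minimal sketch, assuming only NumPy and scikit-learn are installed; the sample strings are made up, and the arguments mirror the check_array call visible in the traceback above:

```python
import numpy as np
from sklearn.utils.validation import check_array

# Made-up sample data: a 2D column of raw text, shaped like the X in this issue.
X = np.array([["two fires american indians"], ["civil war"]])

try:
    # Same arguments as the check_array call shown in the traceback.
    check_array(X, allow_nd=True, dtype=["float64", "int"])
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"check_array rejected the strings: {exc}")
```

The exact error message differs between scikit-learn versions, so the sketch only relies on a ValueError being raised.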

Would you mind providing a minimally reproducible example, ideally with 1 row of input data so that I can debug locally and figure out where things are going wrong?

PurpleBooth commented 2 years ago

Here we go, this is the word embeddings tutorial cut down. First we run the model to prove it works with plain keras, then we do the same thing with a grid search attempting to find whether 15 or 16 epochs is optimal. The relevant error is in the last cell; however, I did spot the error in the second to last cell while making something that produced the same error.

https://gist.github.com/PurpleBooth/ee3a528da422b31131b708d36d5d3eb9

adriangb commented 2 years ago

Thank you for the extra detail!

So the first thing to note is that in the second to last cell you are trying to fit an sklearn grid search estimator with a scikeras model, passing it tensorflow datasets as inputs. This will never work (no matter what scikeras does) because sklearn grid search estimators can't accept tensorflow datasets as inputs. SciKeras may at some point support tensorflow datasets as inputs (see #166 ) but sklearn never will (although some parts of it may still work if they don't actually touch the data).
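One way to see why: the grid search has to slice its inputs into cross-validation folds, so it first passes them through sklearn.utils.indexable. Below is a rough sketch, using a plain Python generator as a hypothetical stand-in for a streaming dataset. It is an analogy only and does not exercise tf.data itself:

```python
from sklearn.utils import indexable


def stream_of_rows():
    """Hypothetical streaming source: yields rows, has no len() and no indexing."""
    yield from (["some text"], ["more text"])


try:
    # sklearn needs something it can measure and slice into folds.
    indexable(stream_of_rows())
    accepted = True
except TypeError as exc:
    accepted = False
    print(f"not indexable: {exc}")
```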

I'm going to try to make a small self-contained example for the error in the last cell.

adriangb commented 2 years ago

Here's my WIP notebook. I think this reproduces your error right @PurpleBooth ? https://colab.research.google.com/drive/19CuuFhshTpjZ1HGl99sp6mFw7UFHr5F8?usp=sharing

adriangb commented 2 years ago

So fundamentally here's the issue: sklearn expects that by the time the input (X) hits the model it's already numeric. You're supposed to do the conversion within a pipeline, before you get to the model. For example, this tutorial.

But obviously you want to convert the data in your tensorflow model, that's going to be much more efficient.

So there's 2 options here:

  1. Remove the restriction that input data be numeric. I'm not keen on this since it goes against the sklearn API.
  2. Split your Keras model up into two models, one which does the tokenization and another which does the neural net part. Then you'd use the tokenization layer as an sklearn transformer in a pipeline and the neural net as your model. I think the outcome would be the same, but you should verify by comparing to Keras. The con here is performance: you now have 2 models instead of one and you're doing more stuff in Python and less in TensorFlow. I think there would be a negative performance impact, but I'm not totally sure.
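The shape of option 2 can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the code from this issue: CountVectorizer is swapped in for the Keras text vectorization layer, LogisticRegression stands in for the numeric-only neural net (a scikeras KerasClassifier built without the text layer would take its place), and the toy data is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Made-up toy data: raw strings in, binary labels out.
texts = np.array([
    "good film", "great film", "fine plot", "good plot",
    "bad film", "awful film", "poor plot", "bad plot",
])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

pipeline = Pipeline([
    # Step 1: strings -> numeric features, done before the model sees X.
    ("vectorize", CountVectorizer()),
    # Step 2: stand-in for the numeric-only model (assumption: the
    # KerasClassifier without its text layer would go here instead).
    ("model", LogisticRegression()),
])

# The grid search only ever hands strings to the pipeline, never to the model.
grid = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.1, 1.0]},
    cv=StratifiedKFold(n_splits=2),
)
grid.fit(texts, labels)
predictions = grid.predict(np.array(["good film"]))
```

Because the vectorizer sits inside the pipeline, it is refit on each cross-validation fold, so the vocabulary is never learned from held-out data.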

Let me know if option 2 works and if you can try it. I don't have time today but I can try to put it into a notebook if that helps.

PurpleBooth commented 2 years ago

I had already switched to doing the tokenisation outside the network to get it working, so it's not a big deal, but I think a note in the migration doc is warranted, since it's a breaking change between the two APIs.

adriangb commented 2 years ago

Yeah fair enough, I'll keep this issue open until I make that change!

adriangb commented 2 years ago

I opened #266, feel free to comment / review that.

Long term, we may support Dataset inputs. But I think it opens up a whole can of worms, so I'm not sure I want to take that on. We'll see.