Closed PurpleBooth closed 2 years ago
We do some preprocessing that the old wrappers don't do, namely calling sklearn's check{X,y} functions, which mirrors what other sklearn estimators do internally and increases compatibility. It looks like one of these doesn't like your string input, but I'm not sure where things are going wrong since there are a lot of layers to work with here (sklearn, scikeras, Keras, tensorflow).
Would you mind providing a minimal reproducible example, ideally with 1 row of input data so that I can debug locally and figure out where things are going wrong?
Here we go, this is the word embeddings tutorial cut down. First we run the model to prove it works with plain keras, then we do the same thing with a grid search to find out whether 15 or 16 epochs is optimal. The relevant error is in the last cell; however, I did spot the error in the second-to-last cell while making something that produced the same error.
https://gist.github.com/PurpleBooth/ee3a528da422b31131b708d36d5d3eb9
Thank you for the extra detail!
So the first thing to note is that in the second-to-last cell you are trying to fit an sklearn grid search estimator with a scikeras model, passing it tensorflow datasets as inputs. This will never work (no matter what scikeras does) because sklearn grid search estimators can't accept tensorflow datasets as inputs. SciKeras may at some point support tensorflow datasets as inputs (see #166), but sklearn never will (although some parts of it may still work if they don't actually touch the data).
I'm going to try to make a small self-contained example for the error in the last cell.
Here's my WIP notebook. I think this reproduces your error, right @PurpleBooth? https://colab.research.google.com/drive/19CuuFhshTpjZ1HGl99sp6mFw7UFHr5F8?usp=sharing
So fundamentally here's the issue: sklearn expects that by the time the input (X) hits the model it's already numeric. You're supposed to do the conversion within a pipeline, before you get to the model. For example, this tutorial.
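To illustrate the "convert within a pipeline" pattern, here is a minimal sketch. It is not the code from the gist or the notebook: the toy texts and labels are made up, `CountVectorizer` is just one possible text-to-numeric step, and `LogisticRegression` is a stand-in for the final estimator so the example runs without tensorflow. The assumption is that a scikeras `KerasClassifier` would take the place of the `"model"` step, with the grid then searching over something like `model__epochs` instead of `model__C`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
texts = [
    "great movie", "loved it", "fantastic film", "really great",
    "terrible movie", "hated it", "awful film", "really terrible",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    # Strings are turned into a numeric matrix here, before the model sees them.
    ("vectorize", CountVectorizer()),
    # Stand-in estimator (assumption): a scikeras KerasClassifier would go here.
    ("model", LogisticRegression()),
])

# The grid search only ever hands raw strings to the pipeline's first step,
# so the final estimator always receives numeric input.
search = GridSearchCV(pipeline, param_grid={"model__C": [0.1, 1.0]}, cv=2)
search.fit(texts, labels)

predictions = search.predict(["great film", "awful movie"])
print(search.best_params_, predictions)
```

Because the vectorizer is part of the pipeline, it is refit on each training fold, so nothing from the validation fold leaks into the vocabulary.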
But obviously you want to convert the data in your tensorflow model, since that's going to be much more efficient.
So there's 2 options here:
Let me know if option 2 works and if you can try it. I don't have time today but I can try to put it into a notebook if that helps.
I had already switched to doing the tokenisation outside the network to get it working, so it's not a big deal, but I think a note in the migration doc is warranted, since it's a breaking change between the two APIs.
Yeah fair enough, I'll keep this issue open until I make that change!
I opened #266, feel free to comment / review that.
Long term, we may support Dataset inputs. But it opens up a whole can of worms I think, so I'm not sure I want to take that on. We'll see.
I have a net that does some string processing as its first step. It works with the deprecated wrapper, but not with this (very nice looking) library.
Traceback
There's also a bunch of related warnings
Version