Open snehitkrishna opened 5 years ago
Hi @snehitkrishna,
Thanks for your feedback. That's odd, perhaps the lin_reg model was trained on unprepared data? Please make sure that the training set is prepared in exactly the same way as the new data before training the model. If it still does not work, please get the latest version of notebook 2 and run the cells in order. Hope this helps.
Hi @snehitkrishna
I think this might be due to the fact that some_data has only 5 rows of data.
So, when the cat_pipeline is called, and inside it LabelBinarizer (or whichever encoder you have used in your program) is called, it returns a NumPy array with different dimensions than when you used the same pipeline with the full data before.
In my case:
(16512, 16)
(5, 14)
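A minimal sketch of how such a shape mismatch can arise, assuming the encoder is refitted on the small slice. The toy category values and the variable names full_data, full_encoder and slice_encoder are made up for illustration and are not taken from the notebook:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for a categorical column with three categories.
full_data = np.array([["INLAND"], ["NEAR BAY"], ["ISLAND"], ["INLAND"], ["NEAR BAY"]])
some_data = full_data[:2]  # a small slice that happens to miss "ISLAND"

# Fitting on the full data learns all three categories.
full_encoder = OneHotEncoder()
full_encoded = full_encoder.fit_transform(full_data)

# Refitting on the slice learns only the categories present in the slice.
slice_encoder = OneHotEncoder()
slice_encoded = slice_encoder.fit_transform(some_data)

# Reusing the encoder fitted on the full data keeps the column count stable.
slice_transformed = full_encoder.transform(some_data)

print(full_encoded.shape)       # (5, 3)
print(slice_encoded.shape)      # (2, 2)
print(slice_transformed.shape)  # (2, 3)
```

The column count depends on which categories the encoder saw during fitting, not on the data being transformed.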
Hi @snehitkrishna,
Make sure you do not call fit() or fit_transform() on some_data, only transform() (your code example above does the right thing, but perhaps another part of your code fits some_data?). If you call fit() or fit_transform() then indeed you might get the problem you mention. You only want to fit the training data, no other data, especially not test data. That's the only reason I can see for the pipeline's output shape being different.
For example, check out this code:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

X_train = np.array([["cat", "apple"], ["dog", "cherry"], ["cat", "cherry"]])
encoder = OneHotEncoder()
# Fit on the training data only; fit_transform() returns the encoded matrix
X_train_encoded = encoder.fit_transform(X_train)

some_data = np.array([["dog", "apple"]])
# Only transform() the new data, never fit() it
some_data_encoded = encoder.transform(some_data)
print(some_data_encoded.todense()) # prints [[0. 1. 1. 0.]]
As you can see, there are 2 + 2 = 4 columns in the output, even though some_data only has a single value for each column. The encoder learned from the training set that both columns can have two values each, which is why it outputs 2 + 2 columns.
Hope this helps.
Here are some tests.
some_data = housing.iloc[:6860]
some_data_prepared = all_pipeline.fit_transform(some_data)
print(some_data_prepared.shape)
output is (6860, 15)
some_data = housing.iloc[:6861]
some_data_prepared = all_pipeline.fit_transform(some_data)
print(some_data_prepared.shape)
output is (6861, 16)
And this is ISLAND's first appearance. @biplobmanna is right
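One way to check where a category first shows up is sketched below. It assumes the categorical column is named ocean_proximity and uses a made-up five-row DataFrame rather than the real housing data, so the positions it prints are illustrative only:

```python
import pandas as pd

# Hypothetical miniature of the housing data; the column name
# ocean_proximity and these rows are assumptions for illustration.
housing = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND", "NEAR BAY"],
})

# Positional index of the first row in which each category appears.
first_positions = {
    category: int((housing["ocean_proximity"] == category).to_numpy().argmax())
    for category in housing["ocean_proximity"].unique()
}
print(first_positions)

# Number of distinct categories visible in the first n rows.
categories_in_first_3 = housing["ocean_proximity"].iloc[:3].nunique()
categories_in_first_4 = housing["ocean_proximity"].iloc[:4].nunique()
print(categories_in_first_3, categories_in_first_4)
```

A slice that ends before a category's first row gives an encoder fitted on that slice one column fewer.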
Hi @SparkOfLife,
Thanks for your comment. As I mentioned in my comment above, you should not use fit_transform() with some_data, this is why you are getting an error. Any method with fit in its name (fit(), fit_transform(), fit_predict()) should only be called on the training data, never on validation, test or new data.
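As a rough sketch of that workflow (not the notebook's actual code), the example below fits a preprocessing step and a model on a toy training set, then only calls transform() and predict() on some_data. The column names rooms and proximity, the values, and the name prep_pipeline are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training set: one numeric and one categorical column.
train = pd.DataFrame({
    "rooms": [3.0, 5.0, 4.0, 6.0, 2.0, 7.0],
    "proximity": ["INLAND", "NEAR BAY", "ISLAND", "INLAND", "NEAR BAY", "INLAND"],
})
train_labels = [100.0, 220.0, 300.0, 180.0, 150.0, 210.0]

prep_pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["rooms"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])

# fit_transform() is called on the training data only.
train_prepared = prep_pipeline.fit_transform(train)
lin_reg = LinearRegression()
lin_reg.fit(train_prepared, train_labels)

# New data is only transformed, so it gets the same columns as the training set.
some_data = train.iloc[:2]
some_data_prepared = prep_pipeline.transform(some_data)
predictions = lin_reg.predict(some_data_prepared)

print(train_prepared.shape, some_data_prepared.shape)
```

Both prepared arrays have the same number of columns, so predict() accepts the new data without a dimension error.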
Hello,
I am also getting the same error. I fixed it by using
class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(PipelineFriendlyLabelBinarizer, self).fit_transform(X)
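For illustration, the wrapper above might be used in a pipeline as follows. This is a sketch under assumptions: the category values and the names train_cat and new_cat are made up, and the pipeline has only this one step. It still follows the rule of fitting on training data and only transforming new data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    # Pipeline passes (X, y) to fit_transform(); LabelBinarizer expects one argument.
    def fit_transform(self, X, y=None):
        return super().fit_transform(X)


# Hypothetical categorical values standing in for a training column and new data.
train_cat = np.array(["INLAND", "NEAR BAY", "ISLAND", "INLAND"])
new_cat = np.array(["NEAR BAY", "INLAND"])

cat_pipeline = Pipeline([("binarizer", PipelineFriendlyLabelBinarizer())])

# Fit on the training values only, then transform the new values.
train_encoded = cat_pipeline.fit_transform(train_cat)
new_encoded = cat_pipeline.transform(new_cat)

print(train_encoded.shape, new_encoded.shape)
```

The wrapper only changes the method signature; calling fit_transform() on a small slice would still produce fewer columns if the slice is missing a category.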
I am following the code from the notebook and I am getting a dimension error in cell [81]:
But when I try the following code, it works fine:
Please help me understand this.