Value Error in Chapter 2

snehitkrishna commented 5 years ago

I am following the code in the notebook and I am getting a dimension error in cell [81]:

# let's try the full preprocessing pipeline on a few training instances

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))

ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)

But when I try the following code, it works fine:

# let's try the full preprocessing pipeline on a few training instances

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
#some_data_prepared = full_pipeline.transform(some_data)
some_data_prepared = housing_prepared[:5]
print("Predictions:", lin_reg.predict(some_data_prepared))

Please help me understand this.

ageron commented 5 years ago

Hi @snehitkrishna, thanks for your feedback. That's odd; perhaps the lin_reg model was trained on unprepared data? Please make sure that the training set is prepared in exactly the same way as the new data before training the model. If it still does not work, please get the latest version of notebook 2 and run the cells in order. Hope this helps.
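One quick way to check the suggestion above is to compare the number of features the fitted model expects with the number of columns in the prepared data. This is only a minimal sketch on synthetic data: `X_train_prepared` stands in for `housing_prepared`, and `check_feature_count` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for housing_prepared: 100 rows, 16 prepared features.
X_train_prepared = rng.normal(size=(100, 16))
y_train = rng.normal(size=100)

model = LinearRegression().fit(X_train_prepared, y_train)


def check_feature_count(model, X):
    """Hypothetical helper: return (features the model expects, columns in X)."""
    expected = model.coef_.shape[-1]
    actual = X.shape[1]
    return expected, actual


# Rows taken from the prepared training data match the model.
good = check_feature_count(model, X_train_prepared[:5])

# An array with 14 columns, like the one in the error message, does not.
bad = check_feature_count(model, rng.normal(size=(5, 14)))

print("matching:", good)
print("mismatched:", bad)
```

If the two numbers differ, the data passed to predict() was prepared differently from the data the model was trained on.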

biplobmanna commented 5 years ago

Hi @snehitkrishna, I think this might be because some_data has only 5 rows of data. When the cat_pipeline is called, the LabelBinarizer (or whichever encoder you used in your program) inside it returns a NumPy array with different dimensions than when you used the same pipeline on the full data.

In my case:

When I use the entire data, after calling the pipeline, my prepared data has shape (16512, 16).
When I use only 5 rows of data, after calling the pipeline, the prepared data has shape (5, 14).
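The shape difference described above can be reproduced with a small sketch. It assumes a OneHotEncoder and a made-up single categorical column (the category strings are illustrative stand-ins, not the actual housing rows). The column count shrinks only when the encoder is fitted again on the small subset, not when an encoder fitted on the full data is reused.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with 5 distinct values over 7 rows.
categories = np.array([
    ["INLAND"], ["NEAR BAY"], ["INLAND"], ["NEAR OCEAN"],
    ["<1H OCEAN"], ["ISLAND"], ["NEAR BAY"],
])
subset = categories[:3]  # contains only INLAND and NEAR BAY

# Fitting on the full column learns all 5 categories.
full_shape = OneHotEncoder().fit_transform(categories).shape

# Fitting a fresh encoder on the subset learns only 2 categories.
refit_shape = OneHotEncoder().fit_transform(subset).shape

# Reusing the encoder fitted on the full column keeps all 5 columns.
encoder = OneHotEncoder().fit(categories)
reused_shape = encoder.transform(subset).shape

print(full_shape, refit_shape, reused_shape)
```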

ageron commented 5 years ago

Hi @snehitkrishna, make sure you do not call fit() or fit_transform() on some_data, only transform(). Your code example above does the right thing, but perhaps another part of your code fits some_data? If you call fit() or fit_transform() on it, then you may indeed get the problem you mention. You should fit only on the training data, never on other data, and especially not on test data. That's the only reason I can see for the pipeline's output shape being different.

For example, check out this code:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

X_train = np.array([["cat", "apple"], ["dog", "cherry"], ["cat", "cherry"]])
encoder = OneHotEncoder()
X_train_encoded = encoder.fit_transform(X_train)

some_data = np.array([["dog", "apple"]])
some_data_encoded = encoder.transform(some_data)
print(some_data_encoded.todense()) # prints [[0. 1. 1. 0.]]

As you can see, there are 2 + 2 = 4 columns in the output, even though some_data only has a single value for each column. It learned from the training set that both columns can have two values each, which is why it outputs 2 + 2 = 4 columns. Hope this helps.

lt-xu commented 5 years ago

I ran some tests:

some_data = housing.iloc[:6860]
some_data_prepared = all_pipeline.fit_transform(some_data)
print(some_data_prepared.shape)

The output is (6860, 15).

some_data = housing.iloc[:6861]
some_data_prepared = all_pipeline.fit_transform(some_data)
print(some_data_prepared.shape)

The output is (6860, 16).

The 6861st row is where ISLAND first appears, so @biplobmanna is right.

ageron commented 5 years ago

Hi @SparkOfLife,

Thanks for your comment. As I mentioned in my comment above, you should not use fit_transform() with some_data; this is why you are getting an error. Any method with fit in its name (fit(), fit_transform(), fit_predict()) should only be called on the training data, never on validation, test or new data.
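The rule above can be sketched end to end as follows. This is a toy illustration with made-up data, not the notebook's pipeline: the encoder is fitted once on the training data and then only transform() is applied to new data, while fitting a second encoder on the new data produces too few columns for the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Toy training set: one categorical feature with three possible values.
X_train = np.array([["cat"], ["dog"], ["bird"], ["cat"], ["dog"], ["bird"]])
y_train = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])

# Fit the encoder and the model on the training data only.
encoder = OneHotEncoder()
X_train_prepared = encoder.fit_transform(X_train).toarray()
lin_reg = LinearRegression().fit(X_train_prepared, y_train)

# Right: reuse the fitted encoder on new data with transform().
X_new = np.array([["dog"], ["cat"]])
X_new_prepared = encoder.transform(X_new).toarray()  # 3 columns
predictions = lin_reg.predict(X_new_prepared)

# Wrong: fitting a new encoder on the new data yields only 2 columns,
# so predict() raises a ValueError about mismatched shapes or features.
X_new_refit = OneHotEncoder().fit_transform(X_new).toarray()
try:
    lin_reg.predict(X_new_refit)
    refit_failed = False
except ValueError:
    refit_failed = True

print(predictions, refit_failed)
```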

guptas08 commented 5 years ago

Hello,

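I am also getting the same error. I fixed it by using:

from sklearn.preprocessing import LabelBinarizer

class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    # Accept (X, y) as a pipeline passes them, and ignore y
    def fit_transform(self, X, y=None):
        return super(PipelineFriendlyLabelBinarizer, self).fit_transform(X)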
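For context, here is a hedged sketch of what this subclass changes, using made-up category strings. LabelBinarizer.fit_transform() takes a single argument, while a Pipeline passes both X and y, so the override accepts y and ignores it. It only changes the method signature: the number of output columns still depends on which data the binarizer was fitted on.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer


class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    # Accept (X, y) and ignore y, so the two-argument call does not fail.
    def fit_transform(self, X, y=None):
        return super().fit_transform(X)


# Made-up categorical values: 4 distinct classes over 5 rows.
cats = np.array(["INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN", "INLAND"])

binarizer = PipelineFriendlyLabelBinarizer()
full = binarizer.fit_transform(cats, None)  # two-argument call now works

# Reusing the fitted binarizer keeps all 4 columns, even for one row.
reused = binarizer.transform(cats[:1])

# Fitting again on a subset with 3 classes gives only 3 columns.
refit = PipelineFriendlyLabelBinarizer().fit_transform(cats[:3], None)

print(full.shape, reused.shape, refit.shape)
```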

ageron / handson-ml

Value Error in Chapter 2 #347