ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Chapter 2 error during prediction #646

Closed singh-krishan closed 2 years ago

singh-krishan commented 2 years ago

Hi @ageron, I have defined the full pipeline as:

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])

where the num_pipeline is defined as:

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scalar', StandardScaler()),
])

Now, when I execute this code:

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

some_data_prepared = full_pipeline.fit_transform(some_data)
print('Predictions:', lin_reg.predict(some_data_prepared))
print('Labels:', list(some_labels))

I get this error:

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 13 is different from 11)

I noticed that when I pass the whole housing dataset to full_pipeline, ocean_proximity is transformed into 5 different columns, giving a total of 13 fields. But when I pass only a subset of the dataset (i.e. housing.iloc[:5]), the transformation is not applied to the ocean_proximity column.

Any suggestions on what could be wrong?

Thanks a lot

ageron commented 2 years ago

Hi @singh-krishan,

Thanks for your question. Make sure you fit estimators only on the training data: call fit(), fit_transform() or fit_predict() only on the training set, never on other data (such as the validation set, the test set, or new data). In your code, you should therefore replace full_pipeline.fit_transform(some_data) with full_pipeline.transform(some_data). Before you do that, though, the pipeline must first be fit on the full training set. So the code should look like this:

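# Fit the pipeline once, on the full training set
housing_prepared = full_pipeline.fit_transform(housing)
# Then only transform (never re-fit) any other data
some_data_prepared = full_pipeline.transform(some_data)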

In the full training set, there are 5 distinct values in the ocean_proximity column. That's why, once full_pipeline is fit on the training set, it outputs a one-hot vector of size 5 for each ocean_proximity value. But if some_data is small enough, it is likely to contain fewer categories, so re-fitting on it produces fewer one-hot columns. That is what you observed: getting 11 columns instead of 13 means your five rows contain only 3 of the 5 categories. If you call only transform(some_data) and not fit_transform(some_data), the pipeline keeps the 5 categories it learned from the training set and outputs one-hot vectors of size 5.
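To make the difference concrete, here is a minimal sketch that isolates the OneHotEncoder step. The arrays train_cat and subset_cat are made-up stand-ins for the ocean_proximity column, not the actual housing data, and the rest of the pipeline is left out on purpose.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the ocean_proximity column of the training set
# (assumption: 5 distinct categories, as in the thread above).
train_cat = np.array([
    ["<1H OCEAN"], ["INLAND"], ["NEAR OCEAN"], ["NEAR BAY"], ["ISLAND"],
    ["INLAND"], ["<1H OCEAN"],
])

# A small subset that happens to contain only 3 of the 5 categories.
subset_cat = train_cat[:3]

# Re-fitting on the subset: the encoder only learns the categories it sees.
refit = OneHotEncoder().fit_transform(subset_cat).toarray()

# Fitting once on the training data, then only transforming the subset:
# the encoder keeps all 5 categories it learned during fit().
encoder = OneHotEncoder()
encoder.fit(train_cat)
reused = encoder.transform(subset_cat).toarray()

print(refit.shape)   # (3, 3)
print(reused.shape)  # (3, 5)
```

The same reasoning applies to the other steps: re-fitting on a few rows would also recompute the imputer's medians and the scaler's statistics from those rows alone, so the features would no longer match what lin_reg was trained on.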

Hope this helps.

singh-krishan commented 2 years ago

Thanks @ageron, makes sense.