Chapter 2. Transformation Pipelines. Value Error

mattbche commented 3 years ago

I tried to run the transformation pipeline in Chapter 2 with no success. I do not know why I am getting a value error. Any help would be appreciated.

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

    housing_num_tr = num_pipeline.fit_transform(w)

And here is the result:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-3a3099ac60f9> in <module>
      4 num_pipeline = Pipeline([(('imputer'), SimpleImputer(strategy="median")),('attribs_adder',CombinedAttributesAdder()),('std_scaler', StandardScaler()),])
      5 
----> 6 housing_num_tr = num_pipeline.fit_transform(w)

~/env/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    376         """
    377         fit_params_steps = self._check_fit_params(**fit_params)
--> 378         Xt = self._fit(X, y, **fit_params_steps)
    379 
    380         last_step = self._final_estimator

~/env/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    305                 message_clsname='Pipeline',
    306                 message=self._log_message(step_idx),
--> 307                 **fit_params_steps[name])
    308             # Replace the transformer of the step with the fitted
    309             # transformer. This is necessary when loading the transformer

~/env/lib/python3.6/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~/env/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    752     with _print_elapsed_time(message_clsname, message):
    753         if hasattr(transformer, 'fit_transform'):
--> 754             res = transformer.fit_transform(X, y, **fit_params)
    755         else:
    756             res = transformer.fit(X, y, **fit_params).transform(X)

~/env/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    697         if y is None:
    698             # fit method of arity 1 (unsupervised transformation)
--> 699             return self.fit(X, **fit_params).transform(X)
    700         else:
    701             # fit method of arity 2 (supervised transformation)

<ipython-input-43-6189961e6df8> in transform(self, X, y)
     13         population_per_household = X[:, population_ix] / X[:, household_ix]
     14         if self.add_bedrooms_per_room:
---> 15             bedrooms_per_room = X[:,bedrooms_ix] / X[:rooms_ix]
     16             return np.c_[X,rooms_per_household, population_per_household, bedrooms_per_room]
     17         else:

ValueError: operands could not be broadcast together with shapes (20640,) (3,13)

ageron commented 3 years ago

Hi @mattbche,

Thanks for your question. It looks like there's a typo in your definition of the CombinedAttributesAdder class: a comma is missing in the index of the second operand on the line where the error occurs. Instead of:

    bedrooms_per_room = X[:,bedrooms_ix] / X[:rooms_ix]

It should be:

    bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

Here's the full class:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    # column indices
    rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
            self.add_bedrooms_per_room = add_bedrooms_per_room
        def fit(self, X, y=None):
            return self  # nothing else to do
        def transform(self, X):
            # X[:, ix] selects one column across all rows
            rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
            population_per_household = X[:, population_ix] / X[:, households_ix]
            if self.add_bedrooms_per_room:
                bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
                return np.c_[X, rooms_per_household, population_per_household,
                             bedrooms_per_room]
            else:
                return np.c_[X, rooms_per_household, population_per_household]
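
If it helps, the corrected indexing can be sanity-checked on a tiny array before running the whole pipeline. This is only a sketch: the function add_combined_attributes and the array X_demo are hypothetical stand-ins made up for illustration (they are not in the book), and the function just mirrors the transform() logic with plain NumPy.

```python
import numpy as np

# Column indices as used in the thread
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

def add_combined_attributes(X, add_bedrooms_per_room=True):
    # Hypothetical helper mirroring the transform() logic, NumPy only
    rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
    population_per_household = X[:, population_ix] / X[:, households_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    return np.c_[X, rooms_per_household, population_per_household]

# Made-up data: 2 rows, 8 columns (only columns 3 to 6 matter here)
X_demo = np.array([
    [0.0, 0.0, 0.0, 10.0, 2.0, 30.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 20.0, 5.0, 40.0, 4.0, 0.0],
])

out = add_combined_attributes(X_demo)
print(out.shape)    # three new columns appended to the original eight
print(out[:, -3:])  # the three derived ratios for each row
```

With the comma in place, each ratio is computed row by row, so the output keeps the same number of rows as the input and gains three columns.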

FYI, here's the process I went through to find this bug:

1. First, I looked at the error message at the bottom of the stack trace: "ValueError: operands could not be broadcast together with shapes (20640,) (3,13)". This tells me that an operation is failing because the shapes of its two operands are incompatible.
2. In the stack trace, I searched (starting from the bottom) for the first piece of code that's not part of a library. You want to look for bugs in your own code before looking for bugs in libraries, since the error is usually in your own code. In this case, it's the line bedrooms_per_room = X[:,bedrooms_ix] / X[:rooms_ix]. That's where the error is happening.
3. Since I know an operation between two operands is failing, I look for the operations on that line and see that there's a division. So the shape of X[:, bedrooms_ix] must not be compatible with the shape of X[:rooms_ix].
4. From there, it's easy to spot the typo in the second operand: X[:rooms_ix] selects the first 3 rows (shape (3, 13)) instead of column 3 (shape (20640,)).
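
The shape mismatch from the process above can be reproduced in isolation. This is a minimal sketch, assuming an array with the same shape as in the error message (20640 rows, 13 columns); the array of ones is a made-up stand-in for the real data, not the housing dataset itself.

```python
import numpy as np

rooms_ix, bedrooms_ix = 3, 4  # column indices from the thread

# Stand-in array with the shape implied by the error message
X = np.ones((20640, 13))

good = X[:, rooms_ix]  # with the comma: column 3 of every row
bad = X[:rooms_ix]     # without the comma: the first 3 rows, all columns

print(good.shape)
print(bad.shape)

try:
    X[:, bedrooms_ix] / X[:rooms_ix]
except ValueError as err:
    message = str(err)
    print(message)  # the same broadcasting error as in the traceback
```

Printing the shapes of both operands right before a failing operation is often the quickest way to see which one is wrong.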

I hope this will help you debug future errors!

Closing this issue, but feel free to reopen it if the problem persists. Please make sure you're using the exact same code as in the book. You can check by looking at the notebooks in this project.

Cheers!

mattbche commented 3 years ago

Thank you for the rapid response!

ageron / handson-ml

Chapter 2. Transformation Pipelines. Value Error #619