ageron / handson-ml3

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

[Bug] Chpt2: Issue while fetching the output of preprocessing.get_feature_names_out() #42

Closed (allosharma closed this issue 1 year ago)

allosharma commented 1 year ago

Describe the bug

The error occurs when creating the final preprocessing pipeline, just before the "Select and Train a Model" topic. preprocessing.get_feature_names_out() should return the names of all the features produced by the preprocessing pipeline. Instead, it raises AttributeError: Transformer geo (type ClusterSimilarity) does not provide get_feature_names_out.

To Reproduce

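import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# ClusterSimilarity and cat_pipeline are not defined in this snippet.

def column_ratio(X):
    return X[:, [0]] / X[:, [1]]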

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy='median'),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler()
    )

log_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    FunctionTransformer(np.log, feature_names_out='one-to-one'),
    StandardScaler()
)

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
default_num_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
    )

preprocessing = ColumnTransformer([
    ('bedrooms', ratio_pipeline(), ['total_bedrooms', 'total_rooms']),
    ('rooms_per_house', ratio_pipeline(), ['total_rooms', 'households']),
    ('people_per_house', ratio_pipeline(), ['population', 'households']),
    ('log', log_pipeline, ['total_bedrooms', 'total_rooms', 'population', 'households', 'median_income']),
    ('geo', cluster_simil, ["latitude", "longitude"]),
    ('cat', cat_pipeline, make_column_selector(dtype_include=object)),
], remainder=default_num_pipeline)  # one column remaining: housing_median_age

# This is where the issue happens
preprocessing.get_feature_names_out()

And if you got an exception, please copy the full stacktrace here:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[119], line 1
----> 1 preprocessing.get_feature_names_out()

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\compose\_column_transformer.py:511, in ColumnTransformer.get_feature_names_out(self, input_features)
    509 transformer_with_feature_names_out = []
    510 for name, trans, column, _ in self._iter(fitted=True):
--> 511     feature_names_out = self._get_feature_name_out_for_transformer(
    512         name, trans, column, input_features
    513     )
    514     if feature_names_out is None:
    515         continue

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\compose\_column_transformer.py:479, in ColumnTransformer._get_feature_name_out_for_transformer(self, name, trans, column, feature_names_in)
    477 # An actual transformer
    478 if not hasattr(trans, "get_feature_names_out"):
--> 479     raise AttributeError(
    480         f"Transformer {name} (type {type(trans).__name__}) does "
    481         "not provide get_feature_names_out."
    482     )
    483 return trans.get_feature_names_out(names)

AttributeError: Transformer geo (type ClusterSimilarity) does not provide get_feature_names_out.

Expected behavior

It should return the list of names of all the features in the pipeline.

ageron commented 1 year ago

Hi @allosharma ,

Thanks for your feedback. Could you please copy/paste the definition of the ClusterSimilarity class from your notebook here? It should contain a get_feature_names_out() method. The error message says that it doesn't, but it's supposed to. If you read page 82 in the book, or look at cell [100] in the notebook, you should see it. Perhaps there's a typo in your notebook? Hope this helps.
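One quick way to check for this kind of mismatch is to look for attribute names that nearly match the expected one. The following is only a sketch using the standard library: the `ClusterSimilarity` class here is a bare stand-in for the notebook's class, and `find_near_misses` is a hypothetical helper, not part of Scikit-Learn.

```python
import difflib


class ClusterSimilarity:
    """Stand-in for the notebook's class (assumption: only the method name matters here)."""

    def get_features_names_out(self, names=None):  # note the extra "s"
        return []


def find_near_misses(obj, expected="get_feature_names_out"):
    """Return attribute names that resemble `expected` when `expected` itself is missing.

    Hypothetical helper for illustration only.
    """
    if hasattr(obj, expected):
        return []  # the expected method exists, nothing to report
    return difflib.get_close_matches(expected, dir(obj), n=3, cutoff=0.8)


near = find_near_misses(ClusterSimilarity())
print(near)
```

If the list is non-empty, the class probably defines the method under a misspelled name, which is exactly the situation the AttributeError above points to.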

allosharma commented 1 year ago

Hi @ageron, thank you so much for the response. Yes, I found the typo: I misspelled the method name when defining it.

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self #always return self

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_features_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

The method should have been named get_feature_names_out (feature, not features).