ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

Linear regression example in 2nd Edition book using unprocessed training data #25

Open jsukup opened 5 years ago

jsukup commented 5 years ago

It appears that the data used to test the trained linear regression model on page 75 of the 2nd edition of "Hands-On..." is the unprocessed housing data frame. If the model was trained with housing_prepared, shouldn't the examples (i.e. some_data = housing.iloc[:5]) use the processed data set as well (i.e. some_data = housing_prepared[:5])?

ageron commented 5 years ago

Hi @jsukup , thanks for your question.

Are you referring to this code example?

>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.6045  317768.8069  210956.4333  59218.9888  189747.5584]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

If so, then notice that it does prepare the data (full_pipeline.transform(some_data)) before it uses the trained model to make predictions (lin_reg.predict(some_data_prepared)).

Hope this helps, Aurélien

huang-jl commented 4 years ago

@ageron Hi! Testing on my own laptop, some_data_prepared (after full_pipeline.transform(some_data)) only contains three different categories, which doesn't match the number of features the linear model expects.
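
For reference, a quick way to see this kind of mismatch (a hypothetical check, using the same variable names as the notebook) is to compare the number of features the trained model expects with the number of columns the pipeline produced:

# The linear model has one coefficient per feature it was trained on,
# so these two numbers must be consistent.
print(lin_reg.coef_.shape)        # number of features the model expects
print(some_data_prepared.shape)   # (rows, columns) produced by the pipeline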

ageron commented 4 years ago

Hi @huang-jl ,

I can see only two explanations:

1) Perhaps your full_pipeline was fitted on a part of the dataset that only contained three different categories. Instead, the pipeline should be fitted on the full training set (as in the book and the notebook), like in this cell:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# num_pipeline is the numerical pipeline defined earlier in the notebook
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

2) Perhaps you are calling full_pipeline.fit_transform(some_data) instead of full_pipeline.transform(some_data)? If so, just replace fit_transform() with transform(): the pipeline should only be fitted on the training set, and then used to transform any other data.
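
To make the distinction concrete, here is a minimal sketch (not a quote from the book; it assumes housing, full_pipeline and lin_reg are defined as in the notebook):

# Fit the pipeline once, on the full training set. This is where the
# OneHotEncoder learns all the ocean_proximity categories.
housing_prepared = full_pipeline.fit_transform(housing)

some_data = housing.iloc[:5]

# Correct: reuse the already-fitted pipeline.
some_data_prepared = full_pipeline.transform(some_data)

# Incorrect: re-fitting on just 5 rows makes the encoder forget any category
# that does not appear in those rows, so the output has too few columns.
# some_data_prepared = full_pipeline.fit_transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))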

Hope this helps.

qingchuanzhu commented 4 years ago

I also ran into the same problem: some_data_prepared only has 3 categories instead of 5 when I first execute predict(some_data_prepared).

full_pipeline.named_transformers_['cat'].categories_ lists only 3 categories.

However, after I ran the cell mentioned above again, the issue was resolved without any code change: OneHotEncoder now learns that there are 5 categories and predict works.

This is super weird though... maybe an internal bug in sklearn.
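
For anyone hitting this, a quick sanity check (a sketch, assuming the notebook's variable names) after re-fitting the pipeline on the full training set:

# Re-fit on the full training set, then confirm the encoder saw all categories.
housing_prepared = full_pipeline.fit_transform(housing)

cat_encoder = full_pipeline.named_transformers_["cat"]
print(cat_encoder.categories_)    # should list all 5 ocean_proximity values
print(housing_prepared.shape[1])  # total number of prepared feature columns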

Aliiiu commented 4 years ago

[screenshot attachment: 20201024_105137]

I'm also having this same problem, just before this code.

Jeremiah004 commented 2 years ago

Hi, on page 75 of the second edition of the book, I am having a problem loading the dataset after writing the code for downloading it.