ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

Chapter 5, Exercise 10 #341


pascalgottret commented 3 years ago

Hello,

To complete this exercise, I plugged LinearSVR directly into the Chapter 2 code. As data I used housing_prepared and as labels I used housing_labels. In the end I just trained the SVM regressor like this:

from sklearn.svm import LinearSVR

svm_reg = LinearSVR(random_state=42)
svm_reg.fit(housing_prepared, housing_labels)

The RMSE is unusually high (compared to the Random Forest and Linear Regression models):

import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = svm_reg.predict(housing_prepared)
svm_reg_mse = mean_squared_error(housing_labels, housing_predictions)
svm_reg_rmse = np.sqrt(svm_reg_mse)
svm_reg_rmse

218339.15956036837

Why is it underfitting the data so badly? And why is the error in Exercise 10 so much lower, although the data should be more or less the same? (housing_prepared is additionally scaled, uses the imputer and the OneHotEncoder for ocean_proximity, and adds some attributes.)

Thanks

ashishthanki commented 3 years ago

The question asks you to train a Support Vector Regressor (SVR), NOT a LinearSVR. There is a significant difference between the two.

To put it simply, imagine fitting a line of best fit to predict house prices. We would not expect house prices to have a linear relationship with any single feature we have. Concretely, if the number of bedrooms increased, would you expect the house price to increase linearly too? Or if the house area increased, would the price increase linearly?

Sure, there is a positive relationship between these features and the price, but is it linear? Definitely not.

If you do fit the data with an SVR - don't forget to import it first with from sklearn.svm import SVR - you should see a better model.
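
For instance, a minimal sketch - assuming the Chapter 2 variables housing_prepared and housing_labels are in scope, and with purely illustrative hyperparameters:

from sklearn.svm import SVR

# The default RBF kernel lets the model capture non-linear
# relationships between the features and the house prices
svm_reg = SVR(kernel="rbf", C=100, gamma="scale")
svm_reg.fit(housing_prepared, housing_labels)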

Also, the large RMSE you got is a great example of poor model selection: assuming the data is linear when it is actually more complex. Everyone has tried to fit a poor model before (me included! 😃)

To put it in more technical language, this is a prime example of the bias/variance trade-off: the LinearSVR's generalization error is due to incorrect assumptions about the data, causing the model to underfit the training data.

P.S. Don't forget to split your data into training, validation and test sets - the validation set, amongst many other things, is there so you can eliminate models that have large bias errors such as this one.
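
For example, a quick way to hold out a validation set (a sketch using Scikit-Learn's train_test_split; the 80/20 split is arbitrary):

from sklearn.model_selection import train_test_split

# Keep a validation set aside so candidate models can be
# compared on data they were not trained on
X_train, X_val, y_train, y_val = train_test_split(
    housing_prepared, housing_labels, test_size=0.2, random_state=42)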

pascalgottret commented 3 years ago

Many thanks for your explanation, although now I am even more confused :) What is the difference? I thought that an SVR can be both linear and polynomial.

In Chapter 5, in the SVM Regression section, the book says "You can use Scikit-Learn's LinearSVR class to perform linear SVM Regression", and in the solution to this exercise the author used LinearSVR(random_state=42) as well.

Also, when using the training data as stated in the solutions, I got excellent results. I only got these bad results when using the prepared data from the chapter. This I do not understand.

To put it in more technical language, this is a prime example of the bias/variance trade-off: the LinearSVC's generalization error is due to incorrect assumptions about the data, causing the model to underfit the training data.

But I did not use the classifier; I used LinearSVR - the regressor.

Many thanks

ashishthanki commented 3 years ago

Support Vector Regressors can be both linear and polynomial, but the LinearSVR is linear only.

the solution of this exercise the author used LinearSVR(random_state=42) as well.

He later used the SVR, which outperformed the LinearSVR.
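
For reference, the two models side by side (a sketch with illustrative hyperparameters):

from sklearn.svm import LinearSVR, SVR

lin_svr = LinearSVR(random_state=42)               # linear decision function only
poly_svr = SVR(kernel="poly", degree=2, C=100)     # polynomial kernel: can fit curves
rbf_svr = SVR(kernel="rbf", C=100, gamma="scale")  # RBF kernel: very flexible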

prepared data from the chapter.

Has your prepared data been scaled, and is the preprocessing performed using a ColumnTransformer and a Pipeline?

This I do not understand.

The bias/variance trade-off is covered in the book 👍

But I did not use the classifier.

Oops, that was a typo; it has been corrected now 😃

pascalgottret commented 3 years ago

All right, thank you very much for your kind help.

Has your prepared data been scaled, and is the preprocessing performed using a ColumnTransformer and a Pipeline?

Regarding that, I simply followed the steps from Chapter 2 until I got housing_prepared (except for the CombinedAttributesAdder(), although it doesn't make any difference). That means:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

And:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)

I also tried the SVR with RandomizedSearchCV today, but the RMSE is still more than 100k. So I guess the problem is the data.
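
For reference, such a search might look roughly like this (the search space below is illustrative, loosely modelled on the book's exercise solutions; SVR is slow on the full training set, so keep n_iter and cv small at first):

import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

param_distribs = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(20, 200_000),
    "gamma": loguniform(0.001, 0.1),
}
rnd_search = RandomizedSearchCV(
    SVR(), param_distribs, n_iter=10, cv=3,
    scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
rmse = np.sqrt(-rnd_search.best_score_)  # RMSE of the best model found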

ashishthanki commented 3 years ago

I also tried the SVR with RandomizedSearchCV today, but the RMSE is still more than 100k. So I guess the problem is the data.

The data used in Chapter 5 is from sklearn, which is a prepared version that only requires standard scaling and no one-hot encoding. Take a look at the notebook: https://github.com/ageron/handson-ml2/blob/master/05_support_vector_machines.ipynb
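
For comparison, the Chapter 5 exercise loads the data along these lines (a sketch close to the notebook's approach - all features are numeric, so scaling is the only preprocessing needed):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# All eight features are numeric: no imputing or one-hot encoding required
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)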

ageron commented 3 years ago

Thanks for your question @ofe-57 , and thanks to @ashishthanki for the great answers. Is everything clear now @ofe-57?