pascalgottret opened 3 years ago
The exercise asks you to train a Support Vector Regressor, NOT a LinearSVR. There is a significant difference between the two.
To put it simply, imagine fitting a line of best fit to predict house prices. We would not expect house prices to have a linear relationship with any one of our features. Concretely, if the number of bedrooms increased, would you expect the house price to increase linearly as well? Or if the house area increased, would the price increase linearly too?
Sure, there is a positive relationship between these features and the price, but is it linear? Definitely not.
If you .fit the data with an SVR instead (don't forget to import it first: from sklearn.svm import SVR), you should see a better model.
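For example, a minimal sketch of what that might look like (assuming the housing_prepared and housing_labels variables from the Chapter 2 notebook; the C value is just illustrative):

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# housing_prepared / housing_labels: placeholders for the Chapter 2 variables
# SVR with the (default) RBF kernel can model non-linear relationships
svm_reg = SVR(kernel="rbf", C=100)
svm_reg.fit(housing_prepared, housing_labels)

# training-set RMSE (a proper evaluation would use cross-validation)
predictions = svm_reg.predict(housing_prepared)
rmse = np.sqrt(mean_squared_error(housing_labels, predictions))
print(rmse)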
Also, the large RMSE error that you have is a great example of poor model selection: assuming the data is linear when it is actually more complex. Everyone has tried to fit a poor model before (me included!)
To put it in more technical language, this is a prime example of the Bias/Variance Trade-Off: the LinearSVR's generalization error is due to incorrect assumptions about the data, causing the model to underfit the training data.
P.S. Don't forget to split your data into training, validation, and test sets. The validation set, amongst many other things, is there so you can eliminate models with large bias errors such as this one.
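For instance, a minimal sketch of such a split with scikit-learn (X and y are placeholder names for your feature matrix and labels; the ratios are just illustrative):

from sklearn.model_selection import train_test_split

# X, y: placeholders for the full feature matrix and labels
# split off a test set first, then carve a validation set out of the rest
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)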
Many thanks for your explanation, although now I am even more confused :) What is the difference? I thought that SVR can be both linear and polynomial.
In chapter 5, in the SVM Regression section, it says "You can use Scikit-Learn's LinearSVR class to perform linear SVM Regression", and in the solution of this exercise the author used LinearSVR(random_state=42) as well.
Also, when using the training data as stated in the solutions I got excellent results. I only got these bad results when using the prepared data from the chapter. This I do not understand.
> To put it in more technical language, this is a prime example of the Bias/Variance Trade-Off: the 'LinearSVC' generalization error is due to incorrect assumptions about the data, causing the model to underfit the training data.
But I do not use the classifier, I used LinearSVR, the regressor.
Many thanks.
Support Vector Regressors can be both linear and polynomial, but the LinearSVR is only linear.
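In scikit-learn terms, a quick sketch of the difference:

from sklearn.svm import LinearSVR, SVR

lin_svr = LinearSVR(random_state=42)     # linear only
svr_lin = SVR(kernel="linear")           # also linear, but via the kernel trick
svr_poly = SVR(kernel="poly", degree=2)  # polynomial kernel
svr_rbf = SVR(kernel="rbf")              # Gaussian RBF kernel (the default)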
> the solution of this exercise the author used LinearSVR(random_state=42) as well.
He later used the SVR, which outperformed the LinearSVR.
> prepared data from the chapter.
Has your prepared data been scaled, and is the pre-processing being performed using a ColumnTransformer and a Pipeline?
> This I do not understand.
The Bias/Variance trade-off is covered in the book.
> But I do not use the classifier.
Oops, that was a typo; it has been corrected now.
All right, thank you very much for your kind help.
> Has your prepared data been scaled, and is the pre-processing being performed using a ColumnTransformer and a Pipeline?
Regarding that, I simply followed the steps from Chapter 2 until I got housing_prepared (except for CombinedAttributesAdder(), although it doesn't make any difference). That means:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),  # fill missing values with the median
    ('std_scaler', StandardScaler()),               # standardize the numerical features
])
And:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)    # names of the numerical columns
cat_attribs = ["ocean_proximity"]  # the single categorical column

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
I also tried the SVR with RandomizedSearchCV today, but the score is more than 100k. So I guess the problem is the data.
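For reference, such a search could look roughly like this (a sketch only; the parameter distributions are illustrative, not necessarily those of the official solution):

import numpy as np
from scipy.stats import expon, reciprocal
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

# housing_prepared / housing_labels: placeholders for the Chapter 2 variables
param_distribs = {
    "kernel": ["linear", "rbf"],
    "C": reciprocal(20, 200000),   # log-uniform over a wide range
    "gamma": expon(scale=1.0),
}
rnd_search = RandomizedSearchCV(SVR(), param_distribs, n_iter=10, cv=5,
                                scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(np.sqrt(-rnd_search.best_score_))  # RMSE of the best model found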
The data used in chapter 5 is from sklearn, which is a prepared version that only requires standard scaling and no OneHotEncoding. Take a look at the notebook: https://github.com/ageron/handson-ml2/blob/master/05_support_vector_machines.ipynb
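For comparison, that notebook loads and prepares the data roughly like this (a sketch based on the linked notebook):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# all features are numeric here, so no OneHotEncoder is needed
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)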
Thanks for your question @ofe-57, and thanks to @ashishthanki for the great answers. Is everything clear now @ofe-57?
Hello,
To complete this exercise I plugged a LinearSVR directly into the code of Chapter 2. As data I used housing_prepared and as labels I used housing_labels. In the end I just trained the SVM regressor like that: [code snippet not preserved]. The score is quite unusual (compared to RandomForest and Linear Regression): [output not preserved].
Why is it so badly underfitting the data? And why is the error in exercise 10 so much lower, although the data should be more or less the same? (housing_prepared is additionally scaled, uses the imputer and the OneHotEncoder for ocean_proximity, and adds some attributes.)
Thanks.