amueller / COMS4995-s20

COMS W4995 Applied Machine Learning - Spring 20
https://www.cs.columbia.edu/~amueller/comsw4995s20/
Creative Commons Zero v1.0 Universal

Preprocessing TargetEncoder example #8

Open thomasjpfan opened 4 years ago

thomasjpfan commented 4 years ago

When the numerical data is scaled, the one-hot encoder works pretty well:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml

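# Load the house sales data from OpenML as a DataFrame.
X, y = fetch_openml("house_sales", as_frame=True, version=2,
                    return_X_y=True)
X = X.drop(['date'], axis=1)

# One-hot encode zipcode; standardize every remaining column.
prep = ColumnTransformer([
    ('encoder', OneHotEncoder(handle_unknown='ignore'), ['zipcode'])
], remainder=StandardScaler())

pipe = Pipeline([
    ('prep', prep),
    ('clf', Ridge())])

# Default cross-validation; for a regressor the score is R^2.
scores = cross_val_score(pipe, X, y)
scores.mean()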

# 0.8037

@amueller

amueller commented 4 years ago

Thanks. Huh, I didn't scale? That's... weird and a pretty obvious oversight. Target encoder doesn't work better, though?

thomasjpfan commented 4 years ago

Target encoder does not work better:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import TargetEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml

# Same data and model as above, with TargetEncoder swapped in for OneHotEncoder.
X, y = fetch_openml("house_sales", as_frame=True, version=2, return_X_y=True)
X = X.drop(['date'], axis=1)
# Target-encode zipcode; standardize every remaining column.
prep = ColumnTransformer([('encoder', TargetEncoder(), ['zipcode'])],
                         remainder=StandardScaler())

pipe = Pipeline([('prep', prep), ('clf', Ridge())])
scores = cross_val_score(pipe, X, y)
scores.mean()
# 0.7862
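
For readers unfamiliar with target encoding, here is a minimal pure-Python sketch of the idea: each category is replaced by a smoothed mean of the target seen for that category in the training data, so zipcode becomes a single numeric column instead of one column per zipcode. The additive smoothing, the function names, and the toy data are all assumptions made for illustration; this is not the formula `category_encoders.TargetEncoder` uses internally.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn one number per category: a smoothed mean of the target.

    The additive smoothing is an illustrative assumption, not the
    formula used by category_encoders.TargetEncoder.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    # Blend each category mean with the global mean; categories with
    # few samples are pulled more strongly toward the global mean.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in sums
    }
    return mapping, global_mean


def transform_target_encoding(categories, mapping, global_mean):
    """Replace each category by its learned value.

    Categories not seen during fitting fall back to the global mean.
    """
    return [mapping.get(cat, global_mean) for cat in categories]


# Hypothetical toy data: zipcodes and prices made up for illustration.
train_zip = ["98101", "98101", "98052", "98052", "98052", "98004"]
train_price = [500.0, 700.0, 300.0, 320.0, 280.0, 900.0]

mapping, global_mean = fit_target_encoding(train_zip, train_price, smoothing=1.0)
encoded = transform_target_encoding(["98101", "98004", "99999"], mapping, global_mean)
print(encoded)
```

Because the encoding is learned from the target, it has to be fit on the training portion of each split only. Putting the encoder inside the `Pipeline` passed to `cross_val_score`, as in the snippet above, takes care of that.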

amueller commented 4 years ago

Huh. Ames housing or Melbourne housing then? Or https://www.kaggle.com/austinreese/craigslist-carstrucks-data? Or any of the ones from Gael's paper? Or is that all classification?