Preprocessing TargetEncoder example #8

Open thomasjpfan opened 4 years ago

thomasjpfan commented 4 years ago

When the numerical data is scaled, the one-hot encoder works pretty well:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml

X, y = fetch_openml("house_sales", as_frame=True, version=2, return_X_y=True)
X = X.drop(['date'], axis=1)

prep = ColumnTransformer([
    ('encoder', OneHotEncoder(handle_unknown='ignore'), ['zipcode'])
], remainder=StandardScaler())

pipe = Pipeline([
    ('prep', prep),
    ('clf', Ridge())])

scores = cross_val_score(pipe, X, y)

# 0.8037


amueller commented 4 years ago

Thanks. Huh, I didn't scale? That's... weird and a pretty obvious oversight. Target encoder doesn't work better though?

thomasjpfan commented 4 years ago

Target encoder does not work better:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import TargetEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml

X, y = fetch_openml("house_sales", as_frame=True, version=2, return_X_y=True)
X = X.drop(['date'], axis=1)
prep = ColumnTransformer([('encoder', TargetEncoder(), ['zipcode'])],
                         # remainder assumed to mirror the one-hot example above
                         remainder=StandardScaler())

pipe = Pipeline([('prep', prep), ('clf', Ridge())])
scores = cross_val_score(pipe, X, y)
# 0.7862
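
For readers unfamiliar with the idea, here is a rough sketch of what target encoding does to a column like zipcode: each category is replaced by a smoothed mean of the target. This is a minimal illustration using simple additive smoothing, which is not necessarily the formula category_encoders uses. The helper names, the `smoothing` parameter, and the toy data are all made up for the example.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target (hypothetical helper)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    # Shrink each category mean toward the global mean;
    # categories with few samples are shrunk the most.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def transform(categories, encoding, global_mean):
    # Categories not seen during fitting fall back to the global mean.
    return [encoding.get(cat, global_mean) for cat in categories]


# Toy data, invented for illustration only.
zipcodes = ["98001", "98001", "98002", "98003", "98003", "98003"]
prices = [300.0, 340.0, 500.0, 700.0, 720.0, 680.0]

encoding, global_mean = fit_target_encoding(zipcodes, prices, smoothing=1.0)
encoded = transform(["98001", "98003", "99999"], encoding, global_mean)
print(encoded)
```

Unlike one-hot encoding, this yields a single numeric column, so a linear model like Ridge only gets one coefficient for all zipcodes rather than one per zipcode, which may be part of why it does not score better here.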

amueller commented 4 years ago

Huh. Ames housing or Melbourne housing then? Or any of the ones from Gael's paper? Or is that all classification?