thomasjpfan opened this issue 4 years ago
Thanks. Huh, I didn't scale? That's... weird, and a pretty obvious oversight. The target encoder doesn't work better, though?
Target encoder does not work better:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders import TargetEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml
X, y = fetch_openml("house_sales", as_frame=True, version=2, return_X_y=True)
X = X.drop(['date'], axis=1)
prep = ColumnTransformer([('encoder', TargetEncoder(), ['zipcode'])],
                         remainder=StandardScaler())
pipe = Pipeline([('prep', prep), ('clf', Ridge())])
scores = cross_val_score(pipe, X, y)
scores.mean()
# 0.7862
Huh. Ames housing or Melbourne housing then? Or https://www.kaggle.com/austinreese/craigslist-carstrucks-data? Or any of the ones from Gaël's paper? Or are those all classification?
When the numerical data is scaled, the one-hot encoder works pretty well:
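A minimal sketch of that setup on synthetic data (the column names, effect sizes, and dataset here are made up for illustration, not the house_sales data from the snippet above): one-hot encode the categorical column, standard-scale the remaining numeric columns, and fit a Ridge model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic housing-like data: price depends on sqft and on zipcode.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "zipcode": rng.choice(["98101", "98102", "98103"], size=n),
    "sqft": rng.normal(2000, 500, size=n),
})
zip_effect = X["zipcode"].map({"98101": 50_000, "98102": 0, "98103": -50_000})
y = 100 * X["sqft"] + zip_effect + rng.normal(0, 10_000, size=n)

# One-hot encode the categorical column; scale everything else.
prep = ColumnTransformer(
    [("encoder", OneHotEncoder(handle_unknown="ignore"), ["zipcode"])],
    remainder=StandardScaler(),
)
pipe = Pipeline([("prep", prep), ("clf", Ridge())])
scores = cross_val_score(pipe, X, y)
print(scores.mean())
```

With scaled numeric features, Ridge's single regularization strength penalizes the one-hot dummies and the numeric coefficients on a comparable scale, which is presumably why the one-hot pipeline holds up here.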
@amueller