abjer / isds2020

Introduction to Social Data Science 2020 - a summer school course abjer.github.io/isds2020

Inverted validation curve #51

Open aabk-bkaa opened 4 years ago

aabk-bkaa commented 4 years ago

After fitting our model it appears that our validation curve is inverted:

[Image: validation curve plot]

The validation RMSE is systematically lower than the training RMSE, which does not make intuitive sense to us.

The modelling was produced with the following code:

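```python
# Imports assumed from context (not shown in the original post);
# `mse` is taken to be sklearn's mean_squared_error.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from tqdm import tqdm

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)

lambdas = np.logspace(0, 8, 12)

folds = KFold(n_splits=5)
MSE_list = []

for _lambda in tqdm(lambdas):
    pipe_preproc = make_pipeline(PolynomialFeatures(2), StandardScaler(), Lasso(alpha=_lambda, max_iter=1000))
    MSE_train = []
    MSE_list_intermediate = []

    for train_index, val_index in tqdm(folds.split(X_train, y_train)):

        X_tr, y_tr = X_train.iloc[train_index], y_train.iloc[train_index]
        X_val, y_val = X_train.iloc[val_index], y_train.iloc[val_index]

        # RMSE on the held-out fold
        MSE_list_intermediate.append(mse(y_val, pipe_preproc.fit(X_tr, y_tr).predict(X_val))**(1/2))

        # RMSE on all of X_train (the fitted rows plus the held-out fold)
        MSE_train.append(mse(y_train, pipe_preproc.fit(X_tr, y_tr).predict(X_train))**(1/2))

    # Row layout: lambda, per-fold validation RMSE, mean validation RMSE, mean training RMSE
    MSE_list.append([_lambda] + MSE_list_intermediate + [np.mean(MSE_list_intermediate)] + [np.mean(MSE_train)])

MSE = pd.DataFrame(MSE_list)
MSE.columns = ["Lambda", "Fold 1", "Fold 2", "Fold 3", "Fold 4", "Fold 5", "Mean_RMSE", "Mean_RMSE_Evaluation"]

MSE.to_excel("LASSO_output.xlsx")
```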

Can anybody help us?

Kind regards, Anton and Søren

jsr-p commented 4 years ago

Hi @aabk-bkaa, assuming that you did not label the curves incorrectly when plotting, there can be other reasons for the RMSE being lower on the validation data than on the training data. See: https://stats.stackexchange.com/questions/187335/validation-error-less-than-training-error
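
Two details in the posted code may be worth checking before looking at other explanations. The training RMSE is computed on all of `X_train`, which includes the held-out fold, and the last two columns hold the mean validation RMSE (`Mean_RMSE`) and the mean training RMSE (`Mean_RMSE_Evaluation`), so the names are easy to mix up when plotting. Below is a minimal sketch of the same loop with the training error computed only on the rows used for fitting and with explicit labels. It uses synthetic data from `make_regression` as a stand-in (an assumption, since the real `X` and `y` are not shown), and the lambda grid and the names `results`, `mean_rmse_train` and `mean_rmse_val` are illustrative, not taken from the course material.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption: the real X and y are not shown in the thread).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

lambdas = np.logspace(-1, 2, 4)  # illustrative grid, not the one from the post
folds = KFold(n_splits=5)
results = []

for lam in lambdas:
    pipe = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                         Lasso(alpha=lam, max_iter=10000))
    rmse_train, rmse_val = [], []
    for train_idx, val_idx in folds.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]
        pipe.fit(X_tr, y_tr)  # fit once per fold and reuse for both errors
        # Training error: only the rows the model was fitted on.
        rmse_train.append(np.sqrt(mean_squared_error(y_tr, pipe.predict(X_tr))))
        # Validation error: only the held-out rows.
        rmse_val.append(np.sqrt(mean_squared_error(y_val, pipe.predict(X_val))))
    results.append({"lambda": lam,
                    "mean_rmse_train": float(np.mean(rmse_train)),
                    "mean_rmse_val": float(np.mean(rmse_val))})

for row in results:
    print(row)
```

If the curves still look inverted after a change like this, the explanations in the linked Stack Exchange thread are the next place to look.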