CederGroupHub / sparse-lm

Sparse Linear Regression Models
https://cedergrouphub.github.io/sparse-lm

One standard error search gives smaller alpha than minimum CV search #96

Closed: qchempku2017 closed this issue 11 months ago

qchempku2017 commented 11 months ago

When running tests to find the optimal alpha in Lasso, I've found that one-standard-error selection gives a smaller alpha than minimum-CV selection. This should not be the case.

One standard error rule: [image]

Minimum CV rule: [image]

Expected Behavior

The alpha selected by the one-standard-error rule should be at least as large as the alpha selected by the minimum-CV rule.

Current Behavior

The one-standard-error rule can select a smaller alpha than the minimum-CV rule, as shown in the plots above.

Possible Solution

It turns out that the implementation in model_selection.py is problematic: https://github.com/CederGroupHub/sparse-lm/blob/f7bedb3bd2ca672f13b3547552b6559429c94991/src/sparselm/model_selection.py#L189 Here we use:

            # sum of the parameter values at each grid point
            params_sum = np.sum(params, axis=0)
            # distance of each mean score to the one-std threshold m - sig
            one_std_dists = np.abs(metrics - m + sig)
            # keep only the grid points closest to that threshold
            candidates = np.arange(len(metrics))[
                one_std_dists < (np.min(one_std_dists) + 0.1 * sig)
            ]
            # among those, pick the one with the largest summed parameters
            best_index = candidates[np.argmax(params_sum[candidates])]

in order to find the best alpha. This implementation cannot guarantee that the one-std rule always yields a larger alpha than the minimum-CV rule: candidates is restricted to the grid points whose mean score is closest to the one-std threshold, and if the closest such point lies on the low-alpha side of the CV optimum, the argmax over params_sum can only choose among low-alpha candidates.
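A small numeric sketch of the failure mode (the grid and scores below are made up for illustration; metrics holds the mean CV scores, higher is better, and m and sig are the best mean score and its standard deviation, matching the snippet above):

import numpy as np

# hypothetical mean CV scores on an alpha grid ordered from small to large alpha
metrics = np.array([0.70, 0.86, 0.90, 0.80, 0.75])
m = np.max(metrics)  # 0.90 at index 2: the minimum-CV choice
sig = 0.04           # assumed std of the test scores at the best index

# distances to the one-std threshold m - sig = 0.86
one_std_dists = np.abs(metrics - m + sig)
candidates = np.arange(len(metrics))[
    one_std_dists < (np.min(one_std_dists) + 0.1 * sig)
]
print(candidates)  # [1] -- only a point on the low-alpha side survives,
                   # so the rule ends up choosing a smaller alpha than index 2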

Steps to Reproduce

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, train_test_split

from sparselm.model_selection import GridSearchCV

X, y, coef = make_regression(
    n_samples=200,
    n_features=100,
    n_informative=10,
    noise=40.0,
    bias=-15.0,
    coef=True,
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# create estimators
lasso = Lasso(fit_intercept=True)

# create cv search objects for each estimator
cv5 = KFold(n_splits=5, shuffle=True, random_state=0)
params = {"alpha": np.logspace(-1, 1, 10)}

lasso_cv_std = GridSearchCV(
    lasso, params, opt_selection_method="one_std_score", cv=cv5, n_jobs=-1
)
lasso_cv_opt = GridSearchCV(
    lasso, params, opt_selection_method="max_score", cv=cv5, n_jobs=-1
)

# fit models on training data
lasso_cv_std.fit(X_train, y_train)
lasso_cv_opt.fit(X_train, y_train)

std_cv_mean = -lasso_cv_std.cv_results_["mean_test_score"]
std_cv_std = lasso_cv_std.cv_results_["std_test_score"]
print("Best params:", lasso_cv_std.best_params_)
print("log best param:", np.log(lasso_cv_std.best_params_["alpha"]))
print("Best cv:", -lasso_cv_std.best_score_)

plt.plot(np.log(params["alpha"]), std_cv_mean, color="k")
plt.fill_between(np.log(params["alpha"]), std_cv_mean - std_cv_std, std_cv_mean + std_cv_std)
plt.scatter([np.log(lasso_cv_std.best_params_["alpha"])], [-lasso_cv_std.best_score_], color="r", s=100)

opt_cv_mean = -lasso_cv_opt.cv_results_["mean_test_score"]
opt_cv_std = lasso_cv_opt.cv_results_["std_test_score"]
print("Best params:", lasso_cv_opt.best_params_)
print("log best param:", np.log(lasso_cv_opt.best_params_["alpha"]))
print("Best cv:", -lasso_cv_opt.best_score_)

plt.plot(np.log(params["alpha"]), opt_cv_mean, color="k")
plt.fill_between(np.log(params["alpha"]), opt_cv_mean - opt_cv_std, opt_cv_mean + opt_cv_std)
plt.scatter([np.log(lasso_cv_opt.best_params_["alpha"])], [-lasso_cv_opt.best_score_], color="r", s=100)
plt.show()
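Running this script, the one_std_score search reports a smaller best alpha than the max_score search, reproducing the plots above.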


Possible Implementation
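A minimal sketch of one possible fix (this is not the library's actual patch; it assumes scores are higher-is-better and that a larger params_sum means a more regularized model, as in the snippet above): keep every grid point whose mean score clears the one-std threshold, then take the most regularized one.

import numpy as np

def one_std_best_index(metrics, stds, params_sum):
    # index of the minimum-CV (maximum mean score) model
    best = np.argmax(metrics)
    # one-standard-error threshold below the best mean score
    threshold = metrics[best] - stds[best]
    # every model whose mean score is within one std of the best;
    # the best index itself always qualifies, so this is never empty
    admissible = np.flatnonzero(metrics >= threshold)
    # among those, pick the most regularized (largest summed parameters)
    return admissible[np.argmax(params_sum[admissible])]

Because the best index always clears its own threshold, the selected model can never be less regularized than the minimum-CV choice, which restores the expected ordering of the two alphas.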