Closed qchempku2017 closed 11 months ago
When running a test to find the optimal alpha for Lasso, I found that the one-standard-error selection gives a smaller alpha than the minimum CV selection. This should not be the case.

[Plots: one standard error rule; minimum CV rule]
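For reference, here is a sketch of the usual statement of the rule. The notation is assumed here and not taken from sparse-lm: $\mathrm{CV}(\alpha)$ is the mean cross-validation error, $\mathrm{SE}(\alpha)$ its standard error across folds, and $\alpha_{\min}$ the grid point that minimizes $\mathrm{CV}$.

$$
\alpha_{\mathrm{1se}} = \max\{\alpha : \mathrm{CV}(\alpha) \le \mathrm{CV}(\alpha_{\min}) + \mathrm{SE}(\alpha_{\min})\}
$$

Since $\alpha_{\min}$ itself satisfies the inequality, $\alpha_{\mathrm{1se}} \ge \alpha_{\min}$ on any grid, which is the ordering the reported result violates.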
It turns out that the implementation in `model_selection.py` is problematic:
https://github.com/CederGroupHub/sparse-lm/blob/f7bedb3bd2ca672f13b3547552b6559429c94991/src/sparselm/model_selection.py#L189

There we use:
```python
params_sum = np.sum(params, axis=0)
one_std_dists = np.abs(metrics - m + sig)
candidates = np.arange(len(metrics))[
    one_std_dists < (np.min(one_std_dists) + 0.1 * sig)
]
best_index = candidates[np.argmax(params_sum[candidates])]
```
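in order to find the best alpha. This implementation cannot guarantee that the one-standard-error rule always yields an alpha at least as large as the one chosen by the optimum CV rule. Because the distance to `m - sig` is taken as an absolute value, a grid point on the less regularized side of the optimum can be the closest one and end up selected.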
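A minimal sketch of the failure mode on toy numbers, in plain Python. It assumes `m` is the best (largest) mean score and `sig` its standard deviation, which the snippet does not show. `abs_distance_choice` mirrors the quoted logic for a single hyperparameter. `threshold_choice` is a hypothetical alternative, not the sparse-lm fix: it keeps every grid point whose score is within one standard deviation of the best and returns the most regularized of them.

```python
def abs_distance_choice(alphas, scores, sig):
    """Mirror of the quoted logic for a single hyperparameter.

    Assumes m is the largest mean score and sig its standard deviation.
    """
    m = max(scores)
    dists = [abs(s - m + sig) for s in scores]
    tol = min(dists) + 0.1 * sig
    candidates = [i for i, d in enumerate(dists) if d < tol]
    # With one hyperparameter, the parameter sum is just alpha.
    return max(candidates, key=lambda i: alphas[i])


def threshold_choice(alphas, scores, stds):
    """Hypothetical alternative: largest alpha within one std of the best score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    threshold = scores[best] - stds[best]
    candidates = [i for i, s in enumerate(scores) if s >= threshold]
    # The best point is always a candidate, so the result is never
    # less regularized than the optimum CV choice.
    return max(candidates, key=lambda i: alphas[i])


# Toy grid: scores are negated CV errors, best at alpha = 1.0.
alphas = [0.1, 1.0, 10.0]
scores = [-4.5, -4.0, -6.0]
stds = [1.0, 1.0, 1.0]

buggy = abs_distance_choice(alphas, scores, sig=stds[1])
fixed = threshold_choice(alphas, scores, stds)
print(alphas[buggy], alphas[fixed])  # prints: 0.1 1.0
```

On this grid the point closest to `m - sig` lies at alpha = 0.1, below the optimum at alpha = 1.0, so the absolute-distance version returns the smaller alpha. The threshold version cannot do this, because the optimum always passes its own threshold.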
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, train_test_split

from sparselm.model_selection import GridSearchCV

X, y, coef = make_regression(
    n_samples=200,
    n_features=100,
    n_informative=10,
    noise=40.0,
    bias=-15.0,
    coef=True,
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# create estimators
lasso = Lasso(fit_intercept=True)

# create cv search objects for each estimator
cv5 = KFold(n_splits=5, shuffle=True, random_state=0)
params = {"alpha": np.logspace(-1, 1, 10)}

lasso_cv_std = GridSearchCV(
    lasso, params, opt_selection_method="one_std_score", cv=cv5, n_jobs=-1
)
lasso_cv_opt = GridSearchCV(
    lasso, params, opt_selection_method="max_score", cv=cv5, n_jobs=-1
)

# fit models on training data
lasso_cv_std.fit(X_train, y_train)
lasso_cv_opt.fit(X_train, y_train)

std_cv_mean = -lasso_cv_std.cv_results_["mean_test_score"]
std_cv_std = lasso_cv_std.cv_results_["std_test_score"]
print("Best params:", lasso_cv_std.best_params_)
print("log best param:", np.log(lasso_cv_std.best_params_["alpha"]))
print("Best cv:", -lasso_cv_std.best_score_)
plt.plot(np.log(params["alpha"]), std_cv_mean, color="k")
plt.fill_between(
    np.log(params["alpha"]),
    std_cv_mean - std_cv_std,
    std_cv_mean + std_cv_std,
)
plt.scatter(
    [np.log(lasso_cv_std.best_params_["alpha"])],
    [-lasso_cv_std.best_score_],
    color="r",
    s=100,
)

opt_cv_mean = -lasso_cv_opt.cv_results_["mean_test_score"]
opt_cv_std = lasso_cv_opt.cv_results_["std_test_score"]
print("Best params:", lasso_cv_opt.best_params_)
print("log best param:", np.log(lasso_cv_opt.best_params_["alpha"]))
print("Best cv:", -lasso_cv_opt.best_score_)
plt.plot(np.log(params["alpha"]), opt_cv_mean, color="k")
plt.fill_between(
    np.log(params["alpha"]),
    opt_cv_mean - opt_cv_std,
    opt_cv_mean + opt_cv_std,
)
plt.scatter(
    [np.log(lasso_cv_opt.best_params_["alpha"])],
    [-lasso_cv_opt.best_score_],
    color="r",
    s=100,
)
```
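The visual check in the plots could also be turned into a small regression check. The sketch below uses a hypothetical helper, `check_one_std_selection`, which is not part of sparse-lm. It assumes `cv_results_` exposes the scikit-learn style keys `param_alpha`, `mean_test_score` and `std_test_score`, and that greater scores are better.

```python
def check_one_std_selection(cv_results, selected_alpha):
    """Hypothetical check that a one-standard-error selection is consistent.

    Returns True when the selected alpha is at least as large as the alpha
    with the best mean score and its mean score lies within one standard
    deviation of that best score.
    """
    alphas = [float(a) for a in cv_results["param_alpha"]]
    means = [float(s) for s in cv_results["mean_test_score"]]
    stds = [float(s) for s in cv_results["std_test_score"]]

    best = max(range(len(means)), key=lambda i: means[i])
    # Match the selected alpha to the nearest grid point.
    chosen = min(range(len(alphas)), key=lambda i: abs(alphas[i] - selected_alpha))

    at_least_as_regularized = alphas[chosen] >= alphas[best]
    within_one_std = means[chosen] >= means[best] - stds[best]
    return at_least_as_regularized and within_one_std


# Toy results standing in for a fitted search object.
toy_results = {
    "param_alpha": [0.1, 1.0, 10.0],
    "mean_test_score": [-4.5, -4.0, -6.0],
    "std_test_score": [1.0, 1.0, 1.0],
}
print(check_one_std_selection(toy_results, 0.1))  # prints: False
print(check_one_std_selection(toy_results, 1.0))  # prints: True
```

With the script above, the call would look like `check_one_std_selection(lasso_cv_std.cv_results_, lasso_cv_std.best_params_["alpha"])`, assuming those keys are present in the sparse-lm results.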
Minimum CV rule:
![image](https://github.com/CederGroupHub/sparse-lm/assets/28149881/19ba81c5-1edf-4a97-84be-8c1b62a5c6d9)