Thus far we've been using grid search (cross-validation) to select the optimal elastic net mixing parameter. For `SGDClassifier`, this mixing parameter is set using `l1_ratio`, where `l1_ratio = 0` performs ridge regularization and `l1_ratio = 1` performs lasso regularization. Here's what I'm thinking:

Grid search is not the appropriate way to select the mixing parameter. Ridge (with the optimal regularization penalty, `alpha`) will always perform better than the optimal lasso. The reason is that there's a cost for the convenience of sparsity: lasso makes difficult decisions about which features to select. The sparsity can therefore aid in model interpretation, but it weakens performance, because identifying only the predictive features is an impossible task. For example, see our grid from this notebook (note that it used MAD feature selection to select only 500 features, which likely accentuates the performance deficit as `l1_ratio` increases).

So my sense is that `l1_ratio` should be chosen based on what properties we want the model to have, not based on maximum CV performance. If we only care about performance, we might as well save ourselves the computation time and always go with ridge or the default `l1_ratio = 0.15`, which can still filter ~50% of features with little performance degradation. But if you want real sparsity (lasso), there's going to be a performance cost -- and the user, not grid search, will have to make this decision.
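To make the trade-off concrete, here's a minimal sketch (not from our notebooks; synthetic data stands in for the expression matrix, and the `alpha` value is an arbitrary placeholder) of how `l1_ratio` controls coefficient sparsity in `SGDClassifier`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a 500-feature expression matrix.
X, y = make_classification(n_samples=500, n_features=500,
                           n_informative=20, random_state=0)

for l1_ratio in [0.0, 0.15, 0.5, 1.0]:
    # alpha=0.001 is illustrative, not a tuned value.
    clf = SGDClassifier(penalty='elasticnet', alpha=0.001,
                        l1_ratio=l1_ratio, random_state=0)
    clf.fit(X, y)
    n_zero = np.sum(clf.coef_ == 0)
    print('l1_ratio={:.2f}: {} of {} coefficients are zero'.format(
        l1_ratio, n_zero, clf.coef_.size))
```

As `l1_ratio` moves from 0 toward 1, more coefficients are driven exactly to zero -- sparsity goes up, and the question is only how much performance we're willing to trade for it.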
Also, I'd rather spend more time optimizing `alpha` (regularization strength). glmnet in R defaults to trying a sequence of 100 different regularization strengths.
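A glmnet-style sweep might look like the sketch below. The log-spaced grid and its endpoints are assumptions for illustration (glmnet actually derives its sequence from the data), but it shows 100 strengths searched with `GridSearchCV` while `l1_ratio` stays fixed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

# 100 log-spaced strengths; the endpoints here are placeholders.
param_grid = {'alpha': np.logspace(-5, 1, 100)}

search = GridSearchCV(
    SGDClassifier(penalty='elasticnet', l1_ratio=0.15, random_state=0),
    param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)
print('best alpha: {:.2e}'.format(search.best_params_['alpha']))
```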
Agree with @dhimmel about ridge/lasso trade-offs. We could ask the user how much they value sparsity vs. performance, if we can figure out a way that's not too confusing.

> So my sense is that `l1_ratio` should be chosen based on what properties we want the model to have, not based on maximum CV performance.

Agreed!

If we are performing PCA on the expression matrix to create our features, then I am not sure how important sparsity is going to be in the end classifier. This is probably even more true when the number of components we choose is <= 100.
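A minimal sketch of that setup (synthetic data again standing in for the expression matrix): once PCA produces the features, the classifier's coefficients apply to components rather than genes, so lasso-style sparsity no longer selects individual genes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (samples x genes).
X, y = make_classification(n_samples=500, n_features=2000, random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=100)),  # <= 100 components, per the comment above
    ('classify', SGDClassifier(penalty='elasticnet', l1_ratio=0.15,
                               random_state=0)),
])
pipeline.fit(X, y)

# Any zeroed coefficients now drop components, not individual genes.
coef = pipeline.named_steps['classify'].coef_
print('classifier sees {} features (components), not genes'.format(coef.size))
```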
Closed by #114