cognoma / machine-learning

Machine learning for Project Cognoma

Selecting the elastic net mixing parameter #56

Closed dhimmel closed 7 years ago

dhimmel commented 8 years ago

Thus far we've been using grid search (cross validation) to select the optimal elastic net mixing parameter. For SGDClassifier, this mixing parameter is set using l1_ratio, where l1_ratio = 0 performs ridge regularization and l1_ratio = 1 performs lasso regularization.
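For reference, here's a minimal sketch of that setup. The data and parameter values are illustrative stand-ins, not Cognoma's actual pipeline, and `loss='log_loss'` is the current scikit-learn name for logistic loss (older versions use `loss='log'`):

```python
# Sketch: grid search (cross validation) over the elastic net mixing
# parameter l1_ratio for SGDClassifier. Data and values are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

param_grid = {
    'l1_ratio': [0, 0.15, 0.5, 1],  # 0 = ridge, 1 = lasso
    'alpha': [1e-4, 1e-3, 1e-2],    # regularization strength
}
clf = SGDClassifier(loss='log_loss', penalty='elasticnet', random_state=0)
search = GridSearchCV(clf, param_grid, cv=5, scoring='roc_auc')
search.fit(X, y)
print(search.best_params_)
```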

Here's what I'm thinking:

Grid search is not the appropriate way to select the mixing parameter. Ridge (with the optimal regularization strength, alpha) will always perform better than the optimal lasso. The reason is that there's a cost for the convenience of sparsity: lasso must make hard decisions about which features to select. The sparsity can therefore aid model interpretation, but it weakens performance, because identifying only the truly predictive features is an impossible task.

For example, see our grid from this notebook (note this used MAD feature selection to select only 500 features, which likely accentuates the performance deficit as l1_ratio increases).

[figure: grid of cross-validated performance across l1_ratio values]

So my sense is that l1_ratio should be chosen based on what properties we want the model to have, not based on maximum CV performance. If we only care about performance, we might as well save ourselves the computation time and always go with ridge or the default l1_ratio = 0.15. l1_ratio = 0.15 can still filter out ~50% of features with little performance degradation. But if you want real sparsity (lasso), there's going to be a performance cost -- and the user, not grid search, will have to make this decision.
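As a hedged sketch of that trade-off, one can fix the mixing parameter at the default and count how many coefficients the penalty zeroes out (illustrative data; SGDClassifier's elastic net penalty does produce exactly-zero coefficients):

```python
# Sketch: how sparse does the model get at the default l1_ratio = 0.15?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=500, random_state=0)

clf = SGDClassifier(loss='log_loss', penalty='elasticnet',
                    l1_ratio=0.15, alpha=1e-3, random_state=0)
clf.fit(X, y)
sparsity = np.mean(clf.coef_ == 0)  # fraction of exactly-zero coefficients
print(f'{sparsity:.0%} of coefficients are zero')
```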

dhimmel commented 8 years ago

Also, I'd rather spend more time optimizing alpha (regularization strength). glmnet in R defaults to trying a sequence of 100 different regularization strengths.
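Something like the following would mimic that in scikit-learn; the endpoints of the alpha sequence here are assumptions, not values the project has settled on:

```python
# Sketch: search ~100 log-spaced regularization strengths, in the
# spirit of glmnet's default path. Data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

alphas = np.logspace(-6, 1, num=100)  # 100 strengths from 1e-6 to 10
clf = SGDClassifier(loss='log_loss', penalty='elasticnet',
                    l1_ratio=0.15, random_state=0)
search = GridSearchCV(clf, {'alpha': alphas}, cv=5, scoring='roc_auc')
search.fit(X, y)
print(search.best_params_['alpha'])
```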

cgreene commented 8 years ago

Agree with @dhimmel about the ridge/lasso trade-offs. We could ask the user how much they value sparsity vs. performance, if we can figure out a way to do so that's not too confusing.

gwaybio commented 8 years ago

> So my sense is that l1_ratio should be chosen based on what properties we want the model to have, not based on maximum CV performance.

Agreed!

patrick-miller commented 7 years ago

If we are performing PCA on the expression matrix to create our features, then I am not sure how important sparsity is going to be in the final classifier. This is probably even more true when the number of components we choose is <= 100.
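To illustrate the point, here's a sketch with assumed shapes and component count, not the project's actual pipeline:

```python
# Sketch: with PCA features the classifier sees <= 100 dense components,
# so a sparse coefficient vector no longer selects individual genes.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=2000, random_state=0)

model = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=100)),
    ('clf', SGDClassifier(loss='log_loss', penalty='elasticnet',
                          l1_ratio=0.15, random_state=0)),
])
model.fit(X, y)
```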

rdvelazquez commented 7 years ago

Closed by #114