Closed PeterPann23 closed 5 years ago
Hi @PeterPann23 , thank you for the suggestion, that seems like a positive change to me. cc @shmoradims.
@PeterPann23 , while SDCA sets L1 and L2 by default, the parameter is exposed so that the user can override the inferred value if needed, for example to make the model sparser or more resilient to noise.
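For readers who land here looking for the override itself, here's a minimal sketch of what it looks like in user code, assuming the ML.NET 1.x `SdcaLogisticRegressionBinaryTrainer.Options` API; the column names and coefficient values are illustrative only:

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

var mlContext = new MLContext();

// Override the regularization coefficients instead of letting SDCA infer them.
// The values below are placeholders; tune them against your own validation data.
var options = new SdcaLogisticRegressionBinaryTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    L1Regularization = 0.01f,  // larger values push more weights toward exact zero (sparser model)
    L2Regularization = 0.1f    // larger values shrink weights overall (more resilient to noise)
};

var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(options);
```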
@wschin , would you be able to create a write-up about L1/L2 regularization based on this feedback? We can reuse it in the multiple locations that use L1/L2 regularization. Right now we only have a link to the Wikipedia page.
Suggested description:
/// This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimization problem built upon collected data.
/// If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
/// [overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen, so that the trained model is good at describing the training data but may fail to predict correct results on unseen events.
/// [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measured by the [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization), which penalizes a linear combination of the L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and the L2-norm (ridge), $|| \textbf{w}_c ||_2^2$, regularizations.
/// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
/// Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
/// For a high-dimensional and sparse data set, if the user carefully selects the coefficient of the L1-norm, it is possible to achieve good prediction quality with a model that has only a few non-zero values (for example, 1% of the values) in $\textbf{w}_1,\dots,\textbf{w}_m$.
/// In contrast, the L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting.
/// However, using the L2-norm sometimes leads to better prediction quality, so the user may still want to try it and fine-tune the coefficients of the L1-norm and L2-norm.
/// Note that, conceptually, using the L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution), while
/// using the L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
///
/// An aggressive regularization (that is, assigning large coefficients to the L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// For example, a very large L1-norm coefficient may force all parameters to be zero and lead to a trivial model.
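As a companion to the suggested description (my own sketch, not proposed doc text), the elastic-net-regularized objective it refers to can be written roughly as follows; treat $L$ as a placeholder for whichever loss the trainer actually minimizes, and $\lambda_1$, $\lambda_2$ as the L1 and L2 coefficients:

$$\min_{\textbf{w}_1,\dots,\textbf{w}_m} \; \frac{1}{n}\sum_{i=1}^{n} L\big(\textbf{w}_1,\dots,\textbf{w}_m;\, x_i, y_i\big) \;+\; \lambda_1 \sum_{c=1}^{m} \|\textbf{w}_c\|_1 \;+\; \lambda_2 \sum_{c=1}^{m} \|\textbf{w}_c\|_2^2$$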
We can't directly mention Least Absolute Shrinkage and Selection Operator
because the sparsity is a behavior provided by the proximal gradient method or similar algorithms. If, for example, the sub-gradient method is used, the L1-norm does not give us any sparsity.
Let's update the docs to ensure that this is recorded.
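To make that point concrete (again my wording, not proposed doc text): the proximal operator of the L1 term is the soft-thresholding map

$$\operatorname{prox}_{\lambda \|\cdot\|_1}(w)_i = \operatorname{sign}(w_i)\,\max(|w_i| - \lambda,\, 0),$$

which sets every coordinate with $|w_i| \le \lambda$ exactly to zero. A plain sub-gradient step only shrinks coordinates and generally leaves them non-zero, so the sparsity-inducing behavior depends on the optimizer, not on the L1 term alone.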
To be fixed by #3586. The live site content will be updated as part of the 1.1 release.
I would provide a reason for the use of the parameter. If L1 stands for Lasso Regression (Least Absolute Shrinkage and Selection Operator), then one could mention that: