dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

L1 and L2 Regularization #3356

Closed PeterPann23 closed 5 years ago

PeterPann23 commented 5 years ago

I would provide a reason for the use of the parameter. If L1 stands for Lasso regression (Least Absolute Shrinkage and Selection Operator), then one could mention that:

The L1 value helps by adding the absolute value of the magnitude of the coefficients as a "penalty" to the loss function. The Lasso reduces/shrinks the weights of the less important features and works well with models that have a large set of features. The value should be greater than 0 and less than or equal to 1. Depending on the trainer used, the L1 value is inferred in steps of 0.25f, starting at 0f and ending at 1f. If you would like to avoid this "discovery" and use your own scale, you can enter one here.

Right now the property adds no value to the API; moreover, it applies an automatically inferred value based on the data set without telling the user.
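
For illustration, here is a minimal sketch (mine, not from the thread) of how a user can override the inferred value, assuming Microsoft.ML 1.x and its SdcaLogisticRegression trainer; the column names and coefficient values are hypothetical:

    // Explicitly setting the regularization coefficients instead of
    // letting the trainer infer them from the data set.
    using Microsoft.ML;

    var mlContext = new MLContext();

    // Passing null (the default) for l1Regularization/l2Regularization lets
    // SDCA infer the coefficients; explicit values skip that "discovery" step.
    var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: "Label",
        featureColumnName: "Features",
        l1Regularization: 0.25f,  // hypothetical value on the user's own scale
        l2Regularization: 0.01f);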


yaeldekel commented 5 years ago

Hi @PeterPann23, thank you for the suggestion, that seems like a positive change to me. cc @shmoradims.

glebuk commented 5 years ago

@PeterPann23, while SDCA sets L1 and L2 by default, the parameter is provided so that the user can override the inferred value if needed, for example, to make the model more sparse or more resilient to noise.
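
A sketch of that override path, assuming Microsoft.ML 1.x and its Options-based overload (names and values are illustrative):

    using Microsoft.ML;
    using Microsoft.ML.Trainers;

    var mlContext = new MLContext();

    // Leaving L1Regularization/L2Regularization null keeps SDCA's inferred
    // defaults; setting them trades sparsity (L1) against noise resilience (L2).
    var options = new SdcaLogisticRegressionBinaryTrainer.Options
    {
        LabelColumnName = "Label",
        FeatureColumnName = "Features",
        L1Regularization = 0.5f, // larger -> sparser model (illustrative)
        L2Regularization = 0.1f  // larger -> more resilient to noise (illustrative)
    };
    var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(options);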

shmoradims commented 5 years ago

@wschin, would you be able to create a write-up about L1/L2 regularization based on this feedback? We can reuse it in the multiple locations that use L1/L2 regularization. Right now we only have a link to the Wikipedia page.

wschin commented 5 years ago

Suggested description:

    /// This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimization problem built upon collected data.
    /// If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
    /// [overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen, so that the trained model is good at describing training data but may fail to predict correct results in unseen events.
    /// [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measured by a [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
    /// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization), which penalizes a linear combination of the L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$, regularizations.
    /// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
    /// Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
    /// For high-dimensional and sparse data sets, if the user carefully selects the coefficient of the L1-norm, it is possible to achieve a good prediction quality with a model that has only a few non-zero weights (e.g., 1% of values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its prediction power.
    /// In contrast, the L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting.
    /// However, using the L2-norm sometimes leads to better prediction quality, so the user may still want to try it and fine-tune the coefficients of the L1-norm and L2-norm.
    /// Note that, conceptually, using the L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution), while
    /// using the L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
    ///
    /// An aggressive regularization (that is, assigning large coefficients to the L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
    /// Therefore, choosing the right regularization coefficients is important in practice.
    /// For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.

We can't directly mention Least Absolute Shrinkage and Selection Operator because sparsity is a behavior provided by the proximal gradient method or similar algorithms. If, for example, the sub-gradient method is used, the L1-norm does not give us any sparsity.
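
For reference, the elastic net objective the description alludes to can be written as follows (a standard formulation; the per-example loss $L$, coefficients $\lambda_1, \lambda_2$, and generic weight vector $\textbf{w}$ are my notation, not from the thread):

    % Empirical risk plus elastic net penalty:
    % \lambda_1 scales the L1-norm (sparsity), \lambda_2 the squared L2-norm (shrinkage).
    \min_{\textbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} L(\textbf{w}; x_i, y_i)
        + \lambda_1 \| \textbf{w} \|_1
        + \lambda_2 \| \textbf{w} \|_2^2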

glebuk commented 5 years ago

Let's update the docs to ensure that this is recorded

shmoradims commented 5 years ago

To be fixed by #3586. The live site content will be updated as part of 1.1 release.