ddbourgin / numpy-ml

Machine learning, in numpy
https://numpy-ml.readthedocs.io/
GNU General Public License v3.0

[Question] Why does `gamma * beta` stand for the L2 penalty in `LogisticRegression._NLL_grad`? #53

Closed: eromoe closed this issue 4 years ago

eromoe commented 4 years ago

Hello, this is a great project. I am learning how to implement models without sklearn/tensorflow, and it has really helped me a lot.

I have a question on https://github.com/ddbourgin/numpy-ml/blob/4f37707c6c7c390645dec5a503c12a48e624b249/numpy_ml/linear_models/lm.py#L252

Since the p-norm is defined as \|x\|_p = (\sum_i |x_i|^p)^{1/p},

l1norms(self.beta) means the sum of the absolute values of each element of self.beta. I don't quite understand why the simple gamma * beta stands for the L2 term?
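For concreteness, here is a minimal check of my own (not from the repo) that the l1 norm is exactly that sum of absolute values:

```python
import numpy as np

beta = np.array([0.5, -2.0, 3.0])

# p-norm: ||x||_p = (sum_i |x_i|^p)^(1/p)
l1 = np.linalg.norm(beta, ord=1)  # sum of absolute values
l2 = np.linalg.norm(beta, ord=2)  # sqrt of sum of squares

assert np.isclose(l1, np.abs(beta).sum())
assert np.isclose(l2, np.sqrt(beta @ beta))
```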

PS: May I ask what IDE and code documentation plugin you are using? I see some annotations are written in raw LaTeX; it would be nicer to see rendered math symbols than raw LaTeX :)

ddbourgin commented 4 years ago

For linear regression, the l2-regularization term is gamma * np.sqrt(beta @ beta). The gradient of the l2 penalty wrt beta is then simply gamma * beta.

Keep in mind that d_penalty is the gradient of the penalty term wrt the coefficients, not the penalty itself :)
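To make that distinction concrete, here is a minimal sketch of my own (assuming the common ridge convention penalty = 0.5 * gamma * beta @ beta, not the repo's exact code):

```python
import numpy as np

gamma = 0.1
beta = np.array([0.5, -2.0, 3.0])

# L2 (ridge) penalty added to the loss: (gamma / 2) * ||beta||_2^2
penalty = 0.5 * gamma * (beta @ beta)

# Gradient of that penalty wrt beta -- this is the piece that enters the NLL gradient
d_penalty = gamma * beta
```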

I don't use a special IDE, unfortunately. The equations are formatted for display as Sphinx reStructuredText. You can see the rendered equations in the online documentation, or build it yourself from the source in the docs directory. There may also be IDE plugins that will try to render them, but I am not aware of any :)

eromoe commented 4 years ago

@ddbourgin Thank you for the reply.

From https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261

[Image from the article: loss functions with the L1 penalty \lambda \sum_i |w_i| and the L2 penalty \lambda \sum_i w_i^2]

The l1-regularization term is gamma * np.absolute(beta).sum(), and the l2-regularization term is gamma * np.power(np.sqrt(beta @ beta), 2) (I think you miswrote it in the previous comment).

The gradient of the l1 penalty wrt beta is then gamma * np.sign(beta), and the gradient of the l2 penalty wrt beta is gamma * 2 * beta, which is proportional to gamma * beta.

Actually, I had thought the l2-regularization term was gamma * np.sqrt(beta @ beta), so the gradient of the l2 term would be +-1 too. Sometimes I thought the L2 norm was beta^2 and sometimes np.sqrt(beta^2); the l2 norm and the l2-regularization term are so similar that I mixed them up, but now I have it figured out.

But there is one remaining problem: why do you multiply by l1norm(beta) in the L1 case? Since the gradient of the l1 penalty is gamma * np.sign(beta), this confused me.
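As a quick sanity check of the two gradients above, here is a finite-difference sketch of my own (assuming the penalties gamma * ||beta||_1 and 0.5 * gamma * beta @ beta):

```python
import numpy as np

def num_grad(f, beta, eps=1e-6):
    """Central-difference approximation of the gradient of f at beta."""
    g = np.zeros_like(beta)
    for i in range(beta.size):
        e = np.zeros_like(beta)
        e[i] = eps
        g[i] = (f(beta + e) - f(beta - e)) / (2 * eps)
    return g

gamma = 0.1
beta = np.array([0.5, -2.0, 3.0])

l1_penalty = lambda b: gamma * np.abs(b).sum()
l2_penalty = lambda b: 0.5 * gamma * (b @ b)

print(num_grad(l1_penalty, beta))  # approximately gamma * np.sign(beta)
print(num_grad(l2_penalty, beta))  # approximately gamma * beta
```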

ddbourgin commented 4 years ago

Whoops, yup, that's what I get for being hasty! The regularization penalty is (gamma / 2) * np.sqrt(beta @ beta) ** 2, which gives a gradient of gamma * beta.

In the L1 case, I'd recommend explicitly writing down the L1 penalty (not just the l1 norm) and then trying to derive the gradient wrt beta. It should quickly become clear why there is an l1norm term in the calc :)
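A short worked step showing where that l1norm factor comes from: the penalty as written in lm.py at that commit squares the norm for both orders, so for order = 1 the chain rule gives

```latex
P(\beta) = \frac{\gamma}{2}\,\lVert \beta \rVert_1^{2}
         = \frac{\gamma}{2}\Bigl(\sum_i \lvert \beta_i \rvert\Bigr)^{2},
\qquad
\frac{\partial P}{\partial \beta_j}
  = \gamma \Bigl(\sum_i \lvert \beta_i \rvert\Bigr)\operatorname{sign}(\beta_j)
  = \gamma\,\lVert \beta \rVert_1\,\operatorname{sign}(\beta_j).
```

So the extra l1norm(beta) factor is an artifact of squaring the L1 norm, which is exactly what the rest of the thread questions.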

eromoe commented 4 years ago

@ddbourgin Sorry, but I don't quite understand why the penalty in the L1 case needs to be squared as it is in the L2 case.

penalty = 0.5 * self.gamma * np.linalg.norm(self.beta, ord=order) ** 2   # remains squared in the l1 case

All the articles I saw used an L1 term (penalty) like \lambda \sum_i |\beta_i|, and the derivative is then +-\lambda. Now I am very confused.

ddbourgin commented 4 years ago

Oh! I see what you're saying. You're right, the square of the L1 norm is not what we want. The proper L1 penalty is

gamma * np.abs(beta).sum()

which gives a gradient of

gamma * np.sign(beta)

I'll make a PR to fix this. Thank you very much for pointing this out :)
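For completeness, a hedged sketch of what the corrected gradient of the penalty term could look like (hypothetical helper name, not the actual PR):

```python
import numpy as np

def d_penalty(beta, gamma, order):
    """Gradient of the regularization penalty wrt beta.

    order=2: penalty = 0.5 * gamma * ||beta||_2^2  ->  gradient = gamma * beta
    order=1: penalty = gamma * ||beta||_1          ->  gradient = gamma * sign(beta)
    """
    if order == 2:
        return gamma * beta
    return gamma * np.sign(beta)
```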