FluxML / MLJFlux.jl

Wrapping deep learning models from the package Flux.jl for use in the MLJ.jl toolbox
http://fluxml.ai/MLJFlux.jl/
MIT License

Penalty (wrongly) not multiplied by the relative batch size #213

Closed. mohamed82008 closed this issue 2 years ago.

mohamed82008 commented 2 years ago

In the following line, I believe the penalty should be multiplied by the relative batch size (batch size / dataset size) such that the expectation of the stochastic gradient evaluated at the same parameter values is proportional to the true gradient.

https://github.com/FluxML/MLJFlux.jl/blob/4aae8c25df008aa5980f3031407c361628ddd6b0/src/core.jl#L39
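For concreteness, here is a minimal sketch of the proposed scaling, assuming an implicit-parameters Flux training step. The function and argument names here are hypothetical, not MLJFlux's actual internals; only the penalty scaling reflects the suggestion above:

```julia
using Flux

# Hypothetical single-batch training step (names illustrative; the real
# MLJFlux loop in src/core.jl differs in its details). The substantive
# change is scaling the penalty by the relative batch size.
function train_batch!(chain, optimiser, loss, penalty, x, y, batch_size, data_size)
    ps = Flux.params(chain)
    gs = Flux.gradient(ps) do
        # before: loss(chain(x), y) + penalty(ps)
        loss(chain(x), y) + (batch_size / data_size) * penalty(ps)
    end
    Flux.Optimise.update!(optimiser, ps, gs)
end
```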

You all probably know this already, but here is the mathematical justification anyway. Let the full-batch loss be L(w) for some parameter values w, and let the mini-batch loss be the conditional random variable l | w. Additionally, let the batch size be n and the full dataset size be N. The expected value of l | w is:

E(l | w) = n/N * L(w)
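This is easy to check numerically. Below is a quick Monte Carlo sanity check, assuming the mini-batch loss is the unweighted sum of per-example losses over a uniformly sampled batch; all names are illustrative, not MLJFlux internals:

```julia
using Statistics, Random

Random.seed!(0)
N, n = 1_000, 32                  # dataset size and batch size
per_example_loss = rand(N)        # stand-in for loss(chain(xᵢ), yᵢ) at fixed w

L = sum(per_example_loss)         # full-batch loss L(w)

# Average the mini-batch loss over many random batches of size n:
l_mean = mean(sum(per_example_loss[randperm(N)[1:n]]) for _ in 1:100_000)

@show l_mean n / N * L            # the two values agree closely
```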

Stochastic gradient descent relies on the fact that the expected value of the stochastic objective is proportional to the true full-batch objective. Let the full-batch objective be:

O(w) = L(w) + penalty(w)

Now let's consider the following two mini-batch objectives o | w and their expectations:

1. o1 | w = (l | w) + penalty(w), with E(o1 | w) = n/N * L(w) + penalty(w)
2. o2 | w = (l | w) + n/N * penalty(w), with E(o2 | w) = n/N * (L(w) + penalty(w)) = n/N * O(w)

The second mini-batch objective is the one whose expectation is proportional to the full-batch objective O(w). The first one over-penalises the weights by a factor of 1 over the relative batch size, i.e. N/n.
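Continuing the sanity check above (reusing per_example_loss, N, n, and L, with an arbitrary stand-in penalty value), the bias of the first objective and the proportionality of the second can both be seen numerically:

```julia
penalty_w = 5.0                       # stand-in for penalty(w) at fixed w
O = L + penalty_w                     # full-batch objective O(w)

sample_l() = sum(per_example_loss[randperm(N)[1:n]])   # one draw of l | w

o1_mean = mean(sample_l() + penalty_w       for _ in 1:100_000)
o2_mean = mean(sample_l() + n/N * penalty_w for _ in 1:100_000)

@show o1_mean - n/N * O    # ≈ (1 - n/N) * penalty(w): over-penalised
@show o2_mean - n/N * O    # ≈ 0: expectation proportional to O(w)
```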

DilumAluthge commented 2 years ago

cc: @ablaom

ablaom commented 2 years ago

@mohamed82008 Great catch! I agree the implementation is incorrect. I don't suppose you would consider making a PR to resolve this?

mohamed82008 commented 2 years ago

I can make a PR; I just wanted to get pre-approval first.

ablaom commented 2 years ago

Awesome. I promise an expedited review.

mohamed82008 commented 2 years ago

PR opened.