andreped / GradientAccumulator

:dart: Accumulated Gradients for TensorFlow 2
https://gradientaccumulator.readthedocs.io/

Better compatibility with batch normalization #47

Closed andreped closed 1 year ago

andreped commented 1 year ago

As you all know, gradient accumulation is not directly compatible with batch normalization: the batch normalization statistics are updated on every single forward step, and we cannot control that externally.

In order to get the same behaviour as for gradient updates, we will likely need to implement a custom batch normalization layer which does this internally, as overloading the original batch norm step seems challenging (due to its extreme complexity).
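For context, a minimal sketch of the accumulation pattern (plain eager TensorFlow, not the actual wrapper code in this repo) shows where the problem sits: gradients can be summed and applied once every `accum_steps` mini-batches, but any standard BatchNormalization layer inside the model still updates its moving statistics on every forward pass.

```python
import tensorflow as tf


def train_with_accumulation(model, optimizer, loss_fn, dataset, accum_steps=4):
    """Sketch: sum gradients over `accum_steps` mini-batches, then apply once.
    Assumes `model` is already built (e.g. created with the functional API)."""
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
    step = 0
    for x, y in dataset:
        with tf.GradientTape() as tape:
            y_pred = model(x, training=True)
            # scale so the summed gradient matches one large-batch gradient
            loss = loss_fn(y, y_pred) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [a + g for a, g in zip(accum_grads, grads)]
        step += 1
        if step % accum_steps == 0:
            # weights (kernels, biases, BN gamma/beta) are only updated here ...
            optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
            accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
        # ... but a BatchNormalization layer has already updated its moving
        # mean/variance inside `model(x, training=True)` above, once per
        # mini-batch, which we cannot control from out here.
```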

andreped commented 1 year ago

I implemented a custom batch norm layer recently, which was made available in v0.3.2.

Implementation can be seen here.

The plan is to add accumulation to the call step, similarly to what is done in the Model wrapper.

The custom layer reaches very similar results to Keras' BN layer, but the results should be identical before we can say that it is working as intended.
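To illustrate the idea (a hypothetical sketch, not the implementation linked above), accumulation in the call step could look roughly like this: the per-batch statistics are summed over `accum_steps` forward passes, and the moving mean/variance are only folded in once per accumulation window.

```python
import tensorflow as tf


class AccumBatchNormSketch(tf.keras.layers.Layer):
    """Hypothetical sketch: delay the moving mean/variance update so it
    happens once per `accum_steps` forward passes instead of every batch."""

    def __init__(self, accum_steps=4, momentum=0.99, epsilon=1e-3, **kwargs):
        super().__init__(**kwargs)
        self.accum_steps = accum_steps
        self.momentum = momentum
        self.epsilon = epsilon

    def build(self, input_shape):
        dim = input_shape[-1]
        self.gamma = self.add_weight(name="gamma", shape=(dim,), initializer="ones", trainable=True)
        self.beta = self.add_weight(name="beta", shape=(dim,), initializer="zeros", trainable=True)
        self.moving_mean = self.add_weight(name="moving_mean", shape=(dim,), initializer="zeros", trainable=False)
        self.moving_var = self.add_weight(name="moving_var", shape=(dim,), initializer="ones", trainable=False)
        # running sums of the batch statistics within the current window
        self.acc_mean = self.add_weight(name="acc_mean", shape=(dim,), initializer="zeros", trainable=False)
        self.acc_var = self.add_weight(name="acc_var", shape=(dim,), initializer="zeros", trainable=False)
        self.acc_count = self.add_weight(name="acc_count", shape=(), initializer="zeros", trainable=False)

    def call(self, inputs, training=None):
        if not training:
            # inference: use the moving statistics, exactly like regular BN
            return self.gamma * (inputs - self.moving_mean) / tf.sqrt(self.moving_var + self.epsilon) + self.beta

        # batch statistics over all axes except the channel axis
        axes = list(range(len(inputs.shape) - 1))
        batch_mean, batch_var = tf.nn.moments(inputs, axes=axes)

        # accumulate instead of updating the moving statistics right away
        self.acc_mean.assign_add(batch_mean)
        self.acc_var.assign_add(batch_var)
        self.acc_count.assign_add(1.0)

        def _update_moving_stats():
            mean = self.acc_mean / self.acc_count
            var = self.acc_var / self.acc_count
            self.moving_mean.assign(self.momentum * self.moving_mean + (1.0 - self.momentum) * mean)
            self.moving_var.assign(self.momentum * self.moving_var + (1.0 - self.momentum) * var)
            self.acc_mean.assign(tf.zeros_like(self.acc_mean))
            self.acc_var.assign(tf.zeros_like(self.acc_var))
            self.acc_count.assign(0.0)
            return tf.constant(True)

        # only touch the moving statistics once per accumulation window
        tf.cond(self.acc_count >= float(self.accum_steps), _update_moving_stats, lambda: tf.constant(True))

        # normalize with the current batch statistics, as regular BN does in training
        return self.gamma * (inputs - batch_mean) / tf.sqrt(batch_var + self.epsilon) + self.beta
```

This sketch only covers the moving statistics; how the trainable beta and gamma should be treated is the part discussed further below.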

andreped commented 1 year ago

To add accumulation support to the custom BN layer, relevant resources are this, this, and this.

andreped commented 1 year ago

I have added gradient accumulation support to the custom BN layer now in 05fb499078408283c4544041b166c8d8963cd1ad.

However, unit tests show that the results are not equivalent when increasing accum_steps. Hence, it is not yet working as intended.
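For reference, an equivalence check along those lines could look like the sketch below (assuming the `GradientAccumulateModel` wrapper and `AccumBatchNormalization` layer names of this package; the model, tolerances, and exact setup are placeholders): a plain model trained with batch size `batch_size * accum_steps` should end up with (near-)identical weights to an accumulated model trained with batch size `batch_size`.

```python
import numpy as np
import tensorflow as tf

# assumed imports; adjust to the actual module layout of this package
from gradient_accumulator import GradientAccumulateModel, AccumBatchNormalization


def build_model(accum_steps=None):
    """Tiny model; uses the custom BN layer when accum_steps is given."""
    inputs = tf.keras.Input(shape=(8,))
    x = tf.keras.layers.Dense(16)(inputs)
    if accum_steps is None:
        x = tf.keras.layers.BatchNormalization()(x)
    else:
        x = AccumBatchNormalization(accum_steps=accum_steps)(x)
    outputs = tf.keras.layers.Dense(1)(x)
    return inputs, outputs


def test_bn_equivalence(accum_steps=4, batch_size=8):
    data = np.random.rand(256, 8).astype("float32")
    target = np.random.rand(256, 1).astype("float32")

    # reference: plain model trained with the full (large) batch size
    tf.keras.utils.set_random_seed(42)
    inputs, outputs = build_model()
    ref = tf.keras.Model(inputs, outputs)
    ref.compile(optimizer=tf.keras.optimizers.SGD(1e-2), loss="mse")
    ref.fit(data, target, batch_size=batch_size * accum_steps, epochs=1, shuffle=False)

    # accumulated: small batches, gradients and BN statistics accumulated
    tf.keras.utils.set_random_seed(42)
    inputs, outputs = build_model(accum_steps=accum_steps)
    acc = GradientAccumulateModel(accum_steps=accum_steps, inputs=inputs, outputs=outputs)
    acc.compile(optimizer=tf.keras.optimizers.SGD(1e-2), loss="mse")
    acc.fit(data, target, batch_size=batch_size, epochs=1, shuffle=False)

    # the two runs should end up with (near-)identical weights
    for w_ref, w_acc in zip(ref.get_weights(), acc.get_weights()):
        np.testing.assert_allclose(w_ref, w_acc, atol=1e-5)
```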

andreped commented 1 year ago

Support for gradient accumulation in batch normalization has been added. Even though it is not perfect, at least people can now test it and improve it further.

I also updated the documentation regarding how to use it in f54e389d1d2979cbf0409ddab4a726e58367fcae.
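For completeness, usage along the lines of the updated documentation would look roughly like this (a sketch assuming the `GradientAccumulateModel` and `AccumBatchNormalization` names exposed by the package):

```python
import tensorflow as tf
from gradient_accumulator import GradientAccumulateModel, AccumBatchNormalization

accum_steps = 4

inputs = tf.keras.Input(shape=(32,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
# drop-in replacement for tf.keras.layers.BatchNormalization
x = AccumBatchNormalization(accum_steps=accum_steps)(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

# wrap the model so that gradients are also accumulated over `accum_steps`
model = GradientAccumulateModel(accum_steps=accum_steps, inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```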

We can open a separate issue to benchmark it properly. It might also be that opening a discussion in the Discussions tab is the way to go. Hence, I am closing this issue.

axeldavy commented 1 year ago

Hi,

When looking at the proposed batch normalization, I get the impression that the main modification is that the running mean and variance are updated less frequently. As these are only used at test time, this should not impact training. Besides, a similar effect could be achieved by changing the multiplication factor in the running average so that it averages over more batches.
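Concretely, that adjustment would amount to something like the following (a sketch; the exact exponent is a modelling choice, and, as noted, it only affects the moving statistics used at test time):

```python
import tensorflow as tf

accum_steps = 4
base_momentum = 0.99

# with this per-batch momentum, `accum_steps` consecutive updates decay the
# old moving statistics by the same factor as one update with `base_momentum`
momentum = base_momentum ** (1.0 / accum_steps)

bn = tf.keras.layers.BatchNormalization(momentum=momentum)
```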

Maybe I'm missing something of course.

One suggestion would be to implement a batch normalization known to work well with small batches. For example Yolov4 (https://arxiv.org/pdf/2004.10934.pdf), a well known object detector, implements Cross mini-Batch Normalization for this purpose.

andreped commented 1 year ago

It makes sense to relax the weight of the running average, but note that the BatchNormalization layer does two things: 1) it updates the mean and variance in the forward step, and 2) it updates the beta and gamma in the backward step.

By adjusting the weight in the running average, you tackle the mean and variance, but the two remaining variables are still updated too frequently.
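The split can be seen directly by inspecting the layer's variables: gamma and beta are trainable weights that receive gradients in the backward step, while the moving mean and variance are non-trainable and are assigned to during every training forward pass.

```python
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build((None, 16))

# trainable: updated via gradients in the backward step (gamma, beta)
print([v.name for v in bn.trainable_variables])
# non-trainable: assigned to on every training forward pass (moving_mean, moving_variance)
print([v.name for v in bn.non_trainable_variables])
```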

I just implemented this layer to work as a drop-in replacement for regular BN that is compatible with gradient accumulation, with no new fancy tricks. Changing how often the BN layer is updated is exactly what gradient accumulation does, so that was the intended behaviour.

But thanks for the suggested paper. I will look into the idea :]

axeldavy commented 1 year ago

In my understanding of the code, the running mean and variance are the only parts that are updated less frequently in this version, and these are only used at test time. The batch mean and variance are still used for the forward pass, as in a normal BatchNormalization.

andreped commented 1 year ago

The modifications I made to the BN layer are only relevant for training. It should work as expected for inference, regardless of gradient accumulation. But then again, I have yet to properly benchmark it. Sadly, I have had little time for open source recently.