The default learning rate is 1e-4, which seems stable for most compression models.
When warming up with a low initial learning rate (e.g. a ramp from 1e-6 to 1e-4), the training bpp_loss behaves reasonably. However, the actual encoded bitstream comes out around 10x longer than the expected bpp! Perhaps the model is producing outputs or distributions that the entropy coder handles poorly.
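For reference, the warmup I mean is roughly the following (a minimal sketch using PyTorch's `LinearLR`; the model, data, and step count are placeholders, not my actual setup):

```python
import torch

# Placeholder model and data purely for illustration; the point is the schedule.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Ramp linearly from 1e-6 (start_factor * base lr) up to the base lr of 1e-4.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-6 / 1e-4, end_factor=1.0, total_iters=1000
)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8)).pow(2).mean()  # stand-in for the rate-distortion loss
    loss.backward()
    optimizer.step()
    warmup.step()
```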
Related: I have seen NaNs occur for learning rates above 1e-4. That is to be expected once the learning rate is sufficiently large, but something as small as 2e-4 can already NaN certain models. I briefly tried varying the maximum norm for gradient clipping at these higher learning rates, but it did not seem to have an effect.
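For context, the clipping I tried was just standard global-norm clipping before the optimizer step, roughly like this sketch (placeholder model and loss):

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

loss = model(torch.randn(4, 8)).pow(2).mean()  # stand-in loss
loss.backward()

# Clip the global gradient norm before stepping; varying max_norm here
# did not change the NaN behaviour I observed at higher learning rates.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```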
From my experimentation, initial learning rates other than 1e-4 are prone to unstable or unusual behavior.
EDIT: A common reason for this is forgetting to run `model.update(force=True)`. For CompressAI Trainer, this was fixed in https://github.com/InterDigitalInc/CompressAI-Trainer/commit/57847229ff0d387117b84cb7204a806e2a9031bb (2023-04). This would certainly explain why the low warmup learning rate resulted in a mismatch, though I haven't yet tested whether it fixes the issue.
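For anyone hitting the same mismatch, here is a minimal sketch of the intended usage (assuming a CompressAI zoo model purely for illustration; the same applies to a model restored from your own checkpoint, and the important line is the `update(force=True)` call before compressing):

```python
import torch
from compressai.zoo import bmshj2018_factorized

# Pretrained factorized-prior model used only to make the example self-contained.
net = bmshj2018_factorized(quality=3, pretrained=True).eval()

# Rebuild the entropy coder's CDF tables from the learned distributions.
# Skipping this leaves stale tables, so the real bitstream length can
# diverge wildly from the training bpp_loss.
net.update(force=True)

x = torch.rand(1, 3, 256, 256)
with torch.no_grad():
    out = net.compress(x)

num_bits = sum(len(s[0]) for s in out["strings"]) * 8
print(f"actual bpp: {num_bits / (256 * 256):.4f}")
```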