Why is loss_scale required for tensor cores.

mattangus commented 4 years ago

I've been trying to speed up my training using tensor cores. I've been looking through the code and reading other issues. I came across this line:

if (state.index != 0 && state.net.cudnn_half && !l.xnor && (!state.train || (iteration_num > 3 * state.net.burn_in) && state.net.loss_scale != 1) &&
        (l.c / l.groups) % 8 == 0 && l.n % 8 == 0 && l.groups <= 1 && l.size > 1)

The first parts make sense to me. The latter bits don't. Specifically:

loss_scale != 1: Why is a loss scale needed?
(l.c / l.groups) % 8 == 0: I don't think I understand groups well enough to know what this part means.
l.groups <= 1: This seems redundant with the one above. If groups == 1, then the one above reduces to l.c % 8 == 0.

I saw this comment that says loss_scale is needed, but it gives no explanation.

Any comments on this would be very helpful!

AlexeyAB commented 4 years ago

https://developer.nvidia.com/automatic-mixed-precision

Enabling mixed precision involves two steps: porting the model to use the half-precision data type where appropriate, and using loss scaling to preserve small gradient values.

https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/

https://nvlabs.github.io/iccv2019-mixed-precision-tutorial/files/dusan_stosic_intro_to_mixed_precision_training.pdf

mattangus commented 4 years ago

This is exactly what I was looking for. Thanks!

AlexeyAB / darknet

Why is loss_scale required for tensor cores. #6866