awslabs / fast-differential-privacy

Fast, memory-efficient, scalable optimization of deep learning with differential privacy
Apache License 2.0

Mixed precision training #18

Closed · lccnl closed this issue 7 months ago

lccnl commented 7 months ago

Hello, I would like to use the library to train in mixed precision, for example with Hugging Face's accelerate or directly with torch's amp context manager.

I end up with a dtype mismatch (float32 versus float16).

The easiest solution I see would be to convert everything to float32 when calling this method and this one.
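For concreteness, here is a minimal sketch of the kind of cast I have in mind, done directly inside forward/backward hooks. This is plain PyTorch on a toy linear layer (assuming a CUDA device), not the library's actual hooks:

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 4).cuda()
saved = {}

def forward_hook(module, inputs, output):
    # cache the activation for the per-sample gradient computation
    saved["activation"] = inputs[0].detach()

def backward_hook(module, grad_input, grad_output):
    # cast to float32 before the DP computations, regardless of what amp produced
    a = saved["activation"].float()
    g = grad_output[0].float()
    # per-sample weight gradient of a linear layer, now safely in float32
    saved["per_sample_grad"] = torch.einsum("bi,bj->bij", g, a)

layer.register_forward_hook(forward_hook)
layer.register_full_backward_hook(backward_hook)

x = torch.randn(2, 8, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    loss = layer(x).square().mean()
loss.backward()
print(saved["per_sample_grad"].dtype)  # torch.float32
```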

Do you have a better one? I am happy to make a PR if needed :)

woodyx218 commented 7 months ago

Hi, we already support mixed precision training that uses float16 or bf16 for all layers, which is different from amp, which may apply float16 to some layers but float32 to others. With float16, for example, the activations, backprop gradients, param.grad and param.summed_clipped_grad should all be float16 before the optimizer is called. Feel free to make a PR!
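In plain PyTorch terms, "all layers in float16" means converting the model once instead of relying on amp's per-op policy; a toy sketch (assuming a CUDA device, not this library's code):

```python
import torch
import torch.nn as nn

# convert the whole model to float16 instead of relying on amp's per-op policy
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1)).cuda().half()
x = torch.randn(4, 16, device="cuda", dtype=torch.float16)

loss = model(x).square().mean()
loss.backward()

# activations, backprop gradients and param.grad all come out as float16 here
for p in model.parameters():
    assert p.dtype == torch.float16 and p.grad.dtype == torch.float16
```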

lccnl commented 7 months ago

OK, thanks for the explanation! I made a PR where I cast everything to float32 when calling the hooks. Let me know if that works for you!

woodyx218 commented 7 months ago

I am confused here. If "everything" is cast to float32, then this is no longer mixed precision and all efficiency benefits are lost, right?

lccnl commented 7 months ago

My understanding is that, since the hook does not return a gradient in the current implementation, the standard backpropagation still runs in mixed precision while each hook runs in float32. Am I wrong?

So indeed, if all layers run in float16, a little overhead is added because the DP part is in float32. Maybe I could add a flag that controls whether to convert to float32, so that both cases stay optimized (all layers in float16, or only some)? It could be an argument of the privacy engine's init or of the attach method, something like cast_dp_hook_to_float32? See the sketch below.

Open to better ideas, naming etc!
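As a sketch of what I mean (every name here is hypothetical, not the library's API):

```python
import torch
import torch.nn as nn

def attach_dp_hooks(module: nn.Module, cast_dp_hook_to_float32: bool = False):
    """Toy stand-in for the privacy engine's hooks; names are hypothetical."""

    def forward_hook(mod, inputs, output):
        mod._dp_activation = inputs[0].detach()

    def backward_hook(mod, grad_input, grad_output):
        a, g = mod._dp_activation, grad_output[0]
        if cast_dp_hook_to_float32:
            a, g = a.float(), g.float()  # promote only when the flag is on
        # per-sample weight gradients as a stand-in for the DP book-keeping
        mod._dp_per_sample_grad = torch.einsum("bi,bj->bij", g, a)

    module.register_forward_hook(forward_hook)
    module.register_full_backward_hook(backward_hook)

layer = nn.Linear(8, 4)
attach_dp_hooks(layer, cast_dp_hook_to_float32=True)
layer(torch.randn(2, 8)).sum().backward()
print(layer._dp_per_sample_grad.dtype)  # torch.float32
```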

woodyx218 commented 7 months ago

Because we are using the algorithm from "Differentially Private Optimization on Large Model at Small Cost", the hooks run during back-propagation and should also use mixed precision, e.g., float16 output gradients are book-kept by the hooks. My concern is that running the DP part in float32 would be less efficient when only some layers use float32 activations (and the rest float16).
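For instance, in plain PyTorch (a toy check assuming a CUDA device, not this library's code), a backward hook simply sees whatever dtype the backward pass produces:

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 4).cuda().half()
seen = {}

def backward_hook(module, grad_input, grad_output):
    # record the dtype of the output gradient that would be book-kept
    seen["grad_output_dtype"] = grad_output[0].dtype

layer.register_full_backward_hook(backward_hook)
layer(torch.randn(2, 8, device="cuda", dtype=torch.float16)).sum().backward()
print(seen["grad_output_dtype"])  # torch.float16
```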

lccnl commented 7 months ago

OK, so I apply type promotion only when the input tensors have different dtypes; that way, if everything is in float16, there is no loss of speed.
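Concretely, the promotion logic looks roughly like this (the helper name is mine, for illustration only):

```python
import torch

def maybe_promote(a: torch.Tensor, b: torch.Tensor):
    """Cast both tensors to a common dtype only when their dtypes differ."""
    if a.dtype == b.dtype:
        return a, b                               # fast path: no cast, no copy
    common = torch.promote_types(a.dtype, b.dtype)
    return a.to(common), b.to(common)

g = torch.randn(2, 4, dtype=torch.float16)
a = torch.randn(2, 8, dtype=torch.float16)
print(maybe_promote(g, a)[0].dtype)               # torch.float16 (untouched)
print(maybe_promote(g, a.float())[0].dtype)       # torch.float32 (promoted)
```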

woodyx218 commented 7 months ago

Thank you @lccnl for contributing the pull request. We are preparing a major update and will test your PR as soon as possible. I am closing this issue for now.

lccnl commented 7 months ago

ok thanks!