This PR adds `multi_tensor_unscale_l2norm_cuda`, which fuses gradient unscaling (as used with AMP) and the L2-norm computation of the gradients into a single kernel.
To preserve the original precision of the gradients (especially FP16), unscaling is only accounted for in the norm computation; it is not applied to the gradients themselves.
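For reference, here is a minimal pure-PyTorch sketch of the semantics the fused kernel is meant to compute (not the kernel itself, and not its actual call signature — the function and parameter names below are illustrative): the norm is taken over the unscaled gradients while the stored gradients remain untouched.

```python
import torch

def unscaled_l2norm_reference(grads, inv_scale):
    """Reference semantics: L2 norm of grads * inv_scale,
    without modifying grads in place (they keep their
    original, e.g. FP16, values)."""
    sq_sum = torch.zeros((), dtype=torch.float32, device=grads[0].device)
    for g in grads:
        # Accumulate in FP32 for accuracy; only the norm sees the unscaling.
        sq_sum += (g.float() * inv_scale).pow(2).sum()
    return sq_sum.sqrt()
```

The fused kernel accomplishes the same result in one pass over the gradient tensors, instead of a separate unscale step followed by a norm reduction.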