getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

[Bug] Gradients in float16 #214

Closed: wjmaddox closed this issue 2 years ago

wjmaddox commented 2 years ago

Hi, I'm trying to get gradients out in half precision, but the gradients seem to be getting clobbered. Here's a minimal working example:

import torch
from pykeops.torch import LazyTensor as KEOLazyTensor

device = torch.device("cuda:0")
half = torch.float16 # works in float32
par = torch.tensor([[0.8]], requires_grad=True, dtype=half, device=device)
xtrain = torch.randn(1000, 3, device=device, dtype=half)
x1 = xtrain / par

x1_ = KEOLazyTensor(x1[..., :, None, :])
x2_ = KEOLazyTensor(x1[..., None, :, :])

distance = ((x1_ - x2_) ** 2).sum(-1).sqrt()
exp_component = (-distance).exp()

y = torch.randn(1000, 1, dtype=half, device=device) / 31.
res = (exp_component @ y)
res.requires_grad # output is False in half but True in float

Ideally, I'd like to backpropagate with respect to par (analogous to a lengthscale for a Matern / RBF kernel).
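
For reference, this is the backward pass I'd like to be able to run once res tracks gradients (a minimal sketch that assumes the example above has been executed and that res ends up with requires_grad=True):

# Intended usage once gradients are tracked in float16:
loss = res.sum()
loss.backward()
print(par.grad)  # d(loss)/d(par); currently unreachable because res.requires_grad is False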

My primary question is whether gradients are supported in KeOps in half precision.

Some digging suggests that in pykeops 1.5, the inputs that might require gradients have their gradient tracking dropped here: https://github.com/getkeops/keops/blob/91cc40a3dcc56db3ca42ea6368d6b1c4dedf4a09/pykeops/torch/half2_convert.py#L108. Would it be safe to either force a gradient flag in that preprocessing step, or to just have autograd track these preprocessing steps?
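
To illustrate the concern (a generic sketch, not the actual half2_convert.py code): any preprocessing that rebuilds the input tensors outside of autograd silently drops the requires_grad flag, which matches what I'm seeing:

import torch

x = torch.randn(4, 3, dtype=torch.float16, device="cuda", requires_grad=True)

# Rebuilding a tensor outside autograd (e.g. via .detach()) yields a leaf with no history,
# so anything computed downstream loses requires_grad:
x_pre = x.detach().reshape(2, 2, 3)   # stand-in for a float16 padding/reshaping step
print(x_pre.requires_grad)            # False

# Keeping the same preprocessing inside autograd preserves the flag:
x_pre2 = x.reshape(2, 2, 3)
print(x_pre2.requires_grad)           # True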

edit: A bit more digging suggests that going around GenredAutograd and doing something like this will get the gradients out:

rlt = (exp_component * y.unsqueeze(-2))
rlt.grad(x1_, y.unsqueeze(-2)).sum(1)

suggesting that it should indeed be possible to compute gradients in float16.
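
For completeness, chaining that back to par could look like this (a hedged sketch; it assumes the workaround above yields d(res)/d(x1) as a dense tensor with the same shape as x1, and the name dres_dx1 is mine):

# Hypothetical chain-rule step: x1 = xtrain / par is still tracked by autograd,
# so a gradient with respect to x1 can be pushed back to par via torch.autograd.grad.
dres_dx1 = rlt.grad(x1_, y.unsqueeze(-2)).sum(1)   # from the workaround above
(par_grad,) = torch.autograd.grad(x1, par, grad_outputs=dres_dx1)
print(par_grad)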

system info:

joanglaunes commented 2 years ago

Hi @wjmaddox, thank you for spotting this. It should be ok now in the master branch. In fact, I had already noticed that the autograd mechanism was not working with the float16 type and made some corrections a few months ago. However, your example was also failing in the master branch for another reason, related to the use of the sqrt operation with float16, and I have just fixed that. So I think the issue is now solved in master; please let us know if you have further questions.
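
A quick way to check, reusing your example (a sketch that assumes the snippet is re-run against the current master build):

# After re-running the example above on master:
print(res.requires_grad)   # should now be True in float16 as well
res.sum().backward()
print(par.grad)            # gradient with respect to the lengthscale-like parameter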

wjmaddox commented 2 years ago

Thanks for the response! It does indeed seem to work on master now (although I get a NaN gradient, but that might be expected due to roundoff error here).

joanglaunes commented 2 years ago

Yes, indeed, your example returns a NaN gradient, but it does the same when using PyTorch instead of KeOps. Looking at it now, it is not related to float16: I think it comes from the non-differentiability of the norm (the sqrt) at 0, so the kernel matrix coefficients are not differentiable at zero distance.
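
Here is a pure-PyTorch sketch of the same effect, independent of KeOps and of float16 (illustrative only):

import torch

x = torch.randn(5, 3, requires_grad=True)

# Pairwise squared distances: the diagonal entries are exactly 0,
# and the derivative of sqrt(u), 1 / (2 * sqrt(u)), blows up at u = 0.
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
kernel = (-sq_dists.sqrt()).exp()
kernel.sum().backward()
print(x.grad)   # contains NaN entries coming from the zero-distance diagonal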