_texture_funcBackward returns nan values in cube mode

wpalfi commented 3 years ago

The following example adds a cube texture to the triangle torch sample:

import torch
import nvdiffrast.torch as dr
import matplotlib.pyplot as plt

def tensor(*args, **kwargs):
    return torch.tensor(*args, device='cuda', **kwargs)

pos = tensor([[[-0.8, -0.8, .2, 1], [0.8, -0.8, .2, 1],
             [-0.8, 0.8, .2, 1]]], dtype=torch.float32)
col = tensor([[[1, 0, 0], [0, 1, 0], [0, 0, 1]]], dtype=torch.float32)
tri = tensor([[0, 1, 2]], dtype=torch.int32)
tex = torch.rand((1, 6, 128, 128, 3), device='cuda', requires_grad=True)
vert_uv = pos[..., :3].clone()

glctx = dr.RasterizeGLContext()
rast, _ = dr.rasterize(glctx, pos, tri, resolution=[512, 512])
uv, _ = dr.interpolate(vert_uv, rast, tri)
out = dr.texture(tex, uv, boundary_mode='cube')
out.mean().backward()

plt.imshow(out[0].detach().cpu())
plt.show()

When I wrap the forward and backward calls in a with torch.autograd.detect_anomaly(): block, it fails with:

...\test_tex.py:15: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
[W ..\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in _texture_funcBackward. Traceback of forward call that caused the error:
  File "...\test_tex.py", line 19, in <module>
    out = dr.texture(tex, uv, boundary_mode='cube')
  File "...\nvdiffrast\torch\ops.py", line 541, in texture
    return _texture_func.apply(filter_mode, tex, uv, filter_mode_enum, boundary_mode_enum)
 (function _print_stack)
Traceback (most recent call last):
  File "...t\test_tex.py", line 20, in <module>
    out.mean().backward()
  File "...\torch\_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "...\torch\autograd\__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function '_texture_funcBackward' returned nan values in its 1th output.
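
For readers without a GPU, the kind of failure reported above can be reproduced in plain PyTorch. The snippet below is only an illustrative sketch: it does not use nvdiffrast, and the sqrt-times-zero expression is an assumed stand-in chosen because its backward pass evaluates 0/0.

```python
import torch

# Stand-in for the texture op: sqrt(0) * 0 is finite in the forward pass,
# but the backward pass of sqrt computes 0 / (2 * 0), which is NaN.
x = torch.zeros(3, requires_grad=True)

caught = None
try:
    with torch.autograd.detect_anomaly():
        y = (torch.sqrt(x) * 0.0).sum()
        y.backward()
except RuntimeError as err:
    # Anomaly mode turns the NaN gradient into an exception.
    caught = str(err)

print(caught)
```

Without the detect_anomaly() context the same backward call completes silently and simply leaves NaN in x.grad.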

Environment: nvdiffrast 0.2.5, Windows 10, VS2019, PyTorch 1.9.0

wpalfi commented 3 years ago

Btw, many thanks, I really love your library! It is incredibly fast 👍.

s-laine commented 3 years ago

Hi @wpalfi, thanks for the report and feedback! The root cause of the issue is attempting to fetch from a cube map using all-zero texture coordinates in the background pixels. The zero vector cannot be projected onto the cube surface, and nvdiffrast doesn't currently handle this in any smart fashion and divides by zero internally. This leads to fetching from an arbitrary texel in the forward pass and producing NaNs in the backward pass.
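
The division by zero described above can be illustrated with a small PyTorch sketch. This is not nvdiffrast's CUDA code: project_to_cube is a hypothetical, simplified projection that scales a direction so its largest-magnitude component becomes 1, which is assumed here only to show why an all-zero vector is a problem.

```python
import torch

def project_to_cube(d):
    # Hypothetical simplified projection: divide by the largest |component|
    # so the vector lands on the surface of the unit cube.
    major = d.abs().max(dim=-1, keepdim=True).values
    return d / major

# Row 0 is a valid direction, row 1 mimics a background pixel with zero uv.
dirs = torch.tensor([[0.5, -2.0, 1.0],
                     [0.0,  0.0, 0.0]], requires_grad=True)

proj = project_to_cube(dirs)   # row 1 evaluates 0 / 0
proj.sum().backward()

print(proj)        # row 0 is finite, row 1 is NaN
print(dirs.grad)   # row 0 is finite, row 1 contains non-finite values
```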

The attached zip below has a modified version of nvdiffrast/common/texture.cu where these issues are remedied. Cube map uvs that cause internal overflows or NaNs now always output a zero result in the forward pass and yield zero gradients. I'll include these modifications also in the next release that I'll try to get out in the next couple of weeks.

nan_fix.zip (9.49 kB)
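
The behaviour described for the patched file (zero output and zero gradient for degenerate coordinates) can be mimicked in plain PyTorch with a masked division. The sketch below is an assumption-laden illustration, not the contents of nan_fix.zip: safe_project_to_cube is a hypothetical helper built on a simplified projection.

```python
import torch

def safe_project_to_cube(d):
    # Hypothetical guarded projection: degenerate (all-zero) inputs
    # produce a zero result and receive a zero gradient.
    major = d.abs().max(dim=-1, keepdim=True).values
    valid = major > 0
    # Replace the zero divisor before dividing so no NaN is ever created.
    safe_major = torch.where(valid, major, torch.ones_like(major))
    proj = d / safe_major
    return torch.where(valid, proj, torch.zeros_like(proj))

dirs = torch.tensor([[0.5, -2.0, 1.0],
                     [0.0,  0.0, 0.0]], requires_grad=True)

proj = safe_project_to_cube(dirs)
proj.sum().backward()

print(proj)        # row 1 is all zeros
print(dirs.grad)   # row 1 is all zeros, everything is finite
```

Masking the divisor before the division, rather than only masking the result, is what keeps the backward pass finite.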

wpalfi commented 3 years ago

Thank you @s-laine for the quick response. So I guess I can just ignore the issue, as the NaN gradients are not propagated and there should be no performance loss either.

Workaround: if you need anomaly detection, add uv = torch.where(rast[..., 3:] == 0, tensor(1.), uv) before dr.texture(), so that background pixels no longer carry all-zero texture coordinates.
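
The masking step in this workaround can be checked on the CPU with synthetic tensors. In the sketch below, rast and uv are hand-built stand-ins for the outputs of dr.rasterize() and dr.interpolate() (assumed shapes [1, H, W, 4] and [1, H, W, 3]); it relies on the last rast channel being zero for pixels not covered by any triangle.

```python
import torch

H = W = 4

# Synthetic rasterizer output: the last channel holds the triangle id,
# with 0 meaning "no triangle" (background).
rast = torch.zeros(1, H, W, 4)
rast[:, 1:3, 1:3, 3] = 1.0

# Synthetic interpolated coordinates: zero in the background,
# a valid direction inside the covered 2x2 region.
uv = torch.zeros(1, H, W, 3)
uv[:, 1:3, 1:3] = torch.tensor([0.2, -0.5, 0.2])

# The workaround: overwrite background coordinates with a nonzero value.
background = rast[..., 3:] == 0          # shape [1, H, W, 1], broadcasts over uv
uv_safe = torch.where(background, torch.tensor(1.0), uv)

print(uv_safe[0, 0, 0])   # background pixel, now (1, 1, 1)
print(uv_safe[0, 1, 1])   # covered pixel, unchanged
```

The texels fetched for background pixels are arbitrary either way, so the replacement value only needs to be a vector that projects cleanly onto the cube.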

s-laine commented 3 years ago

Correct - in this case the NaNs would not be propagated so it's less serious than what anomaly detection suggests. Closing.
