NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

NVIDIA A100: Rendering artefacts #62

Closed FilatovArtm closed 2 years ago

FilatovArtm commented 2 years ago

On some A100s we obtain empty pixels during rendering. The problem reproduces only on A100, and whether it appears is random, depending on the card. We checked V100 and 3090 Ti; they work fine. The NVIDIA forum contains a similar issue (I also reported there): https://forums.developer.nvidia.com/t/nvidia-a100-opengl-drawing-issue/196672

We discovered that removing the line `NVDR_CHECK_GL_ERROR(glEnable(GL_DEPTH_TEST));` in `rasterize.cpp` fixes the issue.

Driver version: 450.119.04

Code to reproduce the issue:

```python
import torch
import nvdiffrast.torch as dr
import numpy as np
from matplotlib import pyplot as plt

def tensor(*args, **kwargs):
    # Convenience wrapper: create tensors directly on the GPU.
    return torch.tensor(*args, device='cuda', **kwargs)

# Unused below; kept from the original report.
depth = 5.1761e+00 / 6.0623e+00

# Clip-space vertex positions (x, y, z, w) for three triangles, batch size 1.
pos = tensor([[[ 8.5783e-02,  9.9548e-02,  2.0576e+00,  3.0065e+00],
         [-1.7052e+00,  1.3828e-01,  2.0506e+00,  2.9996e+00],
         [ 6.6282e-02, -3.4532e+00,  2.0323e+00,  2.9817e+00],
         [ 8.4978e-02,  9.4055e-02,  3.0781e+00,  4.0065e+00],
         [-2.6015e+00,  1.5215e-01,  3.0675e+00,  3.9961e+00],
         [ 1.1423e-01,  5.4232e+00,  3.1161e+00,  4.0437e+00],
         [ 8.4174e-02,  8.8562e-02,  4.0986e+00,  5.0065e+00],
         [ 4.1140e+00,  1.4267e-03,  4.1145e+00,  5.0221e+00],
         [ 1.2805e-01,  8.0822e+00,  4.1556e+00,  5.0623e+00]]], dtype=torch.float32)

# Index triplets: triangle 0 = vertices 0-2, triangle 1 = 3-5, triangle 2 = 6-8.
tri = torch.from_numpy(
    np.arange(len(pos[0])).reshape(-1, 3)
).to(torch.int32).cuda()

glctx = dr.RasterizeGLContext()
# Output channels per pixel: (u, v, z/w, triangle_id + 1); 0 in the last channel means background.
rast, _ = dr.rasterize(glctx, pos, tri, resolution=[1024, 1251])

plt.figure(figsize=[12, 12])
plt.imshow(rast.cpu()[0])  # the four channels are displayed as an RGBA image
plt.show()
```

The results are the following: [attached screenshot of the rasterizer output, showing the missing pixels]
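As a side note, one way to flag the artefact without eyeballing the plot (my own sketch, not part of the original report): the triangle-ID channel should mark the same set of covered pixels on every run, so a fluctuating count indicates dropped pixels.

```python
# Channel 3 holds triangle_id + 1, with 0 meaning background, so counting
# non-zero entries gives the number of covered pixels. On a healthy GPU this
# count is deterministic across runs; on an affected A100 it varies because
# pixels inside the triangles come back empty.
covered = int((rast[..., 3] > 0).sum().item())
print(f"covered pixels: {covered}")
```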

SteveJunGao commented 2 years ago

I also hit this issue when running Nvdiffrast on an A100 GPU. Thanks for sharing the fix! It works now!

Update: it still has issues when rendering in 3D; the depth is incorrect.

dmikis commented 2 years ago

> I also hit this issue when running Nvdiffrast on an A100 GPU. Thanks for sharing the fix! It works now!

If you're referring to disabling the depth test, please note that this changes the rendering logic: hidden-surface removal will no longer happen, so you may get incorrect results.
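To make this concrete, here's a minimal sketch (my own illustration, not code from this thread) of what the depth test provides: two coincident full-screen triangles at different depths, where the nearer one must win at every covered pixel. Without GL_DEPTH_TEST, the per-pixel winner would instead depend on draw order.

```python
import torch
import nvdiffrast.torch as dr

glctx = dr.RasterizeGLContext()

# Two coincident full-screen triangles; triangle 0 is nearer (smaller z/w).
pos = torch.tensor([[
    [-1.0, -1.0, -0.5, 1.0],  # triangle 0 (near)
    [ 3.0, -1.0, -0.5, 1.0],
    [-1.0,  3.0, -0.5, 1.0],
    [-1.0, -1.0,  0.5, 1.0],  # triangle 1 (far)
    [ 3.0, -1.0,  0.5, 1.0],
    [-1.0,  3.0,  0.5, 1.0],
]], device='cuda')
tri = torch.tensor([[0, 1, 2], [3, 4, 5]], dtype=torch.int32, device='cuda')

rast, _ = dr.rasterize(glctx, pos, tri, resolution=[64, 64])

# Channel 3 holds triangle_id + 1, so with working hidden-surface removal the
# near triangle (value 1.0) should win at every pixel both triangles cover.
print(rast[0, 32, 32, 3].item())  # expect 1.0
```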

SteveJunGao commented 2 years ago

Yes, you're right, it still has issues in 3D (the depth is incorrect).

s-laine commented 2 years ago

@SteveJunGao I unfortunately cannot replicate the bug myself on any of the A100s I have access to, so I don't really have a way to proceed here. Are you able to check the VBIOS version of your A100, in case it's related? Are you seeing this consistently on all your A100s or just some of them? If you have both working and failing A100s, can you check if they have the same VBIOS version?

FilatovArtm commented 2 years ago

Yes, they have the same VBIOS. We found that resetting the GPU with `nvidia-smi -r` resolves the issue. Moreover, during the training procedure we sometimes get CUDA_ILLEGAL_MEMORY_ADDRESS errors (we don't know the cause; we use nvdiffrast in a straightforward manner, the same way the examples demonstrate). After restarting the IPython kernel, the artefacts come back! So, somehow, they persist on the GPU until the next reset of the GPU and driver.

s-laine commented 2 years ago

v0.2.8 includes a workaround that fixes this issue. Closing.
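For anyone hitting this later: you can check which version you have installed (assuming the package exposes `__version__`, which current releases do):

```python
import nvdiffrast
print(nvdiffrast.__version__)  # 0.2.8 or later includes the workaround
```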