daniilidis-group / neural_renderer

A PyTorch port of the Neural 3D Mesh Renderer

Cannot run renderer on any GPU other than GPU 0 #135

Closed: rohitrango closed this issue 1 year ago

rohitrango commented 2 years ago

Hi Nikos and team,

Thank you for the amazing work! I was trying to run the renderer in a distributed manner, and kept getting illegal memory access errors, so I tried running with CUDA_LAUNCH_BLOCKING=1 to figure out where the error is.

Everything works on GPU 0, but as soon as I switch to any other GPU I get a segmentation fault (SIGSEGV). Here is a minimal example that reproduces the error:

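import torch
import trimesh
from neural_renderer import Renderer

def func():
    dev = torch.device(1)  # cuda:1; the same code works with device 0
    mesh = trimesh.load_mesh('/path/to/mesh.obj')
    v, f = mesh.vertices, mesh.faces
    # Add a batch dimension and repeat to a batch of 100
    v = torch.from_numpy(v).to(dev)[None].repeat(100, 1, 1)
    v = v.float()
    f = torch.from_numpy(f).to(dev)[None].repeat(100, 1, 1)
    r = Renderer(camera_mode='look')
    img = r.render_silhouettes(v, f)
    input()  # pause before printing the result
    print(img.shape, img.dtype, img.min(), img.max())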

The first error occurs in the rasterize_cuda.forward_face_index_map function, in kernel 1. Initially I thought this was because some intermediate variables inside the Renderer class are defined on GPU 0, but moving them to the target GPU doesn't change anything (I made sure that every tensor that goes into rasterize_cuda.forward_face_index_map has device cuda:1).

Could you tell me what could be happening here? Thanks!
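
The device check described above can be sketched roughly as follows. This is only an illustration: find_misplaced is a hypothetical helper, the labels "faces" and "face_index_map" are made up, and plain stand-in objects replace real tensors so the snippet runs without a GPU. With PyTorch tensors, str(tensor.device) yields strings such as "cuda:1", which is what the comparison relies on.

```python
from types import SimpleNamespace


def find_misplaced(named_tensors, expected="cuda:1"):
    """Return {name: device} for every entry that is not on the expected device.

    Hypothetical helper: it only reads a `.device` attribute, so it works with
    torch tensors as well as with the stand-in objects used below.
    """
    misplaced = {}
    for name, tensor in named_tensors.items():
        device = str(getattr(tensor, "device", "no .device attribute"))
        if device != expected:
            misplaced[name] = device
    return misplaced


# Stand-ins for tensors (assumption: real code would pass the actual inputs
# of the CUDA call here instead).
fake_inputs = {
    "faces": SimpleNamespace(device="cuda:1"),
    "face_index_map": SimpleNamespace(device="cuda:0"),
}

report = find_misplaced(fake_inputs)
print(report)
```

As this thread shows, an empty report does not rule out a crash: every input was on cuda:1 and the kernel still failed.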

rohitrango commented 1 year ago

Never mind, it's a CUDA problem (with some hardcoded parameters in the repo).
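
For readers who hit the same symptom: the thread does not say which parameters were hardcoded, so the following is not the fix applied here, only a generic workaround sketch. Since everything works on GPU 0, one option is to make the desired physical GPU the only one visible to the process via the CUDA_VISIBLE_DEVICES environment variable, so that it appears as device 0. The helper name pin_visible_gpu is hypothetical.

```python
import os


def pin_visible_gpu(physical_index):
    """Expose a single physical GPU to this process.

    Hypothetical helper. After this call, CUDA code started later in the
    process sees only that GPU, addressed as device 0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Must run before anything initializes CUDA in this process,
# i.e. before the first CUDA call made through torch.
pin_visible_gpu(1)

# From here on, torch.device('cuda:0') would refer to physical GPU 1.
```

In a multi-process setup with one process per GPU, each process could call this with its own GPU index at startup and then use device 0 throughout; whether that fits a given launcher is an assumption to verify.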