ShichenLiu / SoftRas

Project page of paper "Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning"
MIT License

Multi-gpu training #3

Open mks0601 opened 5 years ago

mks0601 commented 5 years ago

Did you train your model with multiple GPUs? When I train my model with your module in a multi-GPU environment, I get the error below. I used nn.DataParallel to wrap my model for multi-GPU training.

RuntimeError: CUDA error: an illegal memory access was encountered (block at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/ATen/cuda/CUDAEvent.h:96)

Can you give me some help?

ShichenLiu commented 5 years ago

Hi

Can you provide a script of your code? I will check it out!

mks0601 commented 5 years ago

I just used your example code (example/demo_render.py).

I added a model class as below.

import torch
import torch.nn as nn

# Wrapper module so the renderer calls happen inside forward() and can be
# replicated by nn.DataParallel.
class Model(nn.Module):

    def __init__(self, renderer):
        super(Model, self).__init__()
        self.renderer = renderer

    def forward(self, mesh, camera_distance, elevation, azimuth):
        self.renderer.transform.set_eyes_from_angles(camera_distance, elevation, azimuth)
        images = self.renderer.render_mesh(mesh)

        return images

Then I defined the model with torch.nn.DataParallel after defining the renderer.

model = torch.nn.DataParallel(Model(renderer)).cuda()

In the loop, I changed those lines

renderer.transform.set_eyes_from_angles(camera_distance, elevation, azimuth)
images = renderer.render_mesh(mesh)

into

images = model(mesh, camera_distance, elevation, azimuth)

Everything else is the same.

ShichenLiu commented 5 years ago

Hi,

I have slightly changed the code. I suspect the problem is that the previous code did not specify the CUDA device in soft_rasterizer. Maybe this will fix the bug.
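
For reference, the usual failure mode with nn.DataParallel and custom CUDA extensions is that kernels get launched on the default device instead of the replica's device. A minimal sketch of the general pattern (not the actual SoftRas patch; custom_cuda_op is a placeholder):

import torch

def device_safe_forward(custom_cuda_op, *tensors):
    # Placeholder helper, not SoftRas API: launch the custom CUDA op on the
    # device that owns the inputs, so replicas on cuda:1, cuda:2, ... do not
    # fall back to the default device (cuda:0).
    with torch.cuda.device(tensors[0].device):
        return custom_cuda_op(*tensors)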

ReNginx commented 4 years ago

Just wondering, does the fix solve your problem? @mks0601

aluo-x commented 4 years ago

Doesn't seem so for me.

Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aluo/tools/SoftRas/soft_renderer/renderer.py", line 102, in forward
    return self.render_mesh(mesh, mode)
  File "/home/aluo/tools/SoftRas/soft_renderer/renderer.py", line 96, in render_mesh
    mesh = self.lighting(mesh)
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aluo/tools/SoftRas/soft_renderer/lighting.py", line 57, in forward
    mesh.textures = mesh.textures * light[:, :, None, :]
RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 3

I tried a similar strategy for DIB-R and got memory errors.

mks0601 commented 4 years ago

I somehow fixed this issue. Could you check that all your tensors are CUDA tensors and that there is no out-of-range index problem?
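
In case it helps, a quick sanity check could look like the sketch below (a hypothetical helper, not part of SoftRas; it assumes batched vertices and integer face indices):

import torch

def check_mesh_inputs(vertices, faces):
    # Hypothetical helper: confirm everything is on the GPU and face indices
    # stay within the vertex range before handing tensors to the renderer.
    assert vertices.is_cuda and faces.is_cuda, "expected CUDA tensors"
    assert faces.dtype in (torch.int32, torch.int64), "faces should hold integer indices"
    assert int(faces.min()) >= 0, "negative face index"
    assert int(faces.max()) < vertices.shape[-2], "face index out of range"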

aluo-x commented 4 years ago

Looking over my code, it seems to be correct. Running the same code on a model without DataParallel works. Could you provide a small snippet of how you initialize your DataParallel model and run a mesh through it?

mks0601 commented 4 years ago

I don't think I did anything special with DataParallel. I just set face_texture at L44 of https://github.com/ShichenLiu/SoftRas/blob/master/soft_renderer/rasterizer.py to zero tensors because I do not use textures.
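
Roughly what I mean, as a sketch (the real change is inside soft_renderer/rasterizer.py and the variable name there may differ):

import torch

def zero_out_textures(face_textures):
    # Sketch only: return a zero tensor with the same shape, dtype and device,
    # which effectively disables texturing.
    return torch.zeros_like(face_textures)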

mks0601 commented 4 years ago

Note that that change is probably not a solution to this error. Sorry, I asked this question a while ago, so I cannot clearly remember what I did to fix it.

aluo-x commented 4 years ago

Much appreciated. I'll try again some time later this week and report back with results.

aluo-x commented 4 years ago

So it works now, following the code example you provided. Checking via nvidia-smi seems to indicate that processing/memory is distributed between the two GPUs.

It turns out there were a few bugs, but they were all introduced when I modified SoftRas (mostly around texture/view transforms). I think we can close this issue now.
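
For anyone who wants to cross-check programmatically (just a sketch using standard PyTorch calls), something like this prints the per-GPU memory allocated by the current process after a forward pass:

import torch

for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / (1024 ** 2)
    print(f"cuda:{i} allocated: {mib:.1f} MiB")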