Open mks0601 opened 5 years ago
Hi
Can you provide a script of your code? I will check it out!
I just used your example code (example/demo_render.py).
I added a model class as below.
class Model(nn.Module):
    def __init__(self, renderer):
        super(Model, self).__init__()
        self.renderer = renderer

    def forward(self, mesh, camera_distance, elevation, azimuth):
        self.renderer.transform.set_eyes_from_angles(camera_distance, elevation, azimuth)
        images = self.renderer.render_mesh(mesh)
        return images
And defined a model with torch.nn.DataParallel after defining renderer.
model = torch.nn.DataParallel(Model(renderer)).cuda()
In the loop, I changed these lines
renderer.transform.set_eyes_from_angles(camera_distance, elevation, azimuth)
images = renderer.render_mesh(mesh)
into
images = model(mesh, camera_distance, elevation, azimuth)
Everything else is the same.
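For anyone trying to reproduce this without SoftRas installed, here is a minimal sketch of the same wrapping pattern. DummyRenderer is a hypothetical stand-in for the real renderer (it is not the SoftRas API), and the inputs are plain tensors on the assumption that DataParallel should split them along the batch dimension. Without a GPU, DataParallel simply calls the wrapped module, so the sketch also runs on CPU.

```python
import torch
import torch.nn as nn


class DummyRenderer(nn.Module):
    """Hypothetical stand-in for a renderer; NOT the SoftRas API."""

    def forward(self, vertices, azimuth):
        # Fake "render": shift every vertex by its sample's azimuth value.
        return vertices + azimuth[:, None, None]


class Model(nn.Module):
    def __init__(self, renderer):
        super().__init__()
        self.renderer = renderer

    def forward(self, vertices, azimuth):
        return self.renderer(vertices, azimuth)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(Model(DummyRenderer())).to(device)

# Batch of 4 fake meshes with 10 vertices each, one azimuth per sample.
vertices = torch.zeros(4, 10, 3, device=device)
azimuth = torch.arange(4, dtype=torch.float32, device=device)

# With several GPUs, each replica receives a slice of the batch dimension.
images = model(vertices, azimuth)
```

As far as I know, DataParallel only splits tensors (including tensors nested in lists, tuples, or dicts). Any other object, such as a mesh instance, is handed to every replica unchanged, so it may be worth checking what each replica actually receives.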
Hi,
I have slightly changed the code. I suspect the problem was that the previous code did not specify the CUDA device in soft_rasterizer. This change may fix the bug.
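I have not checked what the SoftRas change actually looks like, so the following is only a sketch of one common way to specify the CUDA device around a custom kernel call from Python. run_on_tensor_device and fake_rasterize are hypothetical names used for illustration; torch.cuda.device_of is a real PyTorch context manager and is a no-op for CPU tensors.

```python
import torch


def run_on_tensor_device(fn, tensor, *args):
    """Call fn(tensor, *args) with the current CUDA device set to tensor's device.

    Under DataParallel a replica's tensors may live on GPU 1 while the current
    device is still GPU 0; a custom kernel launched in that state can touch
    memory it does not own. device_of switches the current device for the call.
    """
    with torch.cuda.device_of(tensor):
        return fn(tensor, *args)


def fake_rasterize(face_vertices, scale):
    # Hypothetical placeholder for a call into a custom CUDA extension.
    return face_vertices * scale


x = torch.ones(2, 5, 3, 3)
out = run_on_tensor_device(fake_rasterize, x, 2.0)
```

If the kernel is launched from C++ instead, the same idea is usually expressed with a device guard on the input tensor's device.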
Just wondering, did the fix solve your problem? @mks0601
Doesn't seem so for me.
Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aluo/tools/SoftRas/soft_renderer/renderer.py", line 102, in forward
    return self.render_mesh(mesh, mode)
  File "/home/aluo/tools/SoftRas/soft_renderer/renderer.py", line 96, in render_mesh
    mesh = self.lighting(mesh)
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aluo/tools/SoftRas/soft_renderer/lighting.py", line 57, in forward
    mesh.textures = mesh.textures * light[:, :, None, :]
RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 3
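The last line of the traceback is a broadcasting failure: the two operands disagree in dimension 3 (sizes 4 and 3) and neither size is 1. The sketch below re-implements the broadcasting rule in plain Python to show where such shapes clash. The concrete shapes are made up for illustration; the real shapes of mesh.textures and light in that run are not known from the traceback.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError on a clash."""
    # Left-pad the shorter shape with 1s so both have the same rank.
    rank = max(len(a), len(b))
    a = (1,) * (rank - len(a)) + tuple(a)
    b = (1,) * (rank - len(b)) + tuple(b)

    result = []
    for dim, (x, y) in enumerate(zip(a, b)):
        # Two sizes are compatible if they are equal or one of them is 1.
        if x != y and x != 1 and y != 1:
            raise ValueError(
                f"size {x} does not match size {y} at non-singleton dimension {dim}"
            )
        result.append(y if x == 1 else x)
    return tuple(result)


# Hypothetical shapes: textures with 3 channels broadcast fine against the light ...
ok = broadcast_shape((2, 100, 25, 3), (2, 100, 1, 3))

# ... while a 4-wide last dimension clashes with the 3-channel light.
try:
    broadcast_shape((2, 100, 25, 4), (2, 100, 1, 3))
    message = None
except ValueError as err:
    message = str(err)
```

Printing mesh.textures.shape and light.shape just before the failing line in lighting.py would show which operand has the unexpected last dimension.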
I tried a similar strategy for DIB-R and got memory errors.
I somehow fixed this issue. Could you check that all your tensors are CUDA tensors and that there are no out-of-range index problems?
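Both checks can be sketched as a small helper. The tensor layout here is an assumption made for the example (vertices as [batch, num_vertices, 3], faces as integer vertex indices of shape [batch, num_faces, 3]), and check_mesh_tensors is a hypothetical name, not part of SoftRas.

```python
import torch


def check_mesh_tensors(vertices, faces):
    """Return a list of human-readable problems found in a (vertices, faces) pair."""
    problems = []

    # Both tensors should live on the same device before reaching a CUDA kernel.
    if vertices.device != faces.device:
        problems.append(f"device mismatch: {vertices.device} vs {faces.device}")

    # Every face index must point at an existing vertex.
    num_vertices = vertices.shape[1]
    if faces.numel() > 0:
        lo, hi = int(faces.min()), int(faces.max())
        if lo < 0 or hi >= num_vertices:
            problems.append(
                f"face index range [{lo}, {hi}] is outside [0, {num_vertices - 1}]"
            )
    return problems


vertices = torch.zeros(1, 4, 3)
good_faces = torch.tensor([[[0, 1, 2], [1, 2, 3]]])
bad_faces = torch.tensor([[[0, 1, 4]]])  # index 4 does not exist

good_report = check_mesh_tensors(vertices, good_faces)
bad_report = check_mesh_tensors(vertices, bad_faces)
```

An out-of-range face index is a typical cause of an illegal memory access inside a custom kernel, since the kernel reads past the end of the vertex buffer.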
Looking over my code, it seems to be correct. Running the same code on a model without DataParallel works. Could you provide a small snippet showing how you initialize your DataParallel model and run a mesh through it?
I don't think I did anything special with DataParallel. I just set face_texture at L44 of https://github.com/ShichenLiu/SoftRas/blob/master/soft_renderer/rasterizer.py to a zero tensor because I do not use textures.
Note that this change is probably not the solution to this error. Sorry, I asked this question a while ago, so I cannot clearly remember what I did to fix it.
Much appreciated. I'll try again some time later this week and report back with results.
So it works now, following the code example that you provided. Checking via nvidia-smi seems to indicate that processing and memory are distributed between two GPUs.
It turns out there were a few bugs, but they were all introduced when I modified SoftRas (mostly around texture/view transforms). I think we can close this issue now.
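The same check can be approximated from inside the training script. This is a rough sketch: gpu_memory_report is a hypothetical helper, and torch.cuda.memory_allocated only counts memory held by this process's tensors, so the numbers will be lower than what nvidia-smi shows.

```python
import torch


def gpu_memory_report():
    """Return (device index, bytes allocated by this process's tensors) per visible GPU."""
    return [
        (index, torch.cuda.memory_allocated(index))
        for index in range(torch.cuda.device_count())
    ]


# On a machine without GPUs this is simply an empty list.
report = gpu_memory_report()
for index, allocated in report:
    print(f"cuda:{index}: {allocated / 1024 ** 2:.1f} MiB allocated")
```

Calling it right after a forward pass through the DataParallel model should show non-zero values on every GPU that received a slice of the batch.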
Did you train your model with multiple GPUs? When I train my model with your module in a multi-GPU environment, I get the error below. I used nn.DataParallel to wrap my model for multi-GPU training.
RuntimeError: CUDA error: an illegal memory access was encountered (block at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/ATen/cuda/CUDAEvent.h:96)
Can you give me some help?
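A general debugging note for this kind of error: CUDA kernels are launched asynchronously, so the location reported with an illegal memory access is often not the call that caused it. A common first step is to force synchronous launches with the CUDA_LAUNCH_BLOCKING environment variable, which should make the traceback point at the failing kernel. A minimal sketch, assuming these lines sit at the very top of the training script:

```python
import os

# Must be set before the first CUDA call; the safest place is before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

launch_blocking = os.environ["CUDA_LAUNCH_BLOCKING"]
```

This slows training down noticeably, so it is only meant for tracking down the failing call.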