facebookresearch / InterWild

Official PyTorch implementation of "Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild", CVPR 2023

Error when train on multi-gpu #15

Closed by dueToLife 9 months ago

dueToLife commented 9 months ago

Thanks for your great work and open source code!

When I use your code to train from scratch on a single GPU, it works fine. But when I train on multiple GPUs:

train.py --gpu 0,1,2,3

I get this error message:

Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/wukp/anaconda3/envs/carl/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/wukp/anaconda3/envs/carl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/wukp/Codes/InterWild/main/model.py", line 104, in forward
    rjoint_proj, rjoint_cam, rmesh_cam, rroot_cam = self.get_coord(rroot_pose, rhand_pose, rshape, rroot_trans, 'right')
  File "/data/wukp/Codes/InterWild/main/model.py", line 40, in get_coord
    output = self.mano_layer[hand_type](betas=shape, hand_pose=hand_pose, global_orient=root_pose, transl=zero_trans)
  File "/home/wukp/anaconda3/envs/carl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wukp/anaconda3/envs/carl/lib/python3.9/site-packages/smplx/body_models.py", line 1672, in forward
    full_pose += self.pose_mean
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

It looks like the data is out of sync across devices. I have tried many solutions, but none of them worked. Now I suspect my Python environment may be the cause. Could you please provide a requirements.txt or give me some advice? Thank you!

mks0601 commented 9 months ago

Hi,

Could you change the MANO layer definition at the line below

https://github.com/facebookresearch/InterWild/blob/186eec4e814e41e6753462e7f8619e2e8cdcddfb/main/model.py#L30

to

self.mano_layer_right = copy.deepcopy(mano.layer['right']).cuda()
self.mano_layer_left = copy.deepcopy(mano.layer['left']).cuda()
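A likely reason this helps, sketched below under assumptions: if the original definition kept the two layers in a plain Python dict (suggested by the `self.mano_layer[hand_type]` lookup in the traceback, but not confirmed here), PyTorch would not register them as submodules. `nn.DataParallel` only moves registered submodules, parameters and buffers to each replica's device, so unregistered layers would stay on `cuda:0`. The classes `DictHeld` and `AttrHeld`, the `nn.Linear` stand-ins and the `get_layer` helper are hypothetical and are not part of InterWild. The sketch runs on CPU and only shows the registration difference.

```python
import torch.nn as nn


class DictHeld(nn.Module):
    """Keeps sub-layers in a plain dict, so PyTorch does not register them."""

    def __init__(self):
        super().__init__()
        # Hypothetical stand-ins for the two MANO layers.
        self.layers = {'right': nn.Linear(3, 3), 'left': nn.Linear(3, 3)}


class AttrHeld(nn.Module):
    """Keeps sub-layers as direct attributes, so PyTorch registers them."""

    def __init__(self):
        super().__init__()
        self.layer_right = nn.Linear(3, 3)
        self.layer_left = nn.Linear(3, 3)

    def get_layer(self, hand_type):
        # Hypothetical helper: look up a layer by name instead of indexing a dict.
        return getattr(self, 'layer_' + hand_type)


# Only registered children are visible to .cuda(), .to() and DataParallel replication.
dict_children = [name for name, _ in DictHeld().named_children()]
attr_children = [name for name, _ in AttrHeld().named_children()]
print(dict_children)
print(attr_children)
```

If the layers are renamed as suggested, any code that indexes `self.mano_layer[hand_type]` would also need to select the matching attribute, for example with a lookup like the hypothetical `get_layer` above. Wrapping the dict in `nn.ModuleDict` would be another way to register the layers, though that is not what was proposed in this thread.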

dueToLife commented 9 months ago

Thank you for your kind reply! It works for me! I'll close this issue.