dvlab-research / Stratified-Transformer

Stratified Transformer for 3D Point Cloud Segmentation (CVPR 2022)
MIT License

DataParallel error -> RuntimeError: Caught RuntimeError in replica 0 on device 0 #75

Closed: praj441 closed this issue 1 year ago

praj441 commented 1 year ago

I am not able to use your code for multi-GPU training with nn.DataParallel; it fails with the error below. The code runs fine when I replace model = torch.nn.DataParallel(model.cuda()) with model = model.cuda().

Have you tried using the code with DataParallel enabled?

Log snippets -

    output = net(feat, coord, offset, batch, neighbor_idx)
  File "/home/prem/anaconda3/envs/alpha/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/prem/anaconda3/envs/alpha/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/prem/anaconda3/envs/alpha/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/prem/anaconda3/envs/alpha/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/prem/anaconda3/envs/alpha/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.

X-Lai commented 1 year ago

Thanks for your interest in our work. Currently, the code only supports DDP.
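
For readers who want to see what switching from DataParallel to DDP generally involves, here is a minimal sketch. It is not the repository's training script: nn.Linear stands in for the actual model, and setup_process_group, wrap_model_ddp, and run_single_process_demo are hypothetical helper names chosen for illustration. The demo runs one process (world size 1) so it can be tried on a single machine, falling back to the gloo backend when no GPU is available.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_process_group(rank: int, world_size: int, init_file: str) -> None:
    """Join the process group; every participating process must call this."""
    # Assumption: NCCL for GPU runs, gloo as a CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=f"file://{init_file}",  # shared file used for rendezvous
        rank=rank,
        world_size=world_size,
    )


def wrap_model_ddp(model: nn.Module, rank: int) -> DDP:
    """Wrap a model in DDP, placing it on GPU `rank` when CUDA is available."""
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        return DDP(model.cuda(rank), device_ids=[rank])
    return DDP(model)  # CPU module: no device_ids


def run_single_process_demo() -> torch.Size:
    """Run one forward pass through a DDP-wrapped stand-in model."""
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    setup_process_group(rank=0, world_size=1, init_file=init_file)
    try:
        model = wrap_model_ddp(nn.Linear(4, 2), rank=0)  # stand-in model
        device = next(model.parameters()).device
        out = model(torch.randn(3, 4, device=device))
        return out.shape
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(run_single_process_demo())
```

In an actual multi-GPU run, one process is started per GPU (for example with torch.multiprocessing.spawn), each with its own rank, and the world size equals the number of processes. Unlike DataParallel, each process then holds its own model replica and its own shard of the data.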