AlexHex7 / Non-local_pytorch

Implementation of Non-local Block.
Apache License 2.0

multi-gpu problem #10

Closed rainofmine closed 5 years ago

rainofmine commented 6 years ago

When I run on one GPU, everything is fine. When I run on multiple GPUs, I get an error like this:

File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
File "/data3/hooks/retinanet/model.py", line 274, in forward
    x3 = self.layer3(x2)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
File "/data3/hooks/retinanet/non_local.py", line 94, in forward
    output = self.operation_function(x)
File "/data3/hooks/retinanet/non_local.py", line 101, in _embedded_gaussian
    g_x = self.g(x).view(batch_size, self.inter_channels, -1)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
File "/data2/gjt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

AlexHex7 commented 6 years ago

How about removing the non-local block and then running the code on multiple GPUs? The error says the two arguments should be on the same device; which tensors are they? And did you wrap your model with data parallel?

rainofmine commented 6 years ago

After removing the non-local layer, it runs fine. I also find that the simple version works with multiple GPUs. I'm trying to figure out what the difference is...

AlexHex7 commented 5 years ago

@rainofmine Hi, someone shared the reason with me. I have updated my code, and you can find the explanation in https://github.com/pytorch/pytorch/issues/8637
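For anyone hitting this later, here is a minimal CPU-only sketch of the mechanism discussed in that PyTorch issue, as I understand it. `nn.DataParallel` replicates a module onto each device by (roughly) shallow-copying its attributes and then swapping in per-device parameters. A bound method stored as an attribute (like `self.operation_function` in `non_local.py`) still refers to the *original* module after that copy, so replicas on other GPUs end up running the convolution whose weights live on device 0. `Block` and its `device` field below are stand-ins for illustration, not the real module:

```python
import copy


class Block:
    def __init__(self):
        self.device = 0  # pretend the parameters live on GPU 0
        # Pitfall: storing a bound method as an attribute captures a
        # reference to *this* instance.
        self.operation_function = self._embedded_gaussian

    def _embedded_gaussian(self, x):
        # Stands in for self.g(x): uses whichever module the method
        # is bound to, i.e. that module's weights/device.
        return self.device

    def forward(self, x):
        return self.operation_function(x)


orig = Block()

# DataParallel-style replication: a new object whose __dict__ is a
# shallow copy of the original's.
replica = copy.copy(orig)
replica.device = 1  # pretend its parameters were moved to GPU 1

# The replica still calls the ORIGINAL module's method, so it sees
# the device-0 weights -- exactly the mismatch in the traceback above.
print(replica.forward(None))  # -> 0, not 1

# Rebinding the attribute to the replica's own method removes the
# mismatch; equivalently, don't store the bound method at all and
# select the concrete method inside forward().
replica.operation_function = replica._embedded_gaussian
print(replica.forward(None))  # -> 1
```

That is why removing the stored `operation_function` attribute and dispatching directly inside `forward` makes the block work under multi-GPU training.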