Hi @amlankar, I am trying to train the model on multiple GPUs, because in theory multi-GPU training can both reduce training time substantially and improve accuracy through a larger batch size. Unfortunately, I got an AssertionError. Here is what I did:
I wrapped the model with nn.DataParallel() like this:
model = polyrnnpp.PolyRNNpp(self.opts)
model = nn.DataParallel(model, device_ids=(0, 1))  # I have two 1080 Ti GPUs
self.model = model.cuda()
For training, I move the input data to CUDA before the forward pass (the call is visible in the traceback below). Then the AssertionError occurred:
Starting training
Saved model
/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/functional.py:1006: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/functional.py:995: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
Traceback (most recent call last):
File "/mnt/data/polygonRNN_pluss/code/Scripts/train/train_ce.py", line 328, in <module>
trainer.loop()
File "/mnt/data/polygonRNN_pluss/code/Scripts/train/train_ce.py", line 148, in loop
self.train(epoch)
File "/mnt/data/polygonRNN_pluss/code/Scripts/train/train_ce.py", line 163, in train
output = self.model(data['img'].type(self.dtype).cuda(), data['fwd_poly'].type(self.dtype).cuda())
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.gather(outputs, self.output_device)
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in gather_map
for k in out))
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in <genexpr>
for k in out))
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/ztian5/.local/lib/python2.7/site-packages/torch/nn/parallel/_functions.py", line 52, in forward
assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
Process finished with exit code 1
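The assertion that fails in the traceback checks `i.is_cuda` for every tensor handed to `Gather`, and `gather_map` reaches it while iterating over the keys of the model output. One way to narrow this down, assuming the model's forward pass returns a nested dict/list/tuple of tensors (an assumption, not confirmed above), is to run a forward pass on a single GPU and list the leaves that are not on the GPU. `find_non_cuda` below is a hypothetical helper, not part of PyTorch or of this repository; it is only a sketch:

```python
def find_non_cuda(obj, path="output"):
    """Return the paths of tensor-like leaves whose `is_cuda` flag is False.

    Walks nested dicts, lists and tuples. Anything exposing an `is_cuda`
    attribute is treated as a tensor; other leaves are ignored.
    """
    bad = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            bad.extend(find_non_cuda(value, "%s[%r]" % (path, key)))
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            bad.extend(find_non_cuda(value, "%s[%d]" % (path, index)))
    elif hasattr(obj, "is_cuda") and not obj.is_cuda:
        # This leaf lives on the CPU; a leaf like this would trip the
        # `is_cuda` assertion shown in the traceback.
        bad.append(path)
    return bad
```

Calling it on the output of the unwrapped single-GPU model, e.g. `print(find_non_cuda(output))`, should print an empty list if every returned tensor is on the GPU; any path it prints is a candidate for the tensor that `Gather` rejects.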
By the way, training on a single GPU works fine; the error only occurs when I train on multiple GPUs with nn.DataParallel(). Could you give me some advice on multi-GPU training, or tell me what I did wrong?
Thanks in advance for your reply ^_^.