AliaksandrSiarohin / monkey-net

Animating Arbitrary Objects via Deep Motion Transfer

Error running code on 2 GPUs #14

Closed TsainGra closed 4 years ago

TsainGra commented 4 years ago

Use predefined train-test split.
Transfer...
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

0it [00:00, ?it/s]
Traceback (most recent call last):
  File "run.py", line 80, in &lt;module&gt;
    transfer(config, generator, kp_detector, opt.checkpoint, log_dir, dataset)
  File "/home/kushagra/monkey-net/transfer.py", line 112, in transfer
    out = transfer_one(generator, kp_detector, source_image, driving_video, transfer_params)
  File "/home/kushagra/monkey-net/transfer.py", line 68, in transfer_one
    kp_driving = cat_dict([kp_detector(driving_video[:, :, i:(i + 1)]) for i in range(d)], dim=1)
  File "/home/kushagra/monkey-net/transfer.py", line 68, in &lt;listcomp&gt;
    kp_driving = cat_dict([kp_detector(driving_video[:, :, i:(i + 1)]) for i in range(d)], dim=1)
  File "/home/kushagra/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kushagra/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 122, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/kushagra/monkey-net/sync_batchnorm/replicate.py", line 65, in replicate
    modules = super(DataParallelWithCallback, self).replicate(module, device_ids)
  File "/home/kushagra/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 127, in replicate
    return replicate(module, device_ids)
  File "/home/kushagra/.local/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/home/kushagra/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/kushagra/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

I understand that I need to put all the input tensors on device 0, but I'm not sure exactly how to do that. I tried some of the approaches from https://discuss.pytorch.org/t/how-to-solve-the-problem-of-runtimeerror-all-tensors-must-be-on-devices-0/15198/5, but they did not work.

I also moved all the models to device 1 (e.g. generator.to(opt.device_ids[1])), hoping that would free up space for the tensors on device 0 (otherwise I get a CUDA out-of-memory error).
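For context on the error: PyTorch's nn.DataParallel scatters its inputs from device_ids[0], so both the wrapped module's parameters and the input tensors must start on that first device; moving the model to device 1 while inputs land elsewhere triggers exactly this RuntimeError. A minimal sketch of the expected placement (the toy model and sizes here are hypothetical, not monkey-net's):

```python
import torch
import torch.nn as nn

device_ids = [0, 1]
model = nn.Linear(8, 4)      # stand-in for generator / kp_detector
x = torch.randn(2, 8)        # stand-in for the driving-video batch

if torch.cuda.is_available() and torch.cuda.device_count() >= len(device_ids):
    device = torch.device("cuda", device_ids[0])
    # Parameters must live on device_ids[0] before wrapping...
    model = nn.DataParallel(model.to(device), device_ids=device_ids)
    # ...and inputs must be on device_ids[0] too, or broadcast fails.
    x = x.to(device)

out = model(x)  # falls back to a plain CPU forward pass without GPUs
```

This doesn't address the out-of-memory issue, only the device-placement constraint that produces the "all tensors must be on devices[0]" message.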

I am running the model on 2 RTX 2080 GPUs with CUDA 10.

AliaksandrSiarohin commented 4 years ago

Only training on 2 GPUs is supported. If you want to speed up transfer, divide your pairs into 2 CSV files and run 2 separate processes.