I get the error below when trying to train on ZJU-MoCap subject 387. I run the following command:

python train.py --cfg configs/human_nerf/zju_mocap/387/single_gpu.yaml

and this happens:
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DESKTOP-KQ38CGS:1582:1582 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
DESKTOP-KQ38CGS:1582:1705 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO graph/topo.cc:523 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO init.cc:581 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO init.cc:840 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO graph/topo.cc:523 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO init.cc:581 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO init.cc:840 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO init.cc:906 -> 2
Traceback (most recent call last):
  File "train.py", line 34, in <module>
    main()
  File "train.py", line 28, in main
    train_dataloader=train_loader)
  File "core/train/trainers/human_nerf/trainer.py", line 151, in train
    div_indices=data['patch_div_indices'])
  File "core/train/trainers/human_nerf/trainer.py", line 109, in get_loss
    targets)
  File "core/train/trainers/human_nerf/trainer.py", line 93, in get_img_rebuild_loss
    scale_for_lpips(target.permute(0, 3, 1, 2)))
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 91, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, tensors)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 58, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
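For anyone debugging the same failure: the first NCCL warning complains that it "Could not find real path" of a sysfs entry, and that path can be probed directly with the standard library. This is a hypothetical diagnostic sketch, not code from the repo; the path is copied from the log above, and an unresolvable path here would suggest the PCI topology is simply not exposed to the OS (which is common under WSL2).

```python
import os

# Path copied verbatim from the NCCL WARN line in the log above.
path = "/sys/class/pci_bus/0000:01/../../0000:01:00.0"

# realpath() normalizes the "../.." components; exists() tells us whether
# the kernel actually exposes this PCI topology entry at all.
print("resolves to:", os.path.realpath(path))
print("exists:", os.path.exists(path))
```

If this prints `exists: False`, NCCL's topology discovery cannot work on this machine, independent of the training code.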