chungyiweng / humannerf

HumanNeRF turns a monocular video of moving people into a 360 free-viewpoint video.
MIT License

NCCL Error 2: unhandled system error #87

willyawan16 closed this issue 10 months ago

willyawan16 commented 10 months ago

I get this error when trying to train on zju_mocap subject 387. I ran the following command:

    python train.py --cfg configs/human_nerf/zju_mocap/387/single_gpu.yaml

and got this output:

DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

DESKTOP-KQ38CGS:1582:1582 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2

DESKTOP-KQ38CGS:1582:1705 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO graph/topo.cc:523 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO init.cc:581 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO init.cc:840 -> 2

DESKTOP-KQ38CGS:1582:1706 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/../../0000:01:00.0
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO graph/topo.cc:523 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO init.cc:581 -> 2
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO init.cc:840 -> 2
DESKTOP-KQ38CGS:1582:1705 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
DESKTOP-KQ38CGS:1582:1706 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
DESKTOP-KQ38CGS:1582:1582 [0] NCCL INFO init.cc:906 -> 2
Traceback (most recent call last):
  File "train.py", line 34, in <module>
    main()
  File "train.py", line 28, in main
    train_dataloader=train_loader)
  File "core/train/trainers/human_nerf/trainer.py", line 151, in train
    div_indices=data['patch_div_indices'])
  File "core/train/trainers/human_nerf/trainer.py", line 109, in get_loss
    targets)
  File "core/train/trainers/human_nerf/trainer.py", line 93, in get_img_rebuild_loss
    scale_for_lpips(target.permute(0, 3, 1, 2)))
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 91, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, tensors)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/pc21/anaconda3/envs/humannerf2/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 58, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
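
The log shows NCCL initializing for two device indices ([0] and [1]), and the traceback fails inside torch.nn.DataParallel while it broadcasts parameters to the replicas. One way to narrow this down, assuming the machine exposes more than one GPU and that limiting the run to a single device is acceptable, is to launch the same command with CUDA_VISIBLE_DEVICES set so that PyTorch sees only one GPU. The sketch below builds that environment and command. The helper names single_gpu_env, train_command, and launch are hypothetical and not part of the HumanNeRF repository. This is a diagnostic idea, not a confirmed fix for this issue.

```python
import os
import subprocess
import sys

# Config path taken from the command in the report above.
CFG = "configs/human_nerf/zju_mocap/387/single_gpu.yaml"


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    Hypothetical helper: copies the given (or current) environment and sets
    CUDA_VISIBLE_DEVICES, leaving the original mapping untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def train_command(cfg_path):
    """Build the training command from the report as an argument list."""
    return [sys.executable, "train.py", "--cfg", cfg_path]


def launch(cfg_path=CFG, gpu_index=0):
    """Run train.py with a single visible GPU (call from the repository root)."""
    return subprocess.run(
        train_command(cfg_path),
        env=single_gpu_env(gpu_index),
        check=True,
    )
```

Calling launch() from the repository root runs the original command with only GPU 0 visible. With a single visible device, DataParallel should not need to replicate the module across GPUs, so the NCCL broadcast path shown in the traceback would not be exercised. If the error disappears, that points to multi-GPU communication on this machine rather than to the training code itself.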