The NCCL error

mapengsen commented 1 year ago

$ python train_ddgan.py --dataset cifar10 --exp ddgan_cifar10_exp1 --num_channels 3 --num_channels_dae 128 --num_timesteps 4 --num_res_blocks 2 --batch_size 64 --num_epoch 1800 --ngf 64 --nz 100 --z_emb_dim 256 --n_mlp 4 --embedding_type positional --use_ema --ema_decay 0.9999 --r1_gamma 0.02 --lr_d 1.25e-4 --lr_g 1.6e-4 --lazy_reg 15 --num_process_per_node 1 --ch_mult 1 2 2 2 --save_content
starting in debug mode
Files already downloaded and verified
Traceback (most recent call last):
  File "train_ddgan.py", line 564, in <module>
    init_processes(0, size, train, args)
  File "train_ddgan.py", line 470, in init_processes
    fn(rank, gpu, args)
  File "train_ddgan.py", line 265, in train
    broadcast_params(netG.parameters())
  File "train_ddgan.py", line 36, in broadcast_params
    dist.broadcast(param.data, src=0)
  File "/home/mapengsen/anaconda3/envs/37/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1039, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

When I run training on CIFAR-10 with the command above, I get this NCCL error. How can I solve it, please?

tonia86 commented 1 year ago

I have the same problem, did you solve it?

pzq-xjtu commented 1 year ago

@mapengsen @tonia86 Same problem, did you solve it? I need your help!

NVlabs / denoising-diffusion-gan

The NCCL error #28