pytorch Multi-GPU program hangs with unblance threads?

Hi, I was trying to use DistributeDataParallel for training. It all works well on single GPU, but when I try to use 2 GPUs, it hangs, and the nvidia-smi shows in GPU1 there are 2 threads, in GPU2, there is only 1 thread.

|    1    136372      C   ...uan.zhao/anaconda3/envs/py36/bin/python  3390MiB |
|    1    136373      C   ...uan.zhao/anaconda3/envs/py36/bin/python  3382MiB |
|    2    136373      C   ...uan.zhao/anaconda3/envs/py36/bin/python  1111MiB |

The detail is as follow, Any Thing I can do to deal with it?

(py36) [zhiyuan.zhao@cloud-tts parrotron]$ CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node=2 train_vc2_wavernn.py --gpu_name=1,2
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
log_dir: output_models/logs/20200903-143551_
log_dir: output_models/logs/20200903-143551_
GPU Memory Track | 03-Sep-20-14:35:51 | Total Used Memory:1470.7 Mb

FP16 Run: True
Dynamic Loss Scaling: True
cuDNN Enabled: True
cuDNN Benchmark: False
Distributed Run: True
NCCL avaliable: True
Using Device: cuda:0
GPU Memory Track | 03-Sep-20-14:35:51 | Total Used Memory:1470.7 Mb

FP16 Run: True
Dynamic Loss Scaling: True
cuDNN Enabled: True
cuDNN Benchmark: False
Distributed Run: True
NCCL avaliable: True
Using Device: cuda:0
calculating global mean...
calculating global mean...
Saved global_mean_dict, 2147 speakers
Initializing Distributed
Saved global_mean_dict, 2147 speakers
Initializing Distributed
Done initializing distributed.
Done initializing distributed.
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
cloud-tts:136372:136372 [0] NCCL INFO Bootstrap : Using [0]enp97s0f0:10.10.1.116<0>
cloud-tts:136372:136372 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

cloud-tts:136372:136372 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cloud-tts:136372:136372 [0] NCCL INFO NET/Socket : Using [0]enp97s0f0:10.10.1.116<0>
NCCL version 2.4.8+cuda10.0
cloud-tts:136372:136392 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,f0003fff
cloud-tts:136373:136373 [0] NCCL INFO Bootstrap : Using [0]enp97s0f0:10.10.1.116<0>
cloud-tts:136373:136373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

cloud-tts:136373:136373 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
cloud-tts:136373:136373 [0] NCCL INFO NET/Socket : Using [0]enp97s0f0:10.10.1.116<0>
cloud-tts:136373:136395 [0] NCCL INFO Setting affinity for GPU 0 to 03ff,f0003fff
cloud-tts:136372:136392 [0] NCCL INFO Channel 00 :    0   1
cloud-tts:136373:136395 [0] NCCL INFO Ring 00 : 1[1] -> 0[1] via P2P/IPC
cloud-tts:136372:136392 [0] NCCL INFO Ring 00 : 0[1] -> 1[1] via P2P/IPC
cloud-tts:136372:136392 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
cloud-tts:136373:136395 [0] NCCL INFO comm 0x7ff4500016b0 rank 1 nranks 2 cudaDev 0 nvmlDev 1 - Init COMPLETE
cloud-tts:136372:136392 [0] NCCL INFO comm 0x7fbe040016b0 rank 0 nranks 2 cudaDev 0 nvmlDev 1 - Init COMPLETE
cloud-tts:136372:136372 [0] NCCL INFO Launch mode Parallel
DataSet Size: 9888
Loading checkpoint 'output_models/models/20200818-202109_3TTS_biaobei_ST_1e-4/checkpoint_newest'
DataSet Size: 9888
Loading checkpoint 'output_models/models/20200818-202109_3TTS_biaobei_ST_1e-4/checkpoint_newest'
Loaded checkpoint 'output_models/models/20200818-202109_3TTS_biaobei_ST_1e-4/checkpoint_newest' from iteration 2035501
Epoch: 27140
Loaded checkpoint 'output_models/models/20200818-202109_3TTS_biaobei_ST_1e-4/checkpoint_newest' from iteration 2035501
Epoch: 27140
/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:100: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:100: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
^CTraceback (most recent call last):
  File "/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 239, in main
    process.wait()
  File "/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/home/zhiyuan.zhao/anaconda3/envs/py36/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

NVIDIA / nccl

pytorch Multi-GPU program hangs with unblance threads? #383