Open Approximetal opened 4 years ago
It does indeed seem wrong for process 136373 to use both GPUs 1 and 2; it suggests that, for some reason, some parts of PyTorch are using GPU 1 while other parts are using GPU 2. Also, I'm not sure how `CUDA_VISIBLE_DEVICES=1,2` and `--gpu_names=1,2` interact with each other. That said, this looks like a question for the PyTorch project, not NCCL.
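As a rough illustration of that interaction: once `CUDA_VISIBLE_DEVICES=1,2` is set, CUDA re-indexes the visible GPUs from 0, so a `--gpu_names=1,2` flag interpreted against PyTorch device indices would point past the remapped devices. A small probe sketch (the environment-variable handling here is an assumption, not the poster's setup):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"  # must be set before CUDA is initialized

import torch

# Only the masked-in GPUs are visible, re-indexed from 0.
print(torch.cuda.device_count())  # expected: 2
for i in range(torch.cuda.device_count()):
    # cuda:0 is physical GPU 1, cuda:1 is physical GPU 2
    print(i, torch.cuda.get_device_name(i))
```

Under that assumption, `--gpu_names=0,1` would be the consistent choice once the `CUDA_VISIBLE_DEVICES` mask is already applied.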
Hi, I was trying to use `DistributedDataParallel` for training. It all works well on a single GPU, but when I try to use 2 GPUs, it hangs, and `nvidia-smi` shows 2 processes on GPU 1 but only 1 process on GPU 2. The details are as follows. Is there anything I can do to deal with it?
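For reference, a minimal sketch of what a working two-GPU `DistributedDataParallel` setup typically looks like; the model and the `torchrun` launcher here are placeholder assumptions, not the poster's actual code:

```python
# Minimal single-node DDP sketch: one process per GPU, launched with torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")     # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)           # pin this process to a single GPU

    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    x = torch.randn(4, 10, device=f"cuda:{local_rank}")
    ddp_model(x).sum().backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, e.g., `CUDA_VISIBLE_DEVICES=1,2 torchrun --nproc_per_node=2 train.py`. A missing `torch.cuda.set_device(local_rank)` (or `device_ids`) is a common way for one rank to end up touching two GPUs, which matches the `nvidia-smi` output described above and can make NCCL collectives hang.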