Sara-Ahmed / SiT

Self-supervised vIsion Transformer (SiT)

single node multi-GPU hangs #7

Open memphizz opened 3 years ago

memphizz commented 3 years ago

Hi, I am running SSL training on a single node with two GPUs. It runs only when --nproc_per_node=1. When I set --nproc_per_node=2, it hangs right after the distributed init of the second GPU.

init_distributed_mode ....
| distributed init (rank 0): env://
| distributed init (rank 1): env://

Setting dist_url to env://127.0.0.1 didn't fix it. I also tried --world_size=2.

Sara-Ahmed commented 3 years ago

Which cudatoolkit version are you using? That might be the cause. What type of graphics card do you have?

memphizz commented 3 years ago

I am using RTX A6000 cards, which are similar to the new A100. The CUDA version is 11.0.221.

Sara-Ahmed commented 3 years ago

That is interesting; I have no issues at all running it on several GPUs. However, I remember getting the same behavior when I was using an older version of cudatoolkit with an RTX 3090.

chenkq7 commented 2 years ago

I hit the same issue: the node hangs whenever it reaches torch.distributed.barrier(). I solved the problem by setting NCCL_P2P_LEVEL=NVL, following the thread "DDP gets stuck on A40 GPUs #73206".
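
For anyone who would rather apply this from inside the training script than from the shell, here is a minimal sketch. It assumes the variable is set before the process group (and so the first NCCL communicator) is created. The helper name `apply_nccl_p2p_workaround` is hypothetical and is not part of SiT or PyTorch. As far as I understand, `NVL` tells NCCL to use peer-to-peer transfers only between GPUs connected via NVLink.

```python
import os


def apply_nccl_p2p_workaround(env=None, level="NVL"):
    """Set NCCL_P2P_LEVEL unless a value is already present.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Returns the value that ends up in effect.
    """
    env = os.environ if env is None else env
    # setdefault keeps any value the user exported in the shell.
    env.setdefault("NCCL_P2P_LEVEL", level)
    return env["NCCL_P2P_LEVEL"]


if __name__ == "__main__":
    # Assumption: this runs at the top of the training entry point,
    # before torch.distributed.init_process_group(...) is called.
    print("NCCL_P2P_LEVEL =", apply_nccl_p2p_workaround())
```

The shell equivalent is to put `NCCL_P2P_LEVEL=NVL` in front of the launch command. Because the helper uses `setdefault`, a value already exported in the shell takes precedence.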