facebookresearch / mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

Running multi-gpu on one node #120

Open kaushikb258 opened 1 year ago

kaushikb258 commented 1 year ago

I am using 2 GPUs on one node with the command below, but the code gets stuck inside the init_distributed_mode() function in util/misc.py. Specifically, it hangs at the torch.distributed.barrier() call with no further progress. I am using the distributed sampler. How do I fix this issue?

CUDA_VISIBLE_DEVICES=2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=2 main_pretrain.py \
    --batch_size 64 \
    --model mae_vit_base_patch16 \
    --norm_pix_loss \
    --mask_ratio 0.75 \
    --epochs 200 \
    --warmup_epochs 0 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --data_path ${IMAGENET_DIR}
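[Editor's note, not from the thread: a hang at torch.distributed.barrier() with the NCCL backend is often caused by broken GPU peer-to-peer communication. Before changing backends, NCCL's standard diagnostic environment variables can help narrow it down; a sketch:]

```shell
# Diagnostic settings (assumptions for debugging, not part of the MAE repo):
export NCCL_DEBUG=INFO      # print NCCL init and transport selection logs
export NCCL_P2P_DISABLE=1   # rule out a broken GPU peer-to-peer path
```

If the run proceeds with NCCL_P2P_DISABLE=1, the hang points at the P2P transport rather than the training code.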

kaushikb258 commented 1 year ago

Actually, I was able to solve the problem: I changed the backend in util/misc.py from nccl to gloo and it worked. Maybe someone will find this helpful.

args.dist_backend = 'gloo'  # was 'nccl'
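[Editor's note: switching to gloo unblocks the run but gives up NCCL's GPU-optimized collectives, so it is best treated as a fallback. A minimal sketch of that idea, assuming a hypothetical helper name pick_dist_backend not present in the repo:]

```python
def pick_dist_backend(prefer_nccl=True):
    """Return a torch.distributed backend name.

    'nccl' requires CUDA and working GPU-to-GPU communication; 'gloo'
    runs over CPU sockets and is the usual fallback when NCCL hangs at
    barrier(), as reported in this issue.
    """
    try:
        import torch
        cuda_ok = torch.cuda.is_available()
    except ImportError:  # torch not installed; CPU-only fallback
        cuda_ok = False
    return "nccl" if (prefer_nccl and cuda_ok) else "gloo"
```

The chosen name would then be passed as the backend argument to torch.distributed.init_process_group().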