GravitySaika opened 5 months ago
Thank you for sharing the code! This is my script run.sh:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=4 --master_port 12345 main_pretrain.py \
    --num_workers 10 \
    --accum_iter 2 \
    --batch_size 128 \
    --model mrm \
    --norm_pix_loss \
    --mask_ratio 0.75 \
    --epochs 200 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --resume ./MRM/mae_pretrain_vit_base.pth \
    --data_path ./MRM \
    --output_dir ./MRM
```
When I run distributed training, the program always gets stuck at this point and never continues. But if I restrict it to a single GPU, it trains, though very slowly.
I'm wondering how to deal with this.