Aurora-slz opened this issue 2 years ago
I use one node with four GPUs (V100, 32 GB) for pretraining, but the parallel training behaves oddly: all four processes run on a single GPU (device:0). Why does this happen? Thanks for any help!
I launch the pretraining with:
python submitit_pretrain.py \
    --job_dir mae/slz_job/tmp \
    --nodes 1 \
    --use_volta32 \
    --batch_size 64 \
    --model mae_vit_large_patch16 \
    --norm_pix_loss \
    --mask_ratio 0.75 \
    --epochs 800 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --data_path imagenet/tiny/tiny-imagenet-200
This is the job_env information:
job_env: JobEnvironment(job_id=1582, hostname=slz-z5dbj-52465-worker-0, local_rank=0(4), node=0(1), global_rank=0(4))
You might want to take a look at issue #48
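For what it's worth, the usual cause of this symptom is that each worker process never pins itself to its own GPU, so every rank allocates tensors on the default device `cuda:0`. Below is a minimal, hypothetical sketch of the standard PyTorch fix (the `setup_device` helper is my own illustration, not code from this repo): each process calls `torch.cuda.set_device` with its local rank before building the model.

```python
import torch

def setup_device(local_rank: int) -> torch.device:
    """Pin the current process to its own GPU.

    Without torch.cuda.set_device, every rank defaults to cuda:0,
    which produces exactly the behavior described above: all
    processes piled onto one GPU.
    """
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)  # one GPU per local rank
        return torch.device("cuda", local_rank)
    return torch.device("cpu")  # fallback for CPU-only debugging

# Example: with local_rank=0(4) as in the job_env above, rank 0
# would be pinned to cuda:0, rank 1 to cuda:1, and so on.
device = setup_device(0)
print(device)
```

It is also worth checking that `CUDA_VISIBLE_DEVICES` is not restricting each process to a single device in an unexpected way.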
Hello, I'm having the same problem. Have you solved it, or could you give me some suggestions? Thanks a lot!