facebookresearch / mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

Fake multi-GPU pretraining #111

Open Aurora-slz opened 2 years ago

Aurora-slz commented 2 years ago

I use one node with four GPUs (V100, 32 GB) for pretraining, but parallel training behaves oddly: all four processes run on a single GPU (device 0). Why does this happen? Thanks for any help!

I use this command to launch the pretraining:

python submitit_pretrain.py \
    --job_dir mae/slz_job/tmp \
    --nodes 1 \
    --use_volta32 \
    --batch_size 64 \
    --model mae_vit_large_patch16 \
    --norm_pix_loss \
    --mask_ratio 0.75 \
    --epochs 800 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --data_path imagenet/tiny/tiny-imagenet-200

This is the job_env information:

job_env: JobEnvironment(job_id=1582, hostname=slz-z5dbj-52465-worker-0, local_rank=0(4), node=0(1), global_rank=0(4))
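For context, here is a minimal sketch of how each process is normally bound to its own GPU from the submitit job environment; the helper name and the args.gpu / args.rank fields are illustrative assumptions rather than the repo's exact code, but the key step is calling torch.cuda.set_device with the local rank.

import torch
import submitit

def bind_process_to_gpu(args):
    # Each process launched by submitit can read its own rank information.
    job_env = submitit.JobEnvironment()
    args.gpu = job_env.local_rank       # expected to be 0..3 on a 4-GPU node
    args.rank = job_env.global_rank

    # Without this call every process uses the default device (cuda:0),
    # which matches the symptom of all four workers piling onto one GPU.
    torch.cuda.set_device(args.gpu)
    print(f"rank {args.rank}: local_rank={job_env.local_rank}, "
          f"current_device={torch.cuda.current_device()}")

If torch.cuda.current_device() reports 0 for every rank while local_rank differs, the device binding is not taking effect; if local_rank is 0 for every process, the launcher is only exposing one task per node. The job_env line above only shows rank 0's environment (local_rank=0(4)), so the logs of the other ranks are worth checking as well.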

johnsk95 commented 2 years ago

You might want to take a look at issue #48.

Mark-Dou commented 1 year ago

Hello, I have the same problem. Have you solved it, or could you give me some suggestions? Thanks a lot!