facebookresearch / mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

Replication of ImageNet Scratch ViT-Base and ViT-Large #41

Open srikar2097 opened 2 years ago

srikar2097 commented 2 years ago

Hi @KaimingHe, thank you for your wonderful work and for open-sourcing it.

I have been banging my head against this for a month now, and any help would be deeply appreciated! I am trying to replicate the IN-1k from-scratch supervised training on ViT-Base and ViT-Large, and so far I have failed. I get around 81.9 for ViT-Base (82.3 in the paper) and 82.0 for ViT-Large (82.6 in the paper) using the recipes below, both taken from here.

I also saw your reply in #28 and the FINETUNE.md page. I noticed that the supervised from-scratch recipe in the paper is different from the fine-tune one, hence the two different recipes I tried.

Setup: PyTorch 1.10.0, CUDA 11.3, timm 0.5.0, single host with 8 GPUs. Base lr = 1e-4 with an effective batch size of 4096, so effective lr = 0.0016 (the arithmetic is spelled out right after the recipe). With the recipe mentioned in the paper (ViT-Base 82.3), I get 81.9:

            --grad_accum=4 \
            --batch-size=128 \
            --lr=0.0016 \
            --weight-decay=0.30 \
            --opt="AdamW" \
            --opt-betas 0.9 0.95 \
            --sched="cosine" \
            --warmup-epochs=20 \
            --epochs=300 \
            --aa="rand-m9-mstd0.5" \
            --smoothing=0.1 \
            --mixup=0.8 \
            --cutmix=1.0 \
            --drop-path=0.1 \
            --model-ema \
            --model-ema-decay=0.9999 \
            --model="vit_base_patch16_224" \

Following your advice in #28 and the FINETUNE.md page, I was inspired to try this different recipe, again on a single host with 8 GPUs, and I get 81.2 accuracy. Base lr = 1e-3 with batch size 1024 (as mentioned in finetune.py), so effective lr = 0.004. Note that the paper does not mention layer decay or a different --aa strategy (a sketch of what --layer_decay does follows the recipe):

            --grad_accum=4 \
            --batch-size=32 \
            --lr=0.004 \
            --weight-decay=0.05 \
            --layer_decay=0.65 \
            --opt="AdamW" \
            --opt-betas 0.9 0.95 \
            --sched="cosine" \
            --warmup-epochs=20 \
            --epochs=300 \
            --aa="rand-m9-mstd0.5-inc1" \
            --smoothing=0.1 \
            --mixup=0.8 \
            --cutmix=1.0 \
            --drop-path=0.1 \
            --model-ema \
            --model-ema-decay=0.9999 \
            --model="vit_base_patch16_224" \
            --reprob=0.25 \
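
To be explicit about what --layer_decay=0.65 does (a rough sketch of layer-wise lr decay as I understand it from the fine-tune setup, not the exact repository code): each layer's lr is scaled geometrically, smallest at the patch embedding and 1.0 at the head.

            # Layer-wise lr decay sketch for a 12-block ViT-Base (illustrative only)
            layer_decay = 0.65   # --layer_decay
            depth = 12           # transformer blocks in ViT-Base
            # layer_id 0 = patch embed, 1..depth = blocks, depth + 1 = head
            lr_scales = [layer_decay ** (depth + 1 - layer_id) for layer_id in range(depth + 2)]
            # lr_scales[0]  ~= 0.0037  (patch embed)
            # lr_scales[12] == 0.65    (last block)
            # lr_scales[13] == 1.0     (classification head)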

What is wrong with these recipes, and why don't they replicate the from-scratch performance reported in the paper?

KeyKy commented 2 years ago

Could you share your training log? I am trying to replicate the IN-1k scratch training too.

Capricious-Liu commented 2 years ago

Hi, I would appreciate it if you could share your training log!

cxxgtxy commented 2 years ago

[quotes @srikar2097's original post and recipes above in full]

Have you reproduced the reported result?

tarun005 commented 1 year ago

@srikar2097 Can you share the ViT-S/L models you managed to train from scratch on IN-1k?