srikar2097 opened this issue 2 years ago
Could you share your training log? I am trying to replicate the IN-1k from-scratch training too.
Hi, it would be appreciated if you could share your training log!
Hi @KaimingHe, thank you for your wonderful work and for open-sourcing it.
I have been banging my head against this for a month now and any help would be deeply appreciated! I am trying to replicate the IN-1k from-scratch training on ViT-Base and ViT-Large. So far I have failed: I get around 81.9 for ViT-Base (82.3 in the paper) and 82.0 for ViT-Large (82.6 in the paper) using the recipes mentioned below, both taken from here.
I also saw your reply in #28 and the FINETUNE.md page. I noticed the supervised recipe in the paper is different from the fine-tuning one, hence the two different recipes I tried.
PyTorch version: 1.10.0, CUDA: 11.3, timm version: 0.5.0. The setup is a single host with 8 GPUs. Base lr = 1e-4, so with effective batch size 4096 => effective lr = 0.0016. With the recipe mentioned in the paper (ViT-Base 82.3 reported), I get 81.9:
```
--grad_accum=4 \
--batch-size=128 \
--lr=0.0016 \
--weight-decay=0.30 \
--opt="AdamW" \
--opt-betas 0.9 0.95 \
--sched="cosine" \
--warmup-epochs=20 \
--epochs=300 \
--aa="rand-m9-mstd0.5" \
--smoothing=0.1 \
--mixup=0.8 \
--cutmix=1.0 \
--drop-path=0.1 \
--model-ema \
--model-ema-decay=0.9999 \
--model="vit_base_patch16_224" \
```
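For reference, a quick sanity check of the numbers above. This is a minimal sketch assuming the linear lr scaling convention (base lr scaled by effective batch size / 256); the variable names are just for illustration, and the values are copied from the flags in the recipe above:

```python
# Sanity check of effective batch size and linearly scaled lr for recipe 1.
per_gpu_batch = 128      # --batch-size
num_gpus = 8             # single host with 8 GPUs
grad_accum = 4           # --grad_accum
base_lr = 1e-4           # base lr quoted above

eff_batch = per_gpu_batch * num_gpus * grad_accum   # 128 * 8 * 4 = 4096
eff_lr = base_lr * eff_batch / 256                  # linear scaling rule => 0.0016

print(eff_batch, eff_lr)  # 4096 0.0016
```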
Following your advice in #28 and the FINETUNE.md page, I was inspired to try this different recipe. I get 81.2 accuracy with the recipe below. Here too the setup is a single host with 8 GPUs. Base lr = 1e-3, so for batch size 1024 (as mentioned in finetune.py) => effective lr = 0.004. Also, the paper does not mention `layer-decay` or a different `--aa` strategy.

```
--grad_accum=4 \
--batch-size=32 \
--lr=0.004 \
--weight-decay=0.05 \
--layer_decay=0.65 \
--opt="AdamW" \
--opt-betas 0.9 0.95 \
--sched="cosine" \
--warmup-epochs=20 \
--epochs=300 \
--aa="rand-m9-mstd0.5-inc1" \
--smoothing=0.1 \
--mixup=0.8 \
--cutmix=1.0 \
--drop-path=0.1 \
--model-ema \
--model-ema-decay=0.9999 \
--model="vit_base_patch16_224" \
--reprob=0.25 \
```
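For readers unfamiliar with `--layer_decay=0.65`: below is a minimal, illustrative sketch of layer-wise lr decay as it is commonly applied to ViTs (this is a hypothetical helper, not the exact code from this repo or finetune.py). Each parameter group gets its lr multiplied by `layer_decay ** (num_layers + 1 - layer_id)`, so earlier layers learn with a smaller lr while the head keeps the full lr:

```python
# Illustrative sketch of layer-wise lr decay (hypothetical helper, not the repo's code).
def layer_lr_scales(num_layers: int = 12, layer_decay: float = 0.65):
    # group 0: patch/pos embeddings, groups 1..num_layers: transformer blocks,
    # group num_layers + 1: classification head (scale 1.0)
    return [layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]

scales = layer_lr_scales()
print(scales[0], scales[-1])  # smallest multiplier for the embeddings, 1.0 for the head
```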
What is wrong with these recipes, and why don't they replicate the from-scratch performance reported in the paper?
Have you reproduced the reported result?
@srikar2097 Can you share the ViT-S/L models you managed to train from scratch on IN-1k?