Closed Jaggie-Yan closed 4 months ago
Please use the setting mentioned in our paper (fine-tune for 20 epochs, with 2 epochs of warmup). Also, we have released our training log, so you can align the learning rate with ours (the peak learning rate is 1e-5 for batch size 32).
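For anyone aligning their run with these numbers, the schedule implied above (2 warmup epochs out of 20, peak LR 1e-5) can be sketched as below. The cosine decay shape and the `min_lr` value are assumptions on my part, not confirmed by the authors; check the released training log for the actual curve.

```python
import math

def lr_at_epoch(epoch, total_epochs=20, warmup_epochs=2,
                peak_lr=1e-5, min_lr=1e-6):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_epochs=2, total_epochs=20, peak_lr=1e-5 come from the
    paper's setting quoted above; the decay shape and min_lr are
    assumptions for illustration.
    """
    if epoch < warmup_epochs:
        # Linear ramp: epoch 0 -> peak_lr/2, epoch 1 -> peak_lr
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Comparing the value of `lr_at_epoch(e)` for each epoch against the LR column of the released log is a quick way to confirm the schedules match.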
Okay, thanks for your quick comment!
Thanks for sharing your code!
In your code, the number of epochs for fine-tuning is set to 50, with 5 epochs of warm-up, but in the paper they are 20 and 2, respectively. Is this a mistake?
In addition:
1) I loaded the pre-trained model 55M_kd.pth on 6 H100s (batch size 56 on each card).
2) I used caformer_b36_in21_ft1k.pth for knowledge distillation.
3) I trained for 20 epochs, including 2 epochs of warm-up (following the parameter settings in the original paper).
In the end I obtained only 79.2% accuracy, not 80%, and a gap of 0.8% seems a bit large. Is there any detail I have missed in this process? Any suggestions for reproduction would be appreciated. Thanks.
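One thing worth checking in this setup: the paper's peak LR of 1e-5 is stated for batch size 32, but 6 H100s at 56 per card give a much larger effective batch. If the code does not rescale the learning rate automatically, the common linear-scaling heuristic (an assumption here, not something the authors have confirmed for this repo) would suggest a different peak:

```python
def scaled_peak_lr(base_lr=1e-5, base_batch=32, gpus=6, per_gpu_batch=56):
    """Linear LR scaling heuristic: lr grows proportionally with batch size.

    base_lr/base_batch are the paper's reference values; gpus and
    per_gpu_batch describe the 6xH100 setup reported above. Whether
    the repo expects this rescaling is an assumption to verify.
    """
    effective_batch = gpus * per_gpu_batch  # 6 * 56 = 336
    return base_lr * effective_batch / base_batch, effective_batch
```

With these numbers the effective batch is 336 and the linearly scaled peak LR is 1.05e-4; mismatches of this size between the intended and actual LR could plausibly account for a fraction-of-a-percent accuracy gap.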