Closed Jaggie-Yan closed 4 months ago
Please use the setting mentioned in our paper (fine-tune for 20 epochs, with 2 epochs of warmup). Also, we have released our training log, so you can align the learning rate with ours (the peak learning rate is 1e-5 for batch size 32).
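For anyone aligning their run with these numbers, the schedule implied above (2 warmup epochs out of 20, peak LR 1e-5) can be sketched as below. The cosine decay shape and the `min_lr` value are assumptions on my part, not confirmed by the authors; check the released training log for the actual curve.

```python
import math

def lr_at_epoch(epoch, total_epochs=20, warmup_epochs=2,
                peak_lr=1e-5, min_lr=1e-6):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_epochs=2, total_epochs=20, peak_lr=1e-5 come from the
    paper's setting quoted above; the decay shape and min_lr are
    assumptions for illustration.
    """
    if epoch < warmup_epochs:
        # Linear ramp: epoch 0 -> peak_lr/2, epoch 1 -> peak_lr
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Comparing the value of `lr_at_epoch(e)` for each epoch against the LR column of the released log is a quick way to confirm the schedules match.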
Okay, thanks for your quick comment!
Thanks for sharing your code!
In your code, the number of epochs for fine-tuning is set to 50, with 5 epochs of warm-up, but in the paper they are 20 and 2, respectively. Is this a mistake?
In addition:
1) I loaded the pre-trained model 55M_kd.pth on 6 H100s (batch size 56 on each card).
2) I used caformer_b36_in21_ft1k.pth for knowledge distillation.
3) I trained for 20 epochs, including 2 epochs of warm-up (following the parameter settings in the original paper).
In the end I obtained only 79.2% accuracy, not 80%, and a gap of 0.8% seems a bit large. Is there any detail I have missed in this process? Any suggestions for reproduction would be appreciated. Thanks.
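One thing worth checking in this setup: the paper's peak LR of 1e-5 is stated for batch size 32, but 6 H100s at 56 per card give a much larger effective batch. If the code does not rescale the learning rate automatically, the common linear-scaling heuristic (an assumption here, not something the authors have confirmed for this repo) would suggest a different peak:

```python
def scaled_peak_lr(base_lr=1e-5, base_batch=32, gpus=6, per_gpu_batch=56):
    """Linear LR scaling heuristic: lr grows proportionally with batch size.

    base_lr/base_batch are the paper's reference values; gpus and
    per_gpu_batch describe the 6xH100 setup reported above. Whether
    the repo expects this rescaling is an assumption to verify.
    """
    effective_batch = gpus * per_gpu_batch  # 6 * 56 = 336
    return base_lr * effective_batch / base_batch, effective_batch
```

With these numbers the effective batch is 336 and the linearly scaled peak LR is 1.05e-4; mismatches of this size between the intended and actual LR could plausibly account for a fraction-of-a-percent accuracy gap.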