HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods
BSD 2-Clause "Simplified" License

Results on ImageNet #10

Closed xuguodong03 closed 4 years ago

xuguodong03 commented 4 years ago

Thanks for your great work.

When I conduct experiments on ImageNet, I use the same training hyper-parameters as in the PyTorch official examples. The initial learning rate is 0.1 and it is decayed at epochs 30 and 60. I find that in the first two stages, i.e. epochs 1-30 and 31-60, standard KD achieves higher accuracy than the student trained from scratch. But in the third stage (epochs 61-90), KD's accuracy is lower than that of the student trained from scratch. This is exactly the phenomenon shown in Figure 3 of this paper.
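
For reference, a minimal sketch of that schedule (the momentum and weight decay values are my assumptions, taken from the PyTorch ImageNet example defaults, and the model is a placeholder):

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the student network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# lr 0.1, decayed by 10x at epochs 30 and 60
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... forward/backward passes and optimizer.step() for one epoch ...
    scheduler.step()
```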

In your work, KD's top-1 accuracy is 0.9 points higher than the student trained from scratch. I wonder whether you used any special procedures, such as a training scheme or hyper-parameters that differ from those in the PyTorch official examples. It would be very helpful if you could provide your ImageNet code.

Thanks in advance!

HobbitLong commented 4 years ago

I have been struggling a bit to get KD to work as well on ImageNet with ResNet-18 as the student network.

I used `--distill kd -r 0.5 -a 0.9` and trained for 100 epochs in total, with the learning rate decayed at epochs 30, 60, and 90.
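
For concreteness, a minimal sketch of how `-r` and `-a` weight the cross-entropy and KD terms under this setup (the temperature `T=4` and the toy batch are assumptions for illustration, not values confirmed in this thread):

```python
import torch
import torch.nn.functional as F

def kd_loss(logits_s, logits_t, T=4.0):
    # KL divergence between softened student and teacher distributions,
    # scaled by T^2 as in Hinton et al.'s KD formulation
    p_s = F.log_softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t / T, dim=1)
    return F.kl_div(p_s, p_t, reduction='batchmean') * (T ** 2)

# toy batch: 8 samples, 1000 ImageNet classes
logits_s = torch.randn(8, 1000)          # student outputs
logits_t = torch.randn(8, 1000)          # teacher outputs
target = torch.randint(0, 1000, (8,))    # ground-truth labels

r, a = 0.5, 0.9  # -r weights the CE term, -a weights the KD (KL) term
loss = r * F.cross_entropy(logits_s, target) + a * kd_loss(logits_s, logits_t)
```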

See below for the training and testing curves, which I cropped from a previous manuscript where CRD was historically named CKD:

[imagenet: training and testing curves]