Closed · xuguodong03 closed this 4 years ago
I have also been struggling a bit to get KD to work as well on ImageNet with ResNet-18 as the student network.
I used `--distill kd -r 0.5 -a 0.9` and trained for 100 epochs in total, with the learning rate decayed at epochs 30, 60, and 90.
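For context, the `-r`/`-a` flags in that command weight the cross-entropy and distillation terms of the standard KD objective (Hinton et al.). A minimal sketch of how such a loss is typically combined is below; the function name is illustrative, and the temperature `T=4` is my assumption of the repo's default, not something stated in this thread:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, r=0.5, a=0.9, T=4.0):
    # -r weights the cross-entropy with the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # -a weights the KL divergence between temperature-softened
    # distributions, scaled by T^2 as in Hinton et al.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return r * ce + a * kd
```

When the student and teacher logits are identical, the KL term vanishes and only the weighted cross-entropy remains.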
See below for the training and testing curves, which I cropped from earlier manuscripts where CRD was historically named CKD:
Thanks for your great work.
When I run experiments on ImageNet, I use the training hyper-parameters from the official PyTorch examples: the initial learning rate is 0.1, and it is decayed at epochs 30 and 60. I find that in the first two stages (epochs 1-30 and 31-60), standard KD has higher accuracy than a student trained from scratch, but in the third stage (epochs 61-90), KD's accuracy drops below that of the student trained from scratch. This phenomenon matches Figure 3 of this paper exactly.
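For reference, the schedule I am using (the step decay from the PyTorch ImageNet example) can be sketched as follows; the function name and keyword defaults are illustrative:

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60), gamma=0.1):
    # Step decay as in the PyTorch ImageNet example: multiply the
    # learning rate by gamma each time a milestone epoch is passed.
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

This is equivalent to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[30, 60]` and `gamma=0.1`.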
In your work, KD's top-1 accuracy is 0.9 points higher than that of the student trained from scratch. I wonder whether you use any special training scheme or hyper-parameters that differ from the official PyTorch examples. It would be very helpful if you could provide your ImageNet code.
Thanks in advance!