HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods
BSD 2-Clause "Simplified" License
2.12k stars 391 forks

Selection of teacher #2

Closed yaxingwang closed 4 years ago

yaxingwang commented 4 years ago

Hi @HobbitLong, this is great work, I really like it. I have one question about the selection of the teacher model. In previous classification papers, researchers usually use architectures such as VGG16, VGG19, ResNet18, ResNet34, ResNet50, ResNet101, and so on. However, most of the teacher models you use are different. Can you explain how you selected the teacher models? Is your method sensitive to the architecture? Please forgive me if I missed something.

HobbitLong commented 4 years ago

Hi, @yaxingwang,

Thanks for your interest.

The short answer is that the student-teacher combinations in this paper were randomly picked. I did not tune the combinations so that our method would perform best.

Your question actually touches on my initial motivation for extending the previous benchmarks. Most previous methods do not evaluate performance across varied teacher-student combinations. For instance, Attention Transfer (AT) mainly considers WideResNet and ResNet, while the more recent Similarity-Preserving (SP) paper mainly considers WideResNet, MobileNet, and ShuffleNet. Additionally, most methods consider teacher and student models of the same architectural type.

So I tried to test those state-of-the-art methods with various randomly picked student-teacher combinations to see how the performance would change. I also divided these combinations into two groups: (1) teacher and student of the same architectural type, and (2) teacher and student of different types. As you can see from the two tables in the paper, some methods drop significantly when dealing with group (2); a sketch of how such a sweep could be scripted is below.
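For concreteness, here is a rough sketch of how one might sweep both groups of combinations, assuming the `train_student.py` interface from this repository's README (`--path_t`, `--distill`, `--model_s`, `-a`, `-b`, `--trial`). The pair lists are only illustrative, not the exact set of combinations reported in the paper.

```python
# Illustrative sketch: enumerate same-type vs. cross-type (teacher, student)
# pairs and print the corresponding training commands. The flags and checkpoint
# path pattern are assumptions based on the repository's README examples.
same_type_pairs = [("resnet110", "resnet20"), ("wrn_40_2", "wrn_16_2")]
cross_type_pairs = [("resnet32x4", "ShuffleV1"), ("vgg13", "MobileNetV2")]

for teacher, student in same_type_pairs + cross_type_pairs:
    cmd = (
        f"python train_student.py "
        f"--path_t ./save/models/{teacher}_vanilla/ckpt_epoch_240.pth "
        f"--distill crd --model_s {student} -a 0 -b 0.8 --trial 1"
    )
    print(cmd)
```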

Lastly, although our CRD is the best among the randomly picked combinations in our paper, it most likely would not be consistently the best across all possible combinations. In general, though, it should not be very sensitive to the choice of models, since our algorithm does not rely on any architecture-specific inductive bias.
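To illustrate why no architecture-specific inductive bias is needed, here is a minimal, simplified sketch of a contrastive distillation objective (not the repo's exact CRD implementation, which draws negatives from a memory buffer): student and teacher features of arbitrary dimensions are linearly projected into a shared embedding space and matched with an in-batch InfoNCE-style loss.

```python
# Minimal sketch only: the feature dimensions and embedding size are assumptions
# for illustration; any teacher-student pair can plug in its own dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbedHead(nn.Module):
    """Linear projection followed by L2 normalization onto a shared space."""
    def __init__(self, dim_in, dim_out=128):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=1)

def contrastive_distill_loss(f_s, f_t, embed_s, embed_t, temperature=0.1):
    """In-batch contrastive loss between student and teacher embeddings."""
    z_s = embed_s(f_s)                       # (B, d)
    z_t = embed_t(f_t)                       # (B, d)
    logits = z_s @ z_t.t() / temperature     # (B, B) similarity matrix
    labels = torch.arange(z_s.size(0), device=z_s.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Hypothetical feature dimensions for a cross-architecture pair.
f_s = torch.randn(16, 256)   # student penultimate features
f_t = torch.randn(16, 640)   # teacher penultimate features
loss = contrastive_distill_loss(f_s, f_t, EmbedHead(256), EmbedHead(640))
```

The only model-dependent pieces are the two projection input dimensions, which is why the objective applies equally to same-type and cross-type pairs.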

yaxingwang commented 4 years ago

Thank you for your reply, @HobbitLong. It is a good insight to compare randomly picked student-teacher pairs. CRD additionally mines structural knowledge, which is a great complement to KD.