HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods
BSD 2-Clause "Simplified" License

A question for Experimental result #6

Closed baek85 closed 4 years ago

baek85 commented 4 years ago

Thank you for sharing the benchmark.

In your experimental results, there are many teacher-student pairs. In particular, for KD (Distilling the Knowledge in a Neural Network), the optimal setting (e.g., the temperature) may differ for each pair. Does performance change much with the choice of temperature?

A similar issue may apply not only to KD but also to the other methods. What do you think about this?

HobbitLong commented 4 years ago

For all methods (including KD), I only tuned the hyper-parameters on one of the pairs. After that, I kept those parameters fixed and evaluated them on the other pairs.

T=4 is what I found optimal, and it is also consistent with previous works. I think you are right that the best temperature might differ across pairs. On the other hand, the point of this benchmark is to test the generalization ability of different methods, i.e., whether you can use the same hyper-parameters on different models and still get good performance.
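For reference, the temperature only enters through the softened KL term, so sweeping it is cheap. Below is a minimal sketch of the standard Hinton-style KD loss, assuming a PyTorch setup; it is not necessarily identical to this repo's implementation, and the sweep values are just an illustration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Hinton-style KD loss: KL divergence between the temperature-softened
    teacher and student distributions, scaled by T^2 so gradient magnitudes
    stay comparable across temperatures."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * (T ** 2)

# Hypothetical sensitivity check for a given teacher-student pair:
# for T in (1, 2, 4, 8, 16):
#     loss = kd_loss(student_logits, teacher_logits, T=T)
```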