Closed baek85 closed 4 years ago
For all methods (including KD), I only tuned hyper-parameters on one of the pairs. After that, I kept those parameters fixed and evaluated on the other pairs.
T=4 is what I found to be optimal, and it is also consistent with previous works. You are right that the optimal temperature might differ across pairs. On the other hand, the point of this benchmark is to measure the generalization ability of different methods, i.e., whether you can use the same hyper-parameters across different model pairs and still get good performance.
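For context, the temperature being discussed is the one in the KD soft-target loss from Hinton et al., where both teacher and student logits are softened by T before computing the KL divergence. A minimal sketch in plain Python (illustrative function names, not code from this repo):

```python
import math

def softmax(logits, T):
    # Temperature-scaled softmax: larger T produces a softer distribution.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Soft-target KD term: KL(teacher || student) on T-softened
    # distributions, scaled by T^2 (as in Hinton et al.) so gradient
    # magnitudes stay comparable when T changes.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl
```

Because T rescales the logit gap, how sensitive the loss is to T depends on the scale of the teacher's logits, which is one reason the optimal T could in principle vary between teacher/student pairs.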
Thank you for sharing the benchmark.
In your experimental results, there are many teacher-student pairs. In particular, for the KD method (Distilling the Knowledge in a Neural Network), the optimal setting (e.g., the temperature) may differ for each pair. Does performance change a lot with the temperature?
Other methods may have a similar hyper-parameter sensitivity problem as KD. What do you think about this?