NikAleksFed opened this issue 2 years ago
We only experimented with student models that have the same number of parameters as the teacher models. The method should also work for scenarios with different teacher and student architectures.
The baseline you mention directly uses the pre-trained teacher model for fine-tuning. We showed that fine-tuning the distilled student model works much better than this baseline.
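(Editor's note: for readers landing here, the setup discussed above is feature-map distillation, where a frozen teacher supplies target feature maps that the student regresses onto before the usual fine-tuning stage. Below is a minimal PyTorch sketch of that generic setup, not this repo's actual code; the `teacher`/`student` backbones, the `distill_step` helper, and the smooth-L1 objective are all illustrative assumptions.)

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, optimizer):
    """One distillation step: regress the student's feature map onto the teacher's."""
    teacher.eval()
    with torch.no_grad():
        t_feat = teacher(images)   # frozen teacher provides the target feature map
    s_feat = student(images)       # student predicts the same map
    loss = F.smooth_l1_loss(s_feat, t_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After this distillation stage, the student (rather than the teacher) is fine-tuned on the downstream task, which is the comparison the answer above refers to.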
@ancientmooner Hi, a small question regarding your answer: if the output feature map sizes of the teacher and the student are not the same, how can the feature maps be distilled in your method?
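(Editor's note: the thread doesn't show how this repo handles a size mismatch, so the following is only a sketch of one common workaround from the feature-distillation literature: bilinear interpolation to align spatial resolution, plus a learned 1x1 convolution to align channel counts, before computing the loss. `FeatureAligner` and its arguments are hypothetical names, not part of this codebase.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    """Projects a student feature map onto the teacher's shape before the loss."""
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # hypothetical 1x1 projection head, trained jointly with the student
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, s_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # match the teacher's spatial resolution first...
        s_feat = F.interpolate(s_feat, size=t_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        # ...then match its channel count, and compare
        return F.smooth_l1_loss(self.proj(s_feat), t_feat)
```

Since the projection only exists to make the shapes comparable during distillation, it is typically discarded once distillation finishes and plays no role in fine-tuning.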
Hey guys, I need clarification about the situations in which this method should be applied.
Am I right that this method is best used when the teacher model is much more complex than the student model? In that case we could get comparable accuracy from a student with far fewer parameters.
Or must the student and teacher share the same architecture? In that case I don't understand why we wouldn't just fine-tune the pre-trained teacher network directly.