NikAleksFed opened this issue 2 years ago
We only experimented with student models that have the same number of parameters as the teacher models. The method should also work for scenarios with different teacher and student architectures.
The baseline you mention directly uses the pre-trained teacher model for fine-tuning. We showed that fine-tuning the distilled student model works much better than this baseline.
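(Editor's note: for readers landing here, the setup discussed above is feature-map distillation, where a frozen teacher supplies target feature maps that the student regresses onto before the usual fine-tuning stage. Below is a minimal PyTorch sketch of that generic setup, not this repo's actual code; the `teacher`/`student` backbones, the `distill_step` helper, and the smooth-L1 objective are all illustrative assumptions.)

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, optimizer):
    """One distillation step: regress the student's feature map onto the teacher's."""
    teacher.eval()
    with torch.no_grad():
        t_feat = teacher(images)   # frozen teacher provides the target feature map
    s_feat = student(images)       # student predicts the same map
    loss = F.smooth_l1_loss(s_feat, t_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After this distillation stage, the student (rather than the teacher) is fine-tuned on the downstream task, which is the comparison the answer above refers to.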
@ancientmooner Hi, a small question regarding your answer: if the output feature map sizes of the teacher and the student are not the same, how can the feature maps be distilled in your method?
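(Editor's note: the thread doesn't show how this repo handles a size mismatch, so the following is only a sketch of one common workaround from the feature-distillation literature: bilinear interpolation to align spatial resolution, plus a learned 1x1 convolution to align channel counts, before computing the loss. `FeatureAligner` and its arguments are hypothetical names, not part of this codebase.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    """Projects a student feature map onto the teacher's shape before the loss."""
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # hypothetical 1x1 projection head, trained jointly with the student
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, s_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # match the teacher's spatial resolution first...
        s_feat = F.interpolate(s_feat, size=t_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        # ...then match its channel count, and compare
        return F.smooth_l1_loss(self.proj(s_feat), t_feat)
```

Since the projection only exists to make the shapes comparable during distillation, it is typically discarded once distillation finishes and plays no role in fine-tuning.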
Hey guys, I need clarification about the situations in which this method should be applied.
Am I right that this method is best used when the teacher model is much more complex than the student model? In that case we could get comparable accuracy from a student with far fewer parameters.
Or must the student and teacher share the same architecture? In that case I don't understand why we wouldn't just fine-tune the pre-trained teacher network directly.