RuipingL / TransKD


Choice of teacher model #3

Closed Jayden9912 closed 1 year ago

Jayden9912 commented 1 year ago

Is there any particular reason for choosing mit_b2 over mit_b5 as the teacher model?

RuipingL commented 1 year ago

Both MIT B2 and MIT B5 are large-scale versions that use the MLP decoder channel dimension C=768. According to the original paper, MIT B5 has about three times as many parameters as MIT B2 (84.7M vs. 27.5M) and about twice the computational cost (1460.4 GFLOPs vs. 717.1 GFLOPs), but achieves only a 1.4% gain over MIT B2. Given the training speed and our limited resources, MIT B2 is the more suitable teacher for verifying the performance of our knowledge distillation framework.
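As a quick sanity check on the "three times" and "twice" figures, the ratios can be computed from the numbers quoted in this reply. This is only a minimal sketch: the dictionary and variable names are illustrative and are not taken from the TransKD code.

```python
# Figures as quoted in the reply above (attributed there to the original paper).
# Names below are hypothetical, chosen only for this illustration.
mit_b2 = {"params_m": 27.5, "gflops": 717.1}
mit_b5 = {"params_m": 84.7, "gflops": 1460.4}

# How much larger and more expensive B5 is relative to B2.
param_ratio = mit_b5["params_m"] / mit_b2["params_m"]
flops_ratio = mit_b5["gflops"] / mit_b2["gflops"]

print(f"parameters: {param_ratio:.2f}x")  # parameters: 3.08x
print(f"GFLOPs:     {flops_ratio:.2f}x")  # GFLOPs:     2.04x
```

So the teacher would cost roughly 3x the parameters and 2x the compute for the 1.4% gain mentioned above.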