Choice of teacher model

Both MIT B2 and B5 are large-scale versions that use an MLP decoder channel dimension of C=768. According to the original paper, MIT B5 has roughly three times the parameters of MIT B2 (84.7M vs. 27.5M) and twice the computational complexity (1460.4 GFLOPs vs. 717.1 GFLOPs), yet yields only a 1.4% gain over MIT B2. Given the training speed and our limited resources, MIT B2 is the more suitable teacher for verifying the performance of our knowledge distillation framework.
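
As a quick sanity check on the "three times" and "twice" figures, the ratios can be recomputed from the numbers quoted above. This is only a sketch: the dictionary names and keys are made up for illustration, and the values are copied from this reply rather than re-measured.

```python
# Figures as quoted in the reply above (copied from the text, not re-measured).
# The names `mit_b2`, `mit_b5`, `params_m`, and `gflops` are illustrative only.
mit_b2 = {"params_m": 27.5, "gflops": 717.1}
mit_b5 = {"params_m": 84.7, "gflops": 1460.4}

# Relative cost of using B5 instead of B2 as the teacher.
param_ratio = mit_b5["params_m"] / mit_b2["params_m"]
flops_ratio = mit_b5["gflops"] / mit_b2["gflops"]

print(f"parameters: {param_ratio:.2f}x")  # parameters: 3.08x
print(f"GFLOPs:     {flops_ratio:.2f}x")  # GFLOPs:     2.04x
```

So B5 costs about 3.08x the parameters and 2.04x the compute of B2, in exchange for the 1.4% gain mentioned above.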
