Hi, I'm interested in running KD++ on larger models such as RN50x4, RN50x16, RN50x64, ViT-L, and ViT-H. I'm wondering if you have any tips on selecting the weights for the three losses: classification loss (cls loss), knowledge distillation loss (kd loss), and norm direction loss (nd loss). It appears that as the teacher model grows in size, the weights of kd loss and nd loss also increase. Could you provide some guidance on this matter?
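For context, the three losses are typically combined as a weighted sum. A minimal sketch of that combination is below; the weight names `w_cls`, `w_kd`, and `w_nd` and their default values are illustrative assumptions, not the repo's actual hyperparameter names:

```python
def total_loss(cls_loss, kd_loss, nd_loss, w_cls=1.0, w_kd=1.0, w_nd=1.0):
    """Weighted sum of the three training losses (cls, kd, nd).

    Weight names and defaults are illustrative assumptions;
    the KD++ configs may use different names and values.
    """
    return w_cls * cls_loss + w_kd * kd_loss + w_nd * nd_loss

# Example: upweighting kd and nd, as might be done with a larger teacher.
print(total_loss(2.0, 1.5, 0.5, w_cls=1.0, w_kd=2.0, w_nd=2.0))  # → 6.0
```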