WangYZ1608 / Knowledge-Distillation-via-ND

The official implementation for paper: Improving Knowledge Distillation via Regularizing Feature Norm and Direction

weight of 3 losses for bigger teachers #8

Closed HanGuangXin closed 1 year ago

HanGuangXin commented 1 year ago

Hi, I'm interested in running KD++ with larger teacher models such as RN50x4, RN50x16, RN50x64, ViT-L, and ViT-H. Do you have any tips on selecting the weights for the three losses: the classification loss (cls loss), the knowledge distillation loss (kd loss), and the norm-and-direction loss (nd loss)? From the configurations below, it appears that as the teacher grows in size, the weights of the kd and nd losses also increase. Could you provide some guidance on this?

# KD++
# --------------------------------------------------------
# 1. ViT-S     - resnet18      1.0*cls + 1.5*kd + 1.0*nd
# 2. ViT-B     - resnet18      1.0*cls + 2.0*kd + 1.0*nd
# --------------------------------------------------------
# 1. resnet34  - resnet18      1.0*cls + 2.5*kd + 1.0*nd
# 2. resnet50  - resnet18      1.0*cls + 4.0*kd + 1.0*nd
# 3. resnet101 - resnet18      1.0*cls + 3.5*kd + 4.0*nd
# 4. resnet152 - resnet18      1.0*cls + 4.0*kd + 2.0*nd
# 5. resnet50  - mobilenetv1   1.0*cls + 4.0*kd + 1.0*nd
# --------------------------------------------------------
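For context, a minimal sketch of how such a weighted three-term objective is typically assembled in PyTorch. The names cls_weight, kd_weight, nd_weight, the temperature T, and the simplified nd term (a cosine-alignment stand-in for the paper's norm-and-direction regularizer) are assumptions for illustration only; the actual ND loss is defined in the paper and the official repository, not here.

import torch
import torch.nn.functional as F

def kdpp_total_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                    targets, cls_weight=1.0, kd_weight=4.0, nd_weight=1.0, T=4.0):
    # Illustrative sketch of a weighted cls + kd + nd objective.
    # The nd term below is a hypothetical stand-in, NOT the paper's exact ND loss.

    # 1) classification loss on ground-truth labels
    cls_loss = F.cross_entropy(student_logits, targets)

    # 2) vanilla KD loss: KL divergence between temperature-softened distributions
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # 3) stand-in direction term: encourage the student feature to point in the
    #    same direction as the (detached) teacher feature
    nd_loss = 1.0 - F.cosine_similarity(student_feat, teacher_feat.detach(), dim=1).mean()

    return cls_weight * cls_loss + kd_weight * kd_loss + nd_weight * nd_loss

With this sketch, the resnet50 - resnet18 entry above would correspond to cls_weight=1.0, kd_weight=4.0, nd_weight=1.0.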
WangYZ1608 commented 1 year ago

Due to GPU memory limitations, we have not tried larger models. If you are interested, you can refer to Supplementary Material B.1.

HanGuangXin commented 1 year ago

Thanks, sincerely!