HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods
BSD 2-Clause "Simplified" License
2.11k stars 389 forks source link

KD method in both configurations seems to be doing better than all other methods except the one from your paper #26

Closed ksachdeva closed 3 years ago

ksachdeva commented 3 years ago

Hi,

Maybe I am not correctly understanding these below-mentioned tables (copied from your README at the root of the repo) but it seems that for almost all the configurations KD worked better than all the methods except the one proposed by you.

The purpose of all these methods originally was to improve upon KD (Hinton et al) so I am very surprised by this table.

Please guide

Regards & thanks Kapil

  1. Teacher and student are of the same architectural type.
Teacher
Student
wrn-40-2
wrn-16-2
wrn-40-2
wrn-40-1
resnet56
resnet20
resnet110
resnet20
resnet110
resnet32
resnet32x4
resnet8x4
vgg13
vgg8
Teacher
Student
75.61
73.26
75.61
71.98
72.34
69.06
74.31
69.06
74.31
71.14
79.42
72.50
74.64
70.36
KD 74.92 73.54 70.66 70.67 73.08 73.33 72.98
FitNet 73.58 72.24 69.21 68.99 71.06 73.50 71.02
AT 74.08 72.77 70.55 70.22 72.31 73.44 71.43
SP 73.83 72.43 69.67 70.04 72.69 72.94 72.68
CC 73.56 72.21 69.63 69.48 71.48 72.97 70.71
VID 74.11 73.30 70.38 70.16 72.61 73.09 71.23
RKD 73.35 72.22 69.61 69.25 71.82 71.90 71.48
PKT 74.54 73.45 70.34 70.25 72.61 73.64 72.88
AB 72.50 72.38 69.47 69.53 70.98 73.17 70.94
FT 73.25 71.59 69.84 70.22 72.37 72.86 70.58
FSP 72.91 0.00 69.95 70.11 71.89 72.62 70.23
NST 73.68 72.24 69.60 69.53 71.96 73.30 71.53
CRD 75.48 74.14 71.16 71.46 73.48 75.51 73.94
  1. Teacher and student are of different architectural type.
Teacher
Student
vgg13
MobileNetV2
ResNet50
MobileNetV2
ResNet50
vgg8
resnet32x4
ShuffleNetV1
resnet32x4
ShuffleNetV2
wrn-40-2
ShuffleNetV1
Teacher
Student
74.64
64.60
79.34
64.60
79.34
70.36
79.42
70.50
79.42
71.82
75.61
70.50
KD 67.37 67.35 73.81 74.07 74.45 74.83
FitNet 64.14 63.16 70.69 73.59 73.54 73.73
AT 59.40 58.58 71.84 71.73 72.73 73.32
SP 66.30 68.08 73.34 73.48 74.56 74.52
CC 64.86 65.43 70.25 71.14 71.29 71.38
VID 65.56 67.57 70.30 73.38 73.40 73.61
RKD 64.52 64.43 71.50 72.28 73.21 72.21
PKT 67.13 66.52 73.01 74.10 74.69 73.89
AB 66.06 67.20 70.65 73.55 74.31 73.34
FT 61.78 60.99 70.29 71.75 72.50 72.03
NST 58.16 64.96 71.28 74.12 74.68 74.89
CRD 69.73 69.11 74.30 75.11 75.65 76.05
HobbitLong commented 3 years ago

Hi, @ksachdeva,

It's a bit messy in evaluations of some previous papers. Some of them directly combine KD with their own object to claim outperformance over KD, while some others directly compare their single objective with KD.

The table you see here compares each single distillation objective without combining KD. As for the results of combining KD, please refer to the appendix.