bellymonster / Weighted-Soft-Label-Distillation


Hi, I cannot reproduce your reported performance on CIFAR-100. #6

Closed DeepLearningHB closed 2 years ago

DeepLearningHB commented 3 years ago

Hi there, I'm trying to use your method on CIFAR-100, but I cannot reproduce your performance even though I followed your script and hyper-parameter settings. For instance, the ResNet110-ResNet32 pair reaches 74.12% in your paper, but only 72.91% in my implementation. I was able to reproduce your result only for the ResNet56-ResNet20 pair (72.01 / 72.15). That is quite a large performance gap between your numbers and mine. In addition, your repository only contains the ImageNet training script; if you don't mind uploading the CIFAR-100 training script, I could train your method with it.

Thanks!

woshichase commented 3 years ago

Hi! We believe it's more convincing to reproduce the results on a large-scale dataset such as ImageNet, so we chose to organize and upload the ImageNet training code. If you have problems running on CIFAR-100 and have preserved the training logs (ideally containing all training hyper-parameters and the training loss/accuracy), you can send them to me and we can check what went wrong.
Thanks for your attention!

songshucode commented 2 years ago

Thanks for sharing your work. But I meet the same problem on CIFAR-100. I used the code and the hyper-parameters you shared. For instance, resnet32x4-resnet8x4 reaches 75.07%, but the result in your paper is 76.05%. I don't know how to address this issue. Can you give me some suggestions about the hyper-parameters or code for CIFAR-100?

Thanks!

woshichase commented 2 years ago

@songshucode Hi, thanks for your attention! Is that averaged over 5 runs? For CIFAR-100, alpha is set to 2.25 and the temperature is 4. Note that there is a default 4^2 loss weight, so the total loss weight for WSL is 2.25 * 4^2 = 36.
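As a quick sanity check of the weighting described above, the effective coefficient on the soft loss works out as follows (assuming the default T^2 scaling mentioned in the reply):

```python
# Effective weight on the WSL soft loss for the CIFAR-100 settings above.
alpha = 2.25          # alpha reported for CIFAR-100
temperature = 4.0     # distillation temperature

# The T^2 factor applied inside the loss times alpha gives the total weight.
effective_weight = alpha * temperature ** 2
print(effective_weight)  # 36.0
```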

songshucode commented 2 years ago

@woshichase Thanks for your reply. The results of 5 runs are as follows.

test 1   test 2   test 3   test 4   test 5
75.15    75.15    75.03    75.18    74.85
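Averaging those five runs gives the 75.07% figure quoted earlier (values copied from the table above):

```python
# Mean accuracy over the five reported runs.
accs = [75.15, 75.15, 75.03, 75.18, 74.85]
mean_acc = sum(accs) / len(accs)
print(round(mean_acc, 2))  # 75.07
```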

The loss-function code is copied from your shared code, as follows:

fc_t = logits                      # teacher logits
out = self.student(x)              # student logits

s_input_for_softmax = out / self.temperature
t_input_for_softmax = fc_t / self.temperature

t_soft_label = self.softmax(t_input_for_softmax)

# per-sample soft cross-entropy between teacher and student distributions
softmax_loss = - torch.sum(t_soft_label * self.logsoftmax(s_input_for_softmax), 1, keepdim=True)

# detach so the focal weight carries no gradient
out_auto = out.detach()
fc_t_auto = fc_t.detach()
log_softmax_s = self.logsoftmax(out_auto)
log_softmax_t = self.logsoftmax(fc_t_auto)
one_hot_label = F.one_hot(y, num_classes=100).float()
softmax_loss_s = - torch.sum(one_hot_label * log_softmax_s, 1, keepdim=True)
softmax_loss_t = - torch.sum(one_hot_label * log_softmax_t, 1, keepdim=True)

# sample-wise weight: ratio of student CE to teacher CE, clipped below at zero
focal_weight = softmax_loss_s / softmax_loss_t
ratio_lower = torch.zeros(1).cuda()
focal_weight = torch.max(focal_weight, ratio_lower)
focal_weight = 1 - torch.exp(- focal_weight)
softmax_loss = focal_weight * softmax_loss

# T^2 scaling keeps the soft-loss gradient magnitude comparable to the hard loss
soft_loss = (self.temperature ** 2) * torch.mean(softmax_loss)

hard_loss = self.hard_loss(out, y)

loss = hard_loss + self.alpha * soft_loss

where alpha = 2.25 and temperature = 4.0. Thanks for your attention!
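For anyone reading along, here is a minimal NumPy sketch of the same focal-weighting logic, stripped of the module state (`self.student`, `self.softmax`, etc.) so it can be run standalone. The logits and labels are random placeholders, not real model outputs; this is only meant to illustrate the shape of the computation, not to reproduce the paper's numbers:

```python
import numpy as np

def log_softmax(z):
    # numerically stable log-softmax along the class axis
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def wsl_soft_loss(logits_s, logits_t, labels, temperature=4.0, num_classes=100):
    # soft cross-entropy between temperature-scaled teacher and student
    t_soft = np.exp(log_softmax(logits_t / temperature))
    ce_soft = -(t_soft * log_softmax(logits_s / temperature)).sum(1, keepdims=True)

    # per-sample hard CE of student and teacher (no temperature), used as the weight
    one_hot = np.eye(num_classes)[labels]
    ce_s = -(one_hot * log_softmax(logits_s)).sum(1, keepdims=True)
    ce_t = -(one_hot * log_softmax(logits_t)).sum(1, keepdims=True)

    # focal weight in [0, 1): larger when the student lags the teacher
    focal_weight = np.maximum(ce_s / ce_t, 0.0)
    focal_weight = 1.0 - np.exp(-focal_weight)

    return (temperature ** 2) * (focal_weight * ce_soft).mean()

# random placeholder logits/labels for a batch of 8 (hypothetical, for illustration)
rng = np.random.default_rng(0)
logits_s = rng.normal(size=(8, 100))
logits_t = rng.normal(size=(8, 100))
labels = rng.integers(0, 100, size=8)
print(wsl_soft_loss(logits_s, logits_t, labels))
```

Since the focal weight lies in [0, 1) and the soft cross-entropy is non-negative, the loss is always non-negative, matching the clipping in the original snippet.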

woshichase commented 2 years ago

@songshucode Hi, thanks for sharing. I currently find no flaws in your loss-function code or hyper-parameters. I suggest looking at the choice of training repository. To keep consistency with the ImageNet experiment, we also ran CIFAR-100 on the OverHaul repo instead of the CRD repo, moving the CRD CIFAR-100 training settings over to OverHaul. Also, the pretrained teachers were re-trained on OverHaul with the same settings as the student.

woshichase commented 2 years ago

Any further comments?