bellymonster / Weighted-Soft-Label-Distillation


Hello, I have a question about training CIFAR-100 #4

Closed DeepLearningHB closed 3 years ago

DeepLearningHB commented 3 years ago

Hi, I read your paper with great interest. Thank you.

I implemented your method using your code from this repository and tested it on CIFAR-100, but in my case gradient explosion occurred. Training works well for the first 15 epochs, but from epoch 16 both accuracy and loss collapse toward 0. Even after lowering the learning rate (0.05 -> 0.01), I could not get rid of the gradient explosion. How can I solve this?

Thank you.

DeepLearningHB commented 3 years ago

FYI:
Epoch: [28][0/196]   Time 0.330 (0.330)  Data 0.283 (0.283)  Loss 18.4588 (18.4588)  Acc@1 75.391 (75.391)  Acc@5 95.703 (95.703)
Epoch: [28][100/196] Time 0.067 (0.070)  Data 0.002 (0.006)  Loss 18.1703 (17.9548)  Acc@1 77.344 (77.023)  Acc@5 93.750 (95.440)
[Train] Acc@1 76.274  Acc@5 95.244
Test: [0/79]  Time 0.042 (0.068)  Loss 21.2584 (18.1365)  Acc@1 63.281 (63.281)  Acc@5 89.844 (89.844)
Acc@1 61.430  Acc@5 87.290
Epoch: [29][0/196]   Time 0.293 (0.293)  Data 0.244 (0.244)  Loss 17.1295 (17.1295)  Acc@1 80.078 (80.078)  Acc@5 95.312 (95.312)
Epoch: [29][100/196] Time 0.067 (0.069)  Data 0.002 (0.005)  Loss 17.7287 (17.6846)  Acc@1 79.688 (77.970)  Acc@5 94.141 (95.796)
[Train] Acc@1 45.784  Acc@5 57.934
Test: [0/79]  Time 0.042 (0.068)  Loss nan (nan)  Acc@1 0.000 (0.000)  Acc@5 3.125 (3.125)
Acc@1 1.000  Acc@5 5.000
Epoch: [30][0/196]   Time 0.332 (0.332)  Data 0.286 (0.286)  Loss nan (nan)  Acc@1 0.781 (0.781)  Acc@5 3.125 (3.125)
Epoch: [30][100/196] Time 0.067 (0.070)  Data 0.003 (0.006)  Loss nan (nan)  Acc@1 0.391 (1.002)  Acc@5 3.516 (5.105)

woshichase commented 3 years ago

Hi, thanks for your attention. We haven't encountered the loss explosion problem. Apart from re-checking your training settings, I would suggest you also check whether the baseline experiment (without the soft loss) hits the same problem. If the baseline runs normally, the abnormality is likely caused by the soft loss. In that case you can set alpha (originally 2.25 for CIFAR-100) to a smaller value, or check whether (1 - exp(-Ls/Lt)) falls outside the range (0, 1) (usually this is unlikely to happen).
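For reference, a minimal sketch of the kind of range check suggested above, in PyTorch. The function name and the tensor names `ce_s` / `ce_t` (per-sample cross-entropy of the student and the teacher against the hard label) are assumptions for illustration, not identifiers from this repository:

```python
import torch

def check_soft_weight(ce_s: torch.Tensor, ce_t: torch.Tensor) -> torch.Tensor:
    """Compute the per-sample weight w = 1 - exp(-ce_s / ce_t) and flag
    values that are non-finite or outside (0, 1).

    ce_s / ce_t: per-sample cross-entropy of the student / teacher
    with respect to the ground-truth label (both of shape [batch]).
    """
    ratio = ce_s / ce_t
    weight = 1.0 - torch.exp(-ratio)
    if not torch.isfinite(weight).all():
        print("warning: non-finite soft-label weight (check ce_t for zeros)")
    elif (weight <= 0).any() or (weight >= 1).any():
        print("warning: soft-label weight outside (0, 1)")
    return weight
```

If the weight ever goes non-finite, a near-zero denominator (the teacher's cross-entropy) is the usual suspect, which is consistent with the epsilon fix reported below.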

DeepLearningHB commented 3 years ago

I solved this problem by adding a small epsilon to focal_weight :) It works well now!
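For anyone hitting the same issue, a minimal sketch of that kind of fix, under the assumption that the weight is formed from the loss ratio in the formula above; the exact place the commenter added the epsilon is not stated, and the names and the epsilon value here are illustrative, not taken from the repository:

```python
import torch

def focal_weight_with_eps(ce_s: torch.Tensor,
                          ce_t: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """Per-sample weight 1 - exp(-ce_s / (ce_t + eps)); the small eps keeps
    the ratio finite when the teacher's cross-entropy is close to zero."""
    return 1.0 - torch.exp(-ce_s / (ce_t + eps))
```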