Open amiller195 opened 3 years ago
Using KL divergence instead of cross-entropy (CE), and rescaling the KL divergence into the normal loss range. The distillation setup details are in Sec 4.4.
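For anyone unsure what a KL-based distillation loss looks like in practice, here is a minimal pure-Python sketch for a single example. It is not the authors' code. The function names `log_softmax` and `kd_kl_loss` are hypothetical, and multiplying by `temperature ** 2` is only one common way to bring the softened KL term back to a typical loss magnitude; whether that matches the rescaling meant above is an assumption.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]

def kd_kl_loss(student_logits, teacher_logits, temperature=3.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the result is multiplied by temperature**2, a common
    convention for rescaling; the paper's exact rescaling may differ.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2

if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    student = [2.0, 2.0, 1.0]
    print(kd_kl_loss(student, teacher))  # positive when the two differ
    print(kd_kl_loss(teacher, teacher))  # zero when they match
```

In a real training loop the same quantity would be averaged over the batch, with the teacher logits treated as constants.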
Hi, thank you for the great work!
Sorry, I also have the same question as above and wonder whether it has been resolved.
I couldn't reproduce the accuracy on ImageNet with the 140k images provided. I can only reach just over 30% top-1 accuracy when following Sec 4.4 of the paper. My training setup: batch size 256, temperature 3, KL loss only (relying only on the teacher logits), 250 epochs, learning rate 1.0, and SGD with the learning rate decayed every 80 epochs.
Many thanks!
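To make the schedule in the setup above concrete, here is a small sketch of a step-decay learning rate: base learning rate 1.0, decayed every 80 epochs over 250 epochs. The decay factor is not stated in the comment, so `gamma=0.1` is purely an assumption, and `step_lr` is a hypothetical helper, not part of any library.

```python
def step_lr(epoch, base_lr=1.0, step_size=80, gamma=0.1):
    """Learning rate at a given 0-indexed epoch under step decay.

    Assumption: gamma=0.1 is illustrative; the thread does not give
    the actual decay factor.
    """
    return base_lr * gamma ** (epoch // step_size)

if __name__ == "__main__":
    # Print the learning rate at the start of each decay stage.
    for epoch in (0, 80, 160, 240):
        print(epoch, step_lr(epoch))
```

With these values the learning rate is decayed three times within 250 epochs (at epochs 80, 160, and 240).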
same question
Hi, very interesting work! According to Table 6 in the paper, training for 90 epochs with the 140k generated dataset should reach a top-1 accuracy of 68.0%. I'm trying to train ResNet50 v1.5 following the protocol at https://github.com/NVIDIA/DeepLearningExamples with the 140k dataset, but I can't get past a top-1 accuracy of 10%.
Can you please elaborate on the training process using the generated 140k images? What protocol or additional work was required to reach the mentioned accuracy?
Thanks!