They only share the same architecture, not the model weights, due to different initialization. In practice, training a full-precision learner with the distillation loss can indeed improve performance. Below is a comparison of the classification accuracy of models trained with and without the distillation loss on CIFAR-10:
| Model | Distillation? | Accuracy | Improvement |
|---|---|---|---|
| ResNet-20 | | 91.93% | |
| ResNet-20 | + | 93.10% | 1.17% |
| ResNet-32 | | 92.59% | |
| ResNet-32 | + | 93.44% | 0.85% |
| ResNet-44 | | 92.76% | |
| ResNet-44 | + | 93.71% | 0.95% |
| ResNet-56 | | 93.23% | |
| ResNet-56 | + | 94.01% | 0.78% |
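For reference, the distillation loss used here follows the standard soft-target formulation: cross-entropy on the ground-truth labels plus a temperature-softened term that matches the student's predictions to the teacher's. Below is a minimal TensorFlow sketch of such a loss; the `temperature` and `alpha` hyper-parameters and the function name are illustrative, not PocketFlow's actual options.

```python
import tensorflow as tf

def distillation_loss(labels, student_logits, teacher_logits,
                      temperature=4.0, alpha=0.9):
  """Soft-target distillation loss (sketch, not PocketFlow's exact code)."""
  # hard-label cross-entropy on the student's raw logits
  hard_loss = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
          labels=labels, logits=student_logits))
  # temperature-softened teacher targets and student predictions
  teacher_probs = tf.nn.softmax(teacher_logits / temperature)
  student_log_probs = tf.nn.log_softmax(student_logits / temperature)
  # cross-entropy to the teacher's soft targets (equals KL up to a constant)
  soft_loss = -tf.reduce_mean(
      tf.reduce_sum(teacher_probs * student_log_probs, axis=-1))
  # T^2 keeps the soft-target gradients on the same scale as the hard ones
  return (1.0 - alpha) * hard_loss + alpha * (temperature ** 2) * soft_loss
```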
OK, thanks
The distillation result is good when I distill a model on a small dataset, but when I distill `resnet_v1_50` on ImageNet, the result is not good. The top-5 accuracy of the teacher model is 92.7%, but the top-5 accuracy of the student model is only 92%.
[1] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing Complex Network: Network Compression via Factor Transfer. NIPS 2018.
[2] Sergey Zagoruyko and Nikos Komodakis. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. ICLR 2017.
Thanks. I am not compressing the model; I only want to finetune `resnet_v1_50` to get higher accuracy. Can I use distillation?
Then you can try out the above two papers' algorithms (not implemented in PocketFlow).
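If it helps, here is a minimal sketch of the activation-based attention-transfer loss in the spirit of [2]; the feature-map arguments and the assumption that teacher and student maps share spatial size are mine, and this is not PocketFlow code.

```python
import tensorflow as tf

def attention_transfer_loss(teacher_feat, student_feat):
  """Attention-transfer loss in the spirit of [2] (sketch only).

  Both inputs are NHWC feature maps from matching stages of the two
  networks; their spatial sizes are assumed to agree.
  """
  def attention_map(feat):
    # sum of squared activations over channels -> one map per sample
    amap = tf.reduce_sum(tf.square(feat), axis=-1)
    # flatten and L2-normalize each sample's attention map
    amap = tf.reshape(amap, [tf.shape(amap)[0], -1])
    return tf.nn.l2_normalize(amap, axis=-1)

  # mean squared distance between the normalized attention maps
  return tf.reduce_mean(
      tf.square(attention_map(teacher_feat) - attention_map(student_feat)))
```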
thanks
In `full-prec` mode, `DistillationHelper` creates a learner, and `FullPrecLearner` uses the same `model_helper`. I think the distilled model is therefore the same model as the primary model, so no real distillation happens. I suggest providing a separate `model_helper` for the distilled model.
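To make the suggestion concrete, here is a purely hypothetical sketch (this is not PocketFlow's actual API; the `forward_train` method and argument names are assumptions) of building the student and the distilled (teacher) model from two different model helpers:

```python
import tensorflow as tf

def build_distillation_graph(images, labels,
                             student_model_helper, teacher_model_helper,
                             loss_fn):
  """Hypothetical sketch: build student and teacher from separate helpers.

  Assumes each helper exposes a `forward_train(images)` method returning
  logits (an assumption, not PocketFlow's documented interface), and that
  `loss_fn` is something like the distillation loss sketched earlier.
  """
  student_logits = student_model_helper.forward_train(images)
  # the teacher is typically pre-trained; freeze it so only the student learns
  teacher_logits = tf.stop_gradient(teacher_model_helper.forward_train(images))
  return loss_fn(labels, student_logits, teacher_logits)
```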