They only share the same architecture, not the model weights, due to different initialization. In practice, training a full-precision learner with the distillation loss can indeed improve performance. Below is a comparison of the classification accuracy of models trained with and without the distillation loss on CIFAR-10:
| Model | Distillation? | Accuracy | Improvement |
|---|---|---|---|
| ResNet-20 | | 91.93% | |
| ResNet-20 | + | 93.10% | 1.17% |
| ResNet-32 | | 92.59% | |
| ResNet-32 | + | 93.44% | 0.85% |
| ResNet-44 | | 92.76% | |
| ResNet-44 | + | 93.71% | 0.95% |
| ResNet-56 | | 93.23% | |
| ResNet-56 | + | 94.01% | 0.78% |
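For reference, the distillation loss used here follows the standard soft-target formulation: cross-entropy on the ground-truth labels plus a temperature-softened term that matches the student's predictions to the teacher's. Below is a minimal TensorFlow sketch of such a loss; the `temperature` and `alpha` hyper-parameters and the function name are illustrative, not PocketFlow's actual options.

```python
import tensorflow as tf

def distillation_loss(labels, student_logits, teacher_logits,
                      temperature=4.0, alpha=0.9):
  """Soft-target distillation loss (sketch, not PocketFlow's exact code)."""
  # hard-label cross-entropy on the student's raw logits
  hard_loss = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
          labels=labels, logits=student_logits))
  # temperature-softened teacher targets and student predictions
  teacher_probs = tf.nn.softmax(teacher_logits / temperature)
  student_log_probs = tf.nn.log_softmax(student_logits / temperature)
  # cross-entropy to the teacher's soft targets (equals KL up to a constant)
  soft_loss = -tf.reduce_mean(
      tf.reduce_sum(teacher_probs * student_log_probs, axis=-1))
  # T^2 keeps the soft-target gradients on the same scale as the hard ones
  return (1.0 - alpha) * hard_loss + alpha * (temperature ** 2) * soft_loss
```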
OK, thanks
The distillation result is good when I distill a model on a small dataset, but when I distill `resnet_v1_50` on ImageNet, the result is not good. The top-5 accuracy of the teacher model is 92.7%, but the top-5 accuracy of the student model is only 92%.
[1] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing Complex Network: Network Compression via Factor Transfer. NIPS 2018.
[2] Sergey Zagoruyko and Nikos Komodakis. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. ICLR 2017.
Thanks. I am not compressing the model; I only want to finetune `resnet_v1_50` to get higher accuracy. Can I use distillation?
Then you can try out the above two papers' algorithms (not implemented in PocketFlow).
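If it helps, here is a minimal sketch of the activation-based attention-transfer loss in the spirit of [2]; the feature-map arguments and the assumption that teacher and student maps share spatial size are mine, and this is not PocketFlow code.

```python
import tensorflow as tf

def attention_transfer_loss(teacher_feat, student_feat):
  """Attention-transfer loss in the spirit of [2] (sketch only).

  Both inputs are NHWC feature maps from matching stages of the two
  networks; their spatial sizes are assumed to agree.
  """
  def attention_map(feat):
    # sum of squared activations over channels -> one map per sample
    amap = tf.reduce_sum(tf.square(feat), axis=-1)
    # flatten and L2-normalize each sample's attention map
    amap = tf.reshape(amap, [tf.shape(amap)[0], -1])
    return tf.nn.l2_normalize(amap, axis=-1)

  # mean squared distance between the normalized attention maps
  return tf.reduce_mean(
      tf.square(attention_map(teacher_feat) - attention_map(student_feat)))
```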
thanks
In `full-prec` mode, `DistillationHelper` creates a learner, and `FullPrecLearner` uses the same `model_helper`. I think the distilled model is therefore the same model as the primary model, so no real distillation happens. I suggest providing a separate `model_helper` for the distilled model.
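To make the suggestion concrete, here is a purely hypothetical sketch (this is not PocketFlow's actual API; the `forward_train` method and argument names are assumptions) of building the student and the distilled (teacher) model from two different model helpers:

```python
import tensorflow as tf

def build_distillation_graph(images, labels,
                             student_model_helper, teacher_model_helper,
                             loss_fn):
  """Hypothetical sketch: build student and teacher from separate helpers.

  Assumes each helper exposes a `forward_train(images)` method returning
  logits (an assumption, not PocketFlow's documented interface), and that
  `loss_fn` is something like the distillation loss sketched earlier.
  """
  student_logits = student_model_helper.forward_train(images)
  # the teacher is typically pre-trained; freeze it so only the student learns
  teacher_logits = tf.stop_gradient(teacher_model_helper.forward_train(images))
  return loss_fn(labels, student_logits, teacher_logits)
```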