This is an implementation of the basic idea behind Hinton's Knowledge Distillation paper. We do not reproduce the exact results, but rather show that the idea works.
While a few other implementations are available, their code flow is not very intuitive. Here we generate the soft targets from the teacher in an on-line manner while training the student network.
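As a rough sketch of this idea (not necessarily the exact code in this repository), the distillation loss combines a cross-entropy term against the teacher's temperature-softened predictions with the usual hard-label cross-entropy. The names `student_logits`, `teacher_logits`, `labels`, `temperature`, and `alpha` below are illustrative placeholders, not identifiers from this repo:

```python
import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, alpha=0.5):
    """Illustrative distillation loss (hypothetical helper, not from this repo)."""
    # Soften the teacher's logits with the temperature to obtain soft targets,
    # and stop gradients so the student never updates the teacher.
    soft_targets = tf.stop_gradient(tf.nn.softmax(teacher_logits / temperature))

    # Cross-entropy between the soft targets and the student's softened logits.
    soft_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=soft_targets, logits=student_logits / temperature))

    # Ordinary cross-entropy against the one-hot labels (temperature = 1).
    hard_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=labels, logits=student_logits))

    # The soft term is scaled by T^2, as suggested in the paper, so that its
    # gradient magnitude stays comparable to the hard term as T changes.
    return alpha * (temperature ** 2) * soft_loss + (1.0 - alpha) * hard_loss
```

The T² scaling on the soft term follows the paper's suggestion to keep the relative contribution of the two terms roughly constant as the temperature changes.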
The big and small models have been taken from here, with some modification (we currently have a simple softmax regression, as in TF's tutorial).
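For reference, and assuming the simple softmax regression refers to the small (student) model, such a model looks roughly like the following; the 784-dimensional input and 10 classes are assumptions for illustration, not taken from this repo:

```python
import tensorflow as tf

# Minimal softmax-regression student: a single fully connected layer.
# Input/output sizes (784 features, 10 classes) are assumed for illustration.
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b
# When training with soft targets, divide the logits by the temperature before the softmax.
probs = tf.nn.softmax(logits)
```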
While this may or may not be a good way to implement the distillation architecture, it leads to a good improvement in the (small) student model. If you find any bugs or have suggestions, feel free to create an issue or even send in a pull request.
TensorFlow 1.3 or above
Train the Teacher Model
python main.py --model_type teacher --checkpoint_dir teachercpt --num_steps 5000 --temperature 5
Train the Student Model (in a standalone manner for comparison)
python main.py --model_type student --checkpoint_dir studentcpt --num_steps 5000
Train the Student Model (using soft targets from the teacher model)
python main.py --model_type student --checkpoint_dir studentcpt --load_teacher_from_checkpoint true --load_teacher_checkpoint_dir teachercpt --num_steps 5000 --temperature 5
Model | Accuracy (%) at temperature 2 | Accuracy (%) at temperature 5 |
---|---|---|
Teacher Only | 97.9 | 98.12 |
Distillation | 89.14 | 90.77 |
Student Only | 88.84 | 88.84 |
When trained without the soft labels, the small (student) model always uses temperature = 1.
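For reference, the temperature-scaled softmax referred to above turns logits $z_i$ into probabilities

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$

so setting $T = 1$ recovers the ordinary softmax, while larger $T$ yields a softer distribution over classes.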