In this story, Distilling the Knowledge in a Neural Network, by Google Inc., is briefly reviewed. This is a paper by Prof. Hinton.
In this paper, the knowledge in an ensemble of models is distilled into a single model. It is a 2014 NIPS paper with over 5000 citations.
Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, $z_{i}$, computed for each class into a probability, $q_{i}$, by comparing $z_{i}$ with the other logits:

$$q_{i} = \frac{\exp(z_{i}/T)}{\sum_{j}\exp(z_{j}/T)}$$

where $T$ is a temperature that is normally set to $1$.
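The following is a minimal NumPy sketch (the function name is mine, not the paper's code) of this temperature-scaled softmax; raising $T$ softens the distribution over classes.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Convert logits z_i into probabilities q_i at temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                     # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([5.0, 2.0, -1.0])
print(softmax_with_temperature(logits, T=1.0))   # sharp, nearly one-hot
print(softmax_with_temperature(logits, T=20.0))  # much softer distribution
```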
This is useful since much of the information about the learned function resides in the ratios of very small probabilities in the soft targets.
E.g., do not collapse the cumbersome model's predictions into one-hot vectors; instead, use its softened probabilities directly as the targets when computing the cross-entropy loss.
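Here is a minimal PyTorch sketch of such a distillation loss; the helper name `distillation_loss` and the weighting factor `alpha` are my assumptions rather than the paper's code. The soft-target term is multiplied by $T^{2}$, as the paper recommends, so that its gradient magnitude stays comparable to the hard-label term.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=20.0, alpha=0.5):
    """Cross-entropy on the teacher's soft targets (at temperature T) plus
    cross-entropy on the hard labels (at T = 1)."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)       # p_i
    log_q = F.log_softmax(student_logits / T, dim=1)          # log q_i
    soft_loss = -(soft_targets * log_q).sum(dim=1).mean() * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```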
Each case in the transfer set contributes a cross-entropy gradient, $\frac{\partial C}{\partial z_{i}}$, with respect to each logit, $z_{i}$, of the distilled model.
If the cumbersome model has logits $v_{i}$, which produce soft target probabilities $p_{i}$, and the transfer training is done at a temperature of $T$, the gradient is given by:

$$\frac{\partial C}{\partial z_{i}} = \frac{1}{T}\left(q_{i} - p_{i}\right) = \frac{1}{T}\left(\frac{e^{z_{i}/T}}{\sum_{j}e^{z_{j}/T}} - \frac{e^{v_{i}/T}}{\sum_{j}e^{v_{j}/T}}\right)$$

If the temperature is high compared with the magnitude of the logits, it can be approximated as (using $e^{x}\approx 1+x$):

$$\frac{\partial C}{\partial z_{i}} \approx \frac{1}{T}\left(\frac{1 + z_{i}/T}{N + \sum_{j}z_{j}/T} - \frac{1 + v_{i}/T}{N + \sum_{j}v_{j}/T}\right)$$

Assuming that the logits have been zero-meaned separately for each transfer case, so that $\sum_{j}z_{j} = \sum_{j}v_{j} = 0$, the gradient can be further simplified to:

$$\frac{\partial C}{\partial z_{i}} \approx \frac{1}{NT^{2}}\left(z_{i} - v_{i}\right)$$

In this high-temperature limit, distillation is therefore equivalent to matching the logits of the two models, i.e. minimizing $\frac{1}{2}(z_{i} - v_{i})^{2}$.
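A quick numerical check (my own sketch, not from the paper) that the exact gradient $(q_{i} - p_{i})/T$ approaches the logit-matching form $(z_{i} - v_{i})/(NT^{2})$ once $T$ is large relative to the zero-meaned logits:

```python
import numpy as np

def soft(x, T):
    """Softmax of logits x at temperature T."""
    e = np.exp(x / T - np.max(x / T))
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=10)          # distilled-model logits
v = rng.normal(size=10)          # cumbersome-model logits
z -= z.mean()                    # zero-mean both sets of logits
v -= v.mean()
T, N = 50.0, len(z)

p, q = soft(v, T), soft(z, T)
exact_grad = (q - p) / T
approx_grad = (z - v) / (N * T**2)
print(np.max(np.abs(exact_grad - approx_grad)))  # tiny for large T
```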
It is later found that when the distilled model is much too small to capture all of the knowledge in the cumbersome model, intermediate temperatures work best.
A single large neural net with two hidden layers of 1200 rectified linear hidden units was trained on all 60,000 MNIST training cases, regularized with dropout. This net achieved 67 test errors.
A smaller net with two hidden layers of 800 rectified linear hidden units and no regularization achieved 146 errors.
If the smaller net was regularized solely by adding the additional task of matching the soft targets produced by the large net at a temperature of 20, it achieved 74 test errors.
When the distilled net had 300 or more units in each of its two hidden layers, all temperatures above 8 gave fairly similar results. But when this was radically reduced to 30 units per layer, temperatures in the range 2.5 to 4 worked significantly better than higher or lower temperatures.
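A sketch of the distilled-MNIST setup described above, assuming the hypothetical `distillation_loss` helper from earlier: a two-hidden-layer, 800-unit ReLU student trained to match the large net's soft targets at $T = 20$ (hyperparameters such as the learning rate are my own placeholders).

```python
import torch
import torch.nn as nn

# Student: two hidden layers of 800 ReLU units, no other regularization.
student = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 800), nn.ReLU(),
    nn.Linear(800, 800), nn.ReLU(),
    nn.Linear(800, 10),
)
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

def train_step(images, labels, teacher):
    with torch.no_grad():
        teacher_logits = teacher(images)     # soft targets from the large net
    loss = distillation_loss(student(images), teacher_logits, labels, T=20.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```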
Table: frame classification accuracy and Word Error Rate (WER) for the baseline, the ensemble, and the distilled single model.
An architecture with 8 hidden layers each containing 2560 rectified linear units and a final softmax layer with 14,000 labels (HMM targets $h_{t}$) is used.
The input is 26 frames of 40 Mel-scaled filterbank coefficients with a 10 ms advance per frame, and the HMM state of the 21st frame is predicted.
The total number of parameters is about 85M.
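A minimal sketch (my assumptions, not the production pipeline) of how such input windows could be assembled: a sliding window of 26 filterbank frames, labeled with the HMM state of the window's 21st frame.

```python
import numpy as np

def make_windows(filterbanks, hmm_states, context=26, target_frame=20):
    """filterbanks: (num_frames, 40) array; hmm_states: (num_frames,) int array."""
    xs, ys = [], []
    for start in range(len(filterbanks) - context + 1):
        window = filterbanks[start:start + context]      # (26, 40) block
        xs.append(window.reshape(-1))                     # 1040-dim input vector
        ys.append(hmm_states[start + target_frame])       # state of the 21st frame
    return np.stack(xs), np.array(ys)
```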
To train the DNN acoustic model we use about 2000 hours of spoken English data, which yields about 700M training examples. This system achieves a frame accuracy of 58.9%, and a Word Error Rate (WER) of 10.9% on our development set.
The ensemble gives a smaller improvement on the ultimate objective of WER (on a 23K-word test set) due to the mismatch in the objective function, but again, the improvement in WER achieved by the ensemble is transferred to the distilled model.
Table: classification accuracy (top-1) on the JFT development set.
JFT is an internal Google dataset that has 100 million labeled images with 15,000 labels.
A network like AlexNet would need about six months to train on this dataset, so waiting for several years to train an ensemble of models was not an option.
61 specialist models are trained, each with 300 classes.
At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.
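A minimal sketch of this test-time routing (a hypothetical helper, not Google's implementation): take the generalist's top-$n$ classes and run only the specialists whose class subsets overlap with them.

```python
def select_specialists(generalist_probs, specialist_class_sets, n=1):
    """generalist_probs: dict mapping class -> probability;
    specialist_class_sets: list of sets, one per specialist model."""
    top_classes = set(sorted(generalist_probs,
                             key=generalist_probs.get, reverse=True)[:n])
    # A specialist is relevant if its class subset intersects the top classes.
    return [i for i, classes in enumerate(specialist_class_sets)
            if classes & top_classes]
```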
Table: frame classification accuracy and Word Error Rate (WER) when soft targets are used as regularizers.
Soft targets allow a new model to generalize well from only 3% of the training set.
Sik-Ho Tsang. Review — Model Distillation: Distilling the Knowledge in a Neural Network (Image Classification).