NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- Model Distillation: Distilling the Knowledge in a Neural Network (Image Classification). #89


NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — Model Distillation: Distilling the Knowledge in a Neural Network (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, Distilling the Knowledge in a Neural Network, by Google Inc., is briefly reviewed. This is a paper by Prof. Hinton.

In this paper, the knowledge in an ensemble of models is distilled into a single model. The paper appeared at the 2014 NIPS Deep Learning Workshop and has over 5000 citations.

NorbertZheng commented 1 year ago

Higher Temperature for Model Distillation

Higher Temperature for Soft Targets

Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, $z_{i}$, computed for each class into a probability, $q_{i}$, by comparing $z_{i}$ with the other logits:

$$q_{i} = \frac{\exp(z_{i}/T)}{\sum_{j} \exp(z_{j}/T)}$$

where $T$ is a temperature that is normally set to $1$.

Using a higher value of $T$ produces a softer probability distribution over the classes. This is useful since much of the information about the learned function resides in the ratios of very small probabilities in the soft targets.
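As a concrete illustration, here is a minimal NumPy sketch of the temperature-scaled softmax above (the function name `softmax_with_temperature` is my own, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # q_i = exp(z_i / T) / sum_j exp(z_j / T)
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([5.0, 2.0, -1.0])
print(softmax_with_temperature(logits, T=1.0))   # sharp, almost one-hot
print(softmax_with_temperature(logits, T=20.0))  # soft targets: the small-probability ratios become visible
```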

NorbertZheng commented 1 year ago

E.g., do not transform the teacher's predictions into one-hot vectors, but directly use its soft targets to compute the cross-entropy loss?
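In code, such a soft-target loss might look like the following sketch (the names are mine; this is one plausible reading of the idea, not the paper's exact recipe):

```python
import numpy as np

def softmax(x, T):
    e = np.exp(x / T - np.max(x / T))
    return e / e.sum()

def soft_target_cross_entropy(student_logits, teacher_logits, T=2.0):
    # The teacher's probabilities are used directly as (soft) targets,
    # instead of being collapsed into a one-hot vector.
    p = softmax(teacher_logits, T)   # soft targets from the cumbersome model
    q = softmax(student_logits, T)   # soft predictions from the distilled model
    return -np.sum(p * np.log(q + 1e-12))

teacher = np.array([9.0, 5.0, 1.0])
student = np.array([3.0, 2.0, 0.5])
print(soft_target_cross_entropy(student, teacher, T=2.0))
```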

NorbertZheng commented 1 year ago

The Calculation of Gradients

Each case in the transfer set contributes a cross-entropy gradient, $\frac{\partial C}{\partial z_{i}}$, with respect to each logit, $z_{i}$, of the distilled model.

If the cumbersome model has logits $v_{i}$ which produce soft target probabilities $p_{i}$, and the transfer training is done at a temperature of $T$, the gradient is given by:

$$\frac{\partial C}{\partial z_{i}} = \frac{1}{T}\left(q_{i} - p_{i}\right) = \frac{1}{T}\left(\frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}} - \frac{e^{v_{i}/T}}{\sum_{j} e^{v_{j}/T}}\right)$$

If the temperature is high compared with the magnitude of the logits, this can be approximated (using $e^{x} \approx 1 + x$) as:

$$\frac{\partial C}{\partial z_{i}} \approx \frac{1}{T}\left(\frac{1 + z_{i}/T}{N + \sum_{j} z_{j}/T} - \frac{1 + v_{i}/T}{N + \sum_{j} v_{j}/T}\right)$$

Assuming that the logits $z$ and $v$ have been zero-meaned:

$$\sum_{j} z_{j} = \sum_{j} v_{j} = 0$$

The gradient can be further simplified as:

$$\frac{\partial C}{\partial z_{i}} \approx \frac{1}{NT^{2}}\left(z_{i} - v_{i}\right)$$
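A quick numerical check of this derivation (a standalone sketch, not from the paper): compare the exact gradient $(q_{i} - p_{i})/T$ against the approximation $(z_{i} - v_{i})/(NT^{2})$ for zero-meaned logits and a large $T$.

```python
import numpy as np

def softmax(x, T):
    e = np.exp(x / T - np.max(x / T))
    return e / e.sum()

rng = np.random.default_rng(0)
N, T = 10, 1000.0                        # temperature much larger than the logit scale
z = rng.normal(size=N); z -= z.mean()    # distilled-model logits, zero-meaned
v = rng.normal(size=N); v -= v.mean()    # cumbersome-model logits, zero-meaned

exact = (softmax(z, T) - softmax(v, T)) / T   # dC/dz_i = (q_i - p_i) / T
approx = (z - v) / (N * T**2)                 # high-temperature simplification

rel_err = np.abs(exact - approx).max() / np.abs(exact).max()
print(rel_err)  # a small number; the two agree closely when T is large
```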

It is later found that when the distilled model is much too small to capture all of the knowledge in the cumbersome model, intermediate temperatures work best.

NorbertZheng commented 1 year ago

Experimental Results

MNIST

A single large neural net with two hidden layers of 1200 rectified linear hidden units was trained on all 60,000 training cases; dropout was used for regularization. This net achieved 67 test errors.

A smaller net with two hidden layers of 800 rectified linear hidden units and no regularization achieved 146 errors.

If the smaller net was regularized solely by adding the additional task of matching the soft targets produced by the large net at a temperature of 20, it achieved 74 test errors.
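A minimal sketch of such a combined objective (the weighting $\alpha$ and the helper names are my assumptions; the $T^{2}$ factor keeps the soft-target gradients on the same scale as the hard-target ones, following the paper's remark on gradient magnitudes):

```python
import numpy as np

def softmax(x, T=1.0):
    e = np.exp(x / T - np.max(x / T))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=20.0, alpha=0.5):
    # Hard term: ordinary cross-entropy against the true label (at T = 1).
    hard = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    # Soft term: cross-entropy against the teacher's soft targets at temperature T,
    # scaled by T^2 so its gradients match the scale of the hard term.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    soft = -np.sum(p * np.log(q + 1e-12))
    return alpha * hard + (1.0 - alpha) * (T ** 2) * soft

teacher = np.array([9.0, 5.0, 1.0])
student = np.array([3.0, 2.0, 0.5])
print(distillation_loss(student, teacher, hard_label=0, T=20.0, alpha=0.5))
```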

When the distilled net had 300 or more units in each of its two hidden layers, all temperatures above 8 gave fairly similar results. But when this was radically reduced to 30 units per layer, temperatures in the range 2.5 to 4 worked significantly better than higher or lower temperatures.

NorbertZheng commented 1 year ago

Speech Recognition

Table: Frame classification accuracy and Word Error Rate (WER).

An architecture with 8 hidden layers each containing 2560 rectified linear units and a final softmax layer with 14,000 labels (HMM targets $h_{t}$) is used.

The input is 26 frames of 40 Mel-scaled filterbank coefficients with a 10ms advance per frame, and we predict the HMM state of the 21st frame.

The total number of parameters is about 85M.

To train the DNN acoustic model we use about 2000 hours of spoken English data, which yields about 700M training examples. This system achieves a frame accuracy of 58.9%, and a Word Error Rate (WER) of 10.9% on our development set.

The ensemble gives a smaller improvement on the ultimate objective of WER (on a 23K-word test set) due to the mismatch in the objective function, but again, the improvement in WER achieved by the ensemble is transferred to the distilled model.

NorbertZheng commented 1 year ago

JFT

Table: Classification accuracy (top 1) on the JFT development set.

JFT is an internal Google dataset that has 100 million labeled images with 15,000 labels.

AlexNet took about 6 months to train. Waiting for several years to train an ensemble of models was not an option.

61 specialist models are trained, each with 300 classes.

At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.
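A rough sketch of that selection step (hypothetical data structures; the paper describes the idea in prose): take the generalist's top-n classes and run only the specialists whose class subsets overlap them.

```python
import numpy as np

def select_specialists(generalist_probs, specialist_classes, n=5):
    """Pick the specialists whose class subset intersects the generalist's top-n classes.

    generalist_probs:   array of class probabilities from the generalist model
    specialist_classes: list of sets, one set of class indices per specialist
    """
    top_n = set(np.argsort(generalist_probs)[-n:])
    return [k for k, classes in enumerate(specialist_classes) if classes & top_n]

# Toy usage: 3 specialists over a 10-class generalist.
probs = np.array([0.01, 0.30, 0.02, 0.25, 0.01, 0.05, 0.20, 0.10, 0.03, 0.03])
specialists = [{1, 2, 3}, {4, 5}, {6, 7, 8}]
print(select_specialists(probs, specialists, n=3))  # -> [0, 2]
```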

NorbertZheng commented 1 year ago

Soft Targets as Regularizers

Table: Frame classification accuracy and Word Error Rate (WER).

Soft targets allow a new model to generalize well from only 3% of the training set.

NorbertZheng commented 1 year ago

Reference