⚠️ This PR is not intended to be merged directly. Its purpose is to share features that may be useful for Metaseq ⚠️
Background
One of the main goals of our project's fork was to implement "soft" distillation (training on a set of log probabilities rather than on the correctness of a single token class) and to measure the efficacy of this technique compared to normal finetuning.
From our docs:
The motivation for training on log probabilities rather than token classes is to pass as much knowledge from the teacher to the student as possible. [... By the teacher providing] log probabilities of other tokens in the vocabulary [we expect] the student [to] better learn to represent the teacher’s knowledge.
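To make the idea concrete, here is a minimal, self-contained sketch of the difference between ordinary hard-target cross-entropy and a soft, distillation-style cross-entropy computed against teacher log probabilities. This is only an illustration of the concept, not the code in this PR; all names, shapes, and values are placeholders.

```python
# Illustrative sketch only -- not the metaseq implementation in this PR.
# Hard cross-entropy trains against a single "correct" token id per position,
# while the soft variant trains against the teacher's log-probability distribution.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50272, 4  # placeholder sizes

# Student logits for one sequence (seq_len x vocab).
student_logits = torch.randn(seq_len, vocab_size)

# Hard targets: one token id per position (ordinary finetuning).
hard_targets = torch.randint(0, vocab_size, (seq_len,))
hard_loss = F.cross_entropy(student_logits, hard_targets)

# Soft targets: teacher log-probabilities over the vocabulary.
teacher_logprobs = torch.log_softmax(torch.randn(seq_len, vocab_size), dim=-1)
student_logprobs = torch.log_softmax(student_logits, dim=-1)

# Soft cross-entropy: -sum_v p_teacher(v) * log p_student(v), averaged over positions.
soft_loss = -(teacher_logprobs.exp() * student_logprobs).sum(dim=-1).mean()

print(hard_loss.item(), soft_loss.item())
```

The soft loss gives the student a gradient signal for every token the teacher assigns probability to, rather than for a single target id per position.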
Issue
Soft Distillation was not implemented
Solution
- Add new pipeline task `streaming_distillation_language_modeling`
- Add new criterion `vocab_parallel_soft_cross_entropy` (note: *soft*)
  - Considers multiple possible predictions for each token of the target sequence (see the sketch after the parameter list below)
- Add new parameters:
```
--task streaming_distillation_language_modeling
--distillation-mode logprobs_distillation
--criterion vocab_parallel_soft_cross_entropy
```
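As a rough illustration of what a soft criterion over "multiple possible predictions per token" computes, the sketch below extends the earlier example to the case where the teacher supplies only its top-k candidate token ids and their log probabilities per target position. The top-k layout, the function name, and all shapes are assumptions for illustration; this is not the PR's actual `vocab_parallel_soft_cross_entropy` code, which additionally handles the vocab-parallel (model parallel) setting.

```python
# Conceptual sketch of a soft cross-entropy over multiple candidate tokens per
# position; names, shapes, and the top-k data layout are illustrative only.
import torch

def soft_cross_entropy_topk(student_logprobs, teacher_topk_ids, teacher_topk_logprobs):
    """student_logprobs: (seq_len, vocab) log-probabilities from the student.
    teacher_topk_ids: (seq_len, k) candidate token ids from the teacher.
    teacher_topk_logprobs: (seq_len, k) teacher log-probabilities for those ids.
    """
    # Renormalize the teacher's top-k mass so the targets form a distribution.
    teacher_probs = torch.softmax(teacher_topk_logprobs, dim=-1)
    # Gather the student's log-probabilities at the teacher's candidate ids.
    student_at_candidates = student_logprobs.gather(-1, teacher_topk_ids)
    # -sum_k p_teacher(k) * log p_student(k), averaged over positions.
    return -(teacher_probs * student_at_candidates).sum(dim=-1).mean()

# Toy usage with random placeholder data.
seq_len, vocab, k = 4, 50272, 8
student_logprobs = torch.log_softmax(torch.randn(seq_len, vocab), dim=-1)
teacher_topk_ids = torch.randint(0, vocab, (seq_len, k))
teacher_topk_logprobs = torch.log_softmax(torch.randn(seq_len, k), dim=-1)
print(soft_cross_entropy_topk(student_logprobs, teacher_topk_ids, teacher_topk_logprobs).item())
```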
Testing
Did not test
Related to #726
This feature was implemented by @anselmwang and @clarissesimoes