facebookresearch / metaseq

Repo for external large-scale work

feat: add soft distillation #736

Open mattmazzola opened 1 year ago

mattmazzola commented 1 year ago

⚠️ This PR is not intended to be merged directly. Its purpose is to share features that may be useful for Metaseq. ⚠️

Background

One of the main goals of our project's fork was to implement "soft" distillation (training on a set of log probabilities rather than on the correctness of a single token class) and to measure the efficacy of this technique compared to normal finetuning.

From our docs:

The motivation for training on log probabilities rather than token classes is to pass as much knowledge from the teacher to the student as possible. [... By the teacher providing] log probabilities of other tokens in the vocabulary, [we expect] the student to better learn to represent the teacher's knowledge.
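For context, a minimal sketch of what such a soft distillation loss could look like, assuming teacher log probabilities over the full vocabulary are available for each position. The function name, tensor shapes, and `temperature` parameter here are illustrative assumptions, not the actual metaseq implementation from this PR:

```python
import torch
import torch.nn.functional as F


def soft_distillation_loss(
    student_logits: torch.Tensor,
    teacher_logprobs: torch.Tensor,
    temperature: float = 1.0,
) -> torch.Tensor:
    """KL divergence between the teacher and student token distributions.

    student_logits:   (batch, seq_len, vocab) raw student outputs
    teacher_logprobs: (batch, seq_len, vocab) precomputed teacher log probabilities
    """
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student); log_target=True because the teacher
    # distribution is already in log space. "batchmean" normalizes by
    # the leading batch dimension.
    loss = F.kl_div(
        student_logprobs,
        teacher_logprobs,
        reduction="batchmean",
        log_target=True,
    )
    # Standard temperature correction so gradients keep a comparable
    # scale across temperature settings.
    return loss * temperature**2
```

Contrast this with ordinary finetuning, where the target is a single correct token id per position and the loss is cross-entropy against that one class; here the student receives the teacher's full distribution over the vocabulary at every position.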

Issue

Solution

Testing

Did not test

Related to #726

This feature was implemented by @anselmwang and @clarissesimoes