⚠️ This PR is not intended to be merged directly. Its purpose is to share features that may be useful for Metaseq ⚠️
Background
One of the main goals of our project's fork was to implement "soft" distillation (training on a set of log probabilities rather than on the correctness of a single token class) and to measure the efficacy of this technique compared to normal finetuning.
From our docs:
The motivation for training on log probabilities rather than token classes is to pass as much knowledge from the teacher to the student as possible. [... By the teacher providing] log probabilities of other tokens in the vocabulary [we expect] the student [to] better learn to represent the teacher’s knowledge.
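To make the idea concrete, here is a minimal, self-contained sketch of the difference between ordinary hard-target cross-entropy and a soft, distillation-style cross-entropy computed against teacher log probabilities. This is only an illustration of the concept, not the code in this PR; all names, shapes, and values are placeholders.

```python
# Illustrative sketch only -- not the metaseq implementation in this PR.
# Hard cross-entropy trains against a single "correct" token id per position,
# while the soft variant trains against the teacher's log-probability distribution.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50272, 4  # placeholder sizes

# Student logits for one sequence (seq_len x vocab).
student_logits = torch.randn(seq_len, vocab_size)

# Hard targets: one token id per position (ordinary finetuning).
hard_targets = torch.randint(0, vocab_size, (seq_len,))
hard_loss = F.cross_entropy(student_logits, hard_targets)

# Soft targets: teacher log-probabilities over the vocabulary.
teacher_logprobs = torch.log_softmax(torch.randn(seq_len, vocab_size), dim=-1)
student_logprobs = torch.log_softmax(student_logits, dim=-1)

# Soft cross-entropy: -sum_v p_teacher(v) * log p_student(v), averaged over positions.
soft_loss = -(teacher_logprobs.exp() * student_logprobs).sum(dim=-1).mean()

print(hard_loss.item(), soft_loss.item())
```

The soft loss gives the student a gradient signal for every token the teacher assigns probability to, rather than for a single target id per position.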
Issue
Soft Distillation was not implemented
Solution
- Add new pipeline task `streaming_distillation_language_modeling`
- Add new criterion `vocab_parallel_soft_cross_entropy` (note: *soft*)
  - Considers multiple possible predictions for each token of the target sequence (see the sketch after the parameter list below)
- Add new parameters:
```
--task streaming_distillation_language_modeling
--distillation-mode logprobs_distillation
--criterion vocab_parallel_soft_cross_entropy
```
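As a rough illustration of what a soft criterion over "multiple possible predictions per token" computes, the sketch below extends the earlier example to the case where the teacher supplies only its top-k candidate token ids and their log probabilities per target position. The top-k layout, the function name, and all shapes are assumptions for illustration; this is not the PR's actual `vocab_parallel_soft_cross_entropy` code, which additionally handles the vocab-parallel (model parallel) setting.

```python
# Conceptual sketch of a soft cross-entropy over multiple candidate tokens per
# position; names, shapes, and the top-k data layout are illustrative only.
import torch

def soft_cross_entropy_topk(student_logprobs, teacher_topk_ids, teacher_topk_logprobs):
    """student_logprobs: (seq_len, vocab) log-probabilities from the student.
    teacher_topk_ids: (seq_len, k) candidate token ids from the teacher.
    teacher_topk_logprobs: (seq_len, k) teacher log-probabilities for those ids.
    """
    # Renormalize the teacher's top-k mass so the targets form a distribution.
    teacher_probs = torch.softmax(teacher_topk_logprobs, dim=-1)
    # Gather the student's log-probabilities at the teacher's candidate ids.
    student_at_candidates = student_logprobs.gather(-1, teacher_topk_ids)
    # -sum_k p_teacher(k) * log p_student(k), averaged over positions.
    return -(teacher_probs * student_at_candidates).sum(dim=-1).mean()

# Toy usage with random placeholder data.
seq_len, vocab, k = 4, 50272, 8
student_logprobs = torch.log_softmax(torch.randn(seq_len, vocab), dim=-1)
teacher_topk_ids = torch.randint(0, vocab, (seq_len, k))
teacher_topk_logprobs = torch.log_softmax(torch.randn(seq_len, k), dim=-1)
print(soft_cross_entropy_topk(student_logprobs, teacher_topk_ids, teacher_topk_logprobs).item())
```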
Testing
Did not test
Related to #726
This feature was implemented by @anselmwang and @clarissesimoes