facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Knowledge distillation on Transformer #570

Closed sugeeth14 closed 2 years ago

sugeeth14 commented 5 years ago

Hi, I trained a Transformer model for English to German translation using the instructions presented here. Now I want to train a smaller model using the knowledge distillation approach mentioned in this paper. Is such a thing supported in fairseq? If not, how do I get the logits or soft targets from my model so that I can train a smaller model, and is this possible in my case? If anyone has tried KD on a Transformer, any ideas or suggestions are welcome. Thank you.
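
For context, word-level knowledge distillation trains the student on the teacher's softened output distribution in addition to the usual cross-entropy on the gold targets. fairseq does not expose this as a ready-made criterion in this form; the sketch below is a generic PyTorch illustration, where the logit/target shapes, `temperature`, `alpha`, and `pad_idx` are assumptions you would adapt to your own setup.

```python
# A minimal, generic PyTorch sketch of word-level knowledge distillation,
# not a fairseq API. Assumed shapes: (batch, tgt_len, vocab) for logits,
# (batch, tgt_len) for targets; temperature, alpha and pad_idx are assumptions.
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, targets,
            temperature=2.0, alpha=0.5, pad_idx=1):
    """Mix cross-entropy on gold targets with KL to the teacher's soft targets."""
    vocab = student_logits.size(-1)
    student_logits = student_logits.reshape(-1, vocab)
    teacher_logits = teacher_logits.reshape(-1, vocab)
    targets = targets.reshape(-1)

    # Standard cross-entropy against the reference translation (pads ignored).
    ce = F.cross_entropy(student_logits, targets, ignore_index=pad_idx)

    # KL divergence between softened student and teacher distributions.
    # For brevity, padded positions are not masked out of the KL term here.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * kl + (1.0 - alpha) * ce
```

For NMT specifically, the more common recipe is sequence-level distillation, where the teacher decodes the training set and the student is trained on those outputs; see the comments below.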

spprabhu commented 4 years ago

Hi @Raghava14, were you able to obtain a KD model for En-De? If yes, how?

Ir1d commented 4 years ago

Hi, I'm trying to use KD for en-de translation. The docs say to decode the training set to produce a distillation dataset. Could you give me some hint on how to obtain this distillation dataset after I train my model?

RamoramaInteractive commented 2 years ago

Hi @Ir1d, this is exactly the same problem I've encountered. There's no hint on how to decode the training set to produce a distillation dataset.
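
For later readers: "decoding the training set" here refers to sequence-level knowledge distillation (Kim & Rush, 2016): translate the training source side with the trained teacher and use its beam outputs as the new target side for the student. A rough sketch with fairseq-generate is below; the data-bin directory, checkpoint path, language pair, and beam/batch settings are placeholders for your own setup, and the hypotheses keep whatever BPE the teacher was trained with.

```bash
# Decode the *training* split with the teacher (--gen-subset defaults to "test").
fairseq-generate data-bin/wmt16_en_de \
    --path checkpoints/teacher/checkpoint_best.pt \
    --gen-subset train \
    --beam 5 --batch-size 128 \
    > train.teacher.out

# fairseq-generate tags each line: S- source, T- reference, H-/D- hypotheses.
# S- and H- lines for the same sentence appear together, so grepping them out
# yields a parallel corpus of source sentences and teacher translations.
grep ^S train.teacher.out | cut -f2- > distill.en
grep ^H train.teacher.out | cut -f3- > distill.de

# Re-binarize distill.en/distill.de with fairseq-preprocess and train the
# smaller student model on this distilled parallel data as usual.
```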

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!