Open mustavikhan05 opened 2 days ago
I'm trying to adapt the BitDistiller code for encoder-decoder models. Are there any plans to add support for this? Could you provide some guidance on which parts need adaptation?

We're running a project to test the finding in Table 5, where Llama 7B performed better as the teacher than 13B. We're testing the hypothesis you put forward across OPT models and are now expanding the experiment to encoder-decoder models. We're also running an experiment that introduces larger teachers sequentially, i.e. self-distillation followed by distillation of the self-distilled model from a bigger teacher.

Since all current cutting-edge models are decoder-only LLMs, we don't currently plan to support encoder-decoder models. However, it should not be difficult to add: you mainly need to modify the loss calculation function. You can refer to the implementation in mytrainer.py.

The hypothesis is an interesting finding and worth further exploration. If you have any questions, feel free to contact me.
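For the encoder-decoder case, here is a minimal sketch of what such a modified loss calculation might look like, assuming a Hugging Face `Seq2SeqTrainer`-style setup. `EncoderDecoderKDTrainer`, `teacher_model`, and `kd_temperature` are illustrative names, and a plain temperature-scaled forward KL stands in for whichever KD objective mytrainer.py actually implements; this is not BitDistiller's code. The main difference from the decoder-only path is that both forward passes take encoder inputs plus decoder labels rather than a single causal sequence, so masking is done over decoder target positions.

```python
import torch
import torch.nn.functional as F
from transformers import Seq2SeqTrainer


class EncoderDecoderKDTrainer(Seq2SeqTrainer):
    """Hypothetical KD trainer for an encoder-decoder student/teacher pair."""

    def __init__(self, *args, teacher_model=None, kd_temperature=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        # Frozen full-precision teacher; assumed to already sit on the same device.
        self.teacher_model = teacher_model.eval()
        self.kd_temperature = kd_temperature

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Student forward pass: HF seq2seq models route `labels` through the
        # decoder automatically, so one call yields the decoder logits.
        outputs = model(**inputs)
        student_logits = outputs.logits

        with torch.no_grad():
            teacher_logits = self.teacher_model(**inputs).logits

        # Ignore padded target positions (label == -100 by HF convention).
        mask = inputs["labels"].ne(-100).unsqueeze(-1).float()

        t = self.kd_temperature
        # Temperature-scaled forward KL between teacher and student distributions.
        per_token_kl = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="none",
        ).sum(dim=-1, keepdim=True)
        kd_loss = (per_token_kl * mask).sum() / mask.sum() * (t ** 2)

        return (kd_loss, outputs) if return_outputs else kd_loss
```

Swapping in a different divergence (or the objective used in mytrainer.py) should only require changing the few lines that compute `per_token_kl`.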