DD-DuDa / BitDistiller

[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
MIT License
85 stars 10 forks

Guidance/plans to add encoder-decoder model support, e.g. for T5? #10

Open mustavikhan05 opened 2 days ago

mustavikhan05 commented 2 days ago

I'm trying to adapt the bitdistiller code for encoder-decoder models.

Are there any plans to add support for this? Could you provide some guidance on which parts would need adaptation?

We're running a project to test the finding in Table 5, where Llama 7B performed better as the teacher than 13B. We're testing the hypothesis you put forward across OPT models and are now expanding the experiment to encoder-decoder models. We're also running an experiment that sequentially introduces larger teachers, i.e., self-distillation followed by a larger model as the teacher for the self-distilled model; a rough outline of that setup is sketched below.
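
For clarity, the sequential setup is roughly the following. This is only a sketch: `run_distillation` is a hypothetical stand-in for one BitDistiller QAT + distillation run (not a function in this repo), and the model names are illustrative.

```python
# Sketch of the sequential-teacher experiment. `run_distillation` is a hypothetical
# placeholder for a single QAT + KD stage (whatever entry point the repo exposes).

def run_distillation(student_ckpt: str, teacher_ckpt: str, output_dir: str) -> str:
    """One distillation stage: train `student_ckpt` against `teacher_ckpt`,
    return the path of the resulting student checkpoint."""
    raise NotImplementedError  # stand-in for the actual training call

student = "llama-7b-w2-init"           # quantized student initialization (illustrative)
teachers = ["llama-7b", "llama-13b"]   # self-distillation first, then a larger teacher

for stage, teacher in enumerate(teachers):
    output_dir = f"stage{stage}-{teacher}"
    # each stage starts from the student produced by the previous stage
    student = run_distillation(student, teacher, output_dir)
```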

DD-DuDa commented 12 hours ago

It seems that all current cutting-edge models are decoder-only LLMs, so we don't have plans to support encoder-decoder models ourselves. However, it shouldn't be difficult to add: you mainly need to modify the loss calculation function. You can refer to the implementation in mytrainer.py. A rough idea of what that change might look like for a T5-style model is sketched below.
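
This is illustrative only, not the repo's exact code: `Seq2SeqKDTrainer` and the plain forward-KL `kl_divergence` below are stand-ins, and you would plug in the actual divergence from mytrainer.py instead. The key points are that passing `labels` to a T5-style model makes it build `decoder_input_ids` internally (so the logits come from the decoder), and that padding should be masked with `labels == -100` rather than the encoder-side `attention_mask`.

```python
# Minimal sketch (not the repo's code) of a distillation loss adapted for
# encoder-decoder models such as T5, inside a custom HuggingFace Trainer.
import torch
import torch.nn.functional as F
from transformers import Trainer


def kl_divergence(student_logits, teacher_logits, mask, temperature=1.0):
    """Token-level forward KL(teacher || student), averaged over non-masked positions."""
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = (t_probs * (t_probs.clamp_min(1e-9).log() - s_log_probs)).sum(-1)  # [batch, seq]
    return (kl * mask).sum() / mask.sum().clamp_min(1.0)


class Seq2SeqKDTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher_model = teacher_model.eval()  # frozen full-precision teacher

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Passing `labels` lets a T5-style model build `decoder_input_ids` itself;
        # the returned logits are then decoder logits.
        student_out = model(**inputs)
        with torch.no_grad():
            teacher_out = self.teacher_model(**inputs)

        # Mask padded decoder positions via labels == -100 instead of using the
        # encoder-side attention_mask (the main change vs. decoder-only code).
        mask = (inputs["labels"] != -100).float()
        loss = kl_divergence(student_out.logits, teacher_out.logits, mask)
        return (loss, student_out) if return_outputs else loss
```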

The teacher-size hypothesis is an interesting finding and worth exploring further. If you have any questions, feel free to contact me.