Open mustavikhan05 opened 2 days ago
I'm trying to adapt the BitDistiller code for encoder-decoder models. Are there any plans to add support for this? Could you provide some guidance on which parts need adaptation?

We're running a project to test the finding in Table 5, where Llama 7B performed better as the teacher than 13B. We're testing the hypothesis you put forward across OPT models and are now expanding the experiment to encoder-decoder models. We're also running an experiment that introduces larger teachers sequentially, i.e. self-distillation followed by distillation of the self-distilled model from a bigger teacher.

Since all current cutting-edge models are decoder-only LLMs, we don't currently plan to support encoder-decoder models. However, it should not be difficult to add: you mainly need to modify the loss calculation function. You can refer to the implementation in mytrainer.py.

The hypothesis is an interesting finding and worth further exploration. If you have any questions, feel free to contact me.
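For the encoder-decoder case, here is a minimal sketch of what such a modified loss calculation might look like, assuming a Hugging Face `Seq2SeqTrainer`-style setup. `EncoderDecoderKDTrainer`, `teacher_model`, and `kd_temperature` are illustrative names, and a plain temperature-scaled forward KL stands in for whichever KD objective mytrainer.py actually implements; this is not BitDistiller's code. The main difference from the decoder-only path is that both forward passes take encoder inputs plus decoder labels rather than a single causal sequence, so masking is done over decoder target positions.

```python
import torch
import torch.nn.functional as F
from transformers import Seq2SeqTrainer


class EncoderDecoderKDTrainer(Seq2SeqTrainer):
    """Hypothetical KD trainer for an encoder-decoder student/teacher pair."""

    def __init__(self, *args, teacher_model=None, kd_temperature=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        # Frozen full-precision teacher; assumed to already sit on the same device.
        self.teacher_model = teacher_model.eval()
        self.kd_temperature = kd_temperature

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Student forward pass: HF seq2seq models route `labels` through the
        # decoder automatically, so one call yields the decoder logits.
        outputs = model(**inputs)
        student_logits = outputs.logits

        with torch.no_grad():
            teacher_logits = self.teacher_model(**inputs).logits

        # Ignore padded target positions (label == -100 by HF convention).
        mask = inputs["labels"].ne(-100).unsqueeze(-1).float()

        t = self.kd_temperature
        # Temperature-scaled forward KL between teacher and student distributions.
        per_token_kl = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="none",
        ).sum(dim=-1, keepdim=True)
        kd_loss = (per_token_kl * mask).sum() / mask.sum() * (t ** 2)

        return (kd_loss, outputs) if return_outputs else kd_loss
```

Swapping in a different divergence (or the objective used in mytrainer.py) should only require changing the few lines that compute `per_token_kl`.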