GeneZC / MiniMA

Code for the paper "Towards the Law of Capacity Gap in Distilling Language Models"
Apache License 2.0

Distill Mistral 7B? #3

Open ojus1 opened 6 months ago

ojus1 commented 6 months ago

Mistral-7B is a much better model (and perhaps a better teacher) than Llama-2-7B. Would you kindly release checkpoints for a distilled Mistral? I would greatly appreciate it!

GeneZC commented 6 months ago

Thanks for your interest! We will consider using Mistral-7B as an alternative teacher.

However, we are concerned that Mistral-7B may make little difference compared to Llama-2-7B, since we cannot tell which pretraining data Mistral-7B was trained on, and the data used for distillation largely affects the results.
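
To make concrete what swapping the teacher would involve, below is a minimal sketch of plain logit distillation with Mistral-7B as the teacher. The model names, temperature, and the assumption that teacher and student share a tokenizer/vocabulary are all illustrative; this is not our actual training code, and the choice of `batch_texts` is exactly the distillation-data issue mentioned above.

```python
# Minimal logit-distillation sketch (illustrative only, not the repo's training code).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "mistralai/Mistral-7B-v0.1"  # assumed alternative teacher
student_name = "path/to/student-checkpoint"  # hypothetical student placeholder

# Assumes the teacher and student share a tokenizer/vocabulary;
# in practice the vocabularies would need to be aligned first.
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
tokenizer.pad_token = tokenizer.eos_token

teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.bfloat16).eval()
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)

def distill_step(batch_texts, temperature=2.0):
    # The distillation corpus fed in here is what we expect to dominate the outcome.
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        t_logits = teacher(**inputs).logits
    s_logits = student(**inputs).logits
    # KL divergence between temperature-softened teacher and student token distributions.
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss
```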