Open ojus1 opened 10 months ago
Mistral-7b is a much better model (and perhaps a better teacher) than Llama-2-7b. Would you kindly release checkpoints for a distilled Mistral? Would greatly appreciate it!

Thanks for your interest; we will consider using mistral-7b as an alternative teacher.
However, we are concerned that mistral-7b might make little difference compared with llama-2-7b, since we cannot tell which pretraining data mistral-7b was trained on, and the data used for distillation strongly affects the results.
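For context on why the teacher and its data matter: swapping the teacher changes the soft targets the student is trained to match. Below is a minimal, dependency-free sketch of the usual temperature-scaled soft-label distillation loss (Hinton-style KD); it is an illustration only, not this repo's actual training code, and real training would operate on logit tensors rather than Python lists.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The teacher's distribution is the training signal, which is why a
    different teacher (e.g. mistral-7b vs. llama-2-7b) can shift what
    the student learns even on identical distillation data.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

When the student already matches the teacher the loss is zero; any disagreement between their distributions yields a positive penalty that the student minimizes.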