I would like to distill SmolLM-360M-Instruct with another multilingual Llama model as the teacher. While both models are based on the same architecture (Llama 2), the multilingual model has a vocabulary quite different from SmolLM-360M-Instruct's. Which distillation approach should I use: logit-based or hidden-state-based? Also, would it be possible to increase the number of tokens in the student model's (SmolLM-360M) vocabulary for better generation?
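My worry with logit-based distillation is that the KL divergence is computed over the vocabulary dimension, and the two models' vocabularies (and hence logit dimensions) don't line up. To make the first question concrete, here is a minimal sketch of the hidden-state approach I have in mind. The projection layer, the teacher hidden size, and the truncation trick are all my own assumptions, not from any particular recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed dimensions: 960 for SmolLM-360M's hidden size, 4096 for the
# teacher. Both are placeholders; check the actual configs.
STUDENT_DIM, TEACHER_DIM = 960, 4096

# Learnable projection so the student's hidden states can be compared
# against the teacher's despite the size mismatch.
proj = nn.Linear(STUDENT_DIM, TEACHER_DIM)

def hidden_state_loss(student_hidden, teacher_hidden):
    """MSE between projected student states and teacher states.

    Assumes both sequences were produced from the SAME text. Because the
    tokenizers differ, the sequence lengths generally won't match, and
    this is the part I'm unsure how to handle cleanly.
    """
    projected = proj(student_hidden)  # (batch, seq_s, TEACHER_DIM)
    # Naive workaround: truncate to the shorter sequence. Probably not
    # ideal, since token boundaries won't be aligned either.
    seq = min(projected.size(1), teacher_hidden.size(1))
    return F.mse_loss(projected[:, :seq], teacher_hidden[:, :seq])
```

Is something like this the right direction, or is there a standard way to align the sequences across tokenizers?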
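And for the second question, this is roughly what I would try for growing the vocabulary, using the standard `transformers` APIs. The model id is my guess at the Hub checkpoint and the new tokens are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id; swap in whatever checkpoint you actually use.
name = "HuggingFaceTB/SmolLM-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Hypothetical new tokens, e.g. frequent subwords of the target language.
new_tokens = ["new_tok_1", "new_tok_2"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix (and tied LM head) to cover the new ids.
# The new rows are randomly initialized, so they would need training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```

Would the randomly initialized embeddings for the new tokens cause problems during distillation, or can they simply be learned along with everything else?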