arcee-ai / DistillKit

An Open Source Toolkit For LLM Distillation
GNU Affero General Public License v3.0

Models with same architecture but different tokenizer #17

Open bil-ash opened 2 days ago

bil-ash commented 2 days ago

I would like to distill smollm-360m-instruct with a multilingual Llama model as the teacher. While both models share the same architecture (Llama 2), the multilingual model's vocabulary is quite different from that of smollm-360m-instruct. Which distillation method should I use: logit-based or hidden-state-based? Also, would it be possible to increase the number of tokens in the student model (smollm-360m) for better generation? A rough sketch of what I have in mind is below.
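For context, here is a minimal sketch (not DistillKit's actual API) of the two ideas in the question: hidden-state distillation compares intermediate representations rather than logits, so the mismatched vocabularies never need to line up, and the student's embedding matrix can be grown to accept new tokens. It assumes Hugging Face transformers; the teacher model name, the pooling strategy, and the projection layer are illustrative placeholders.

```python
# Hypothetical sketch, not DistillKit's API. Shows (a) hidden-state distillation
# across different tokenizers and (b) growing the student vocabulary.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "HuggingFaceTB/SmolLM-360M-Instruct"   # student
teacher_name = "your-multilingual-llama"               # placeholder teacher id

student_tok = AutoTokenizer.from_pretrained(student_name)
teacher_tok = AutoTokenizer.from_pretrained(teacher_name)

student = AutoModelForCausalLM.from_pretrained(student_name, output_hidden_states=True)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, output_hidden_states=True)
teacher.eval()

# (b) Optionally add tokens the student is missing; the new embedding rows are
# randomly initialized and must be trained before they are useful.
new_tokens = [t for t in teacher_tok.get_vocab() if t not in student_tok.get_vocab()]
student_tok.add_tokens(new_tokens)
student.resize_token_embeddings(len(student_tok))

# (a) Hidden-state distillation: compare last-layer hidden states instead of
# logits. A linear projection bridges the (assumed) different hidden sizes.
proj = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)

def hidden_state_loss(text: str) -> torch.Tensor:
    # Each model tokenizes with its own tokenizer, so sequence lengths differ;
    # mean-pooling over the sequence is a crude way to align them.
    s_in = student_tok(text, return_tensors="pt")
    t_in = teacher_tok(text, return_tensors="pt")
    s_hidden = student(**s_in).hidden_states[-1].mean(dim=1)      # [1, d_student]
    with torch.no_grad():
        t_hidden = teacher(**t_in).hidden_states[-1].mean(dim=1)  # [1, d_teacher]
    return nn.functional.mse_loss(proj(s_hidden), t_hidden)
```

This is only meant to illustrate why a vocabulary mismatch is less of a problem for hidden-state-based distillation than for logit-based distillation; a real setup would need a proper token-level alignment rather than mean pooling.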