astramind-ai / Mixture-of-depths

Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"

training question #1

Closed · bobzhang208 closed this issue 6 months ago

bobzhang208 commented 6 months ago

Is it possible to train only the router without changing the other weights in the LLM?

mlinmg commented 6 months ago

Yes, it is theoretically possible, but our preliminary testing showed that training only the router is quite unstable. The paper calls for a full pretraining. To mitigate the instability, you can make small updates to the existing layers with techniques such as LoRA (see the sketch below). We are currently integrating this package into Llama-Factory, which will let users do exactly that.
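
A minimal sketch of both setups, assuming a PyTorch/transformers model whose MoD gating modules have "router" in their parameter names; the checkpoint name, the name filter, and the LoRA hyperparameters are illustrative assumptions, not this repo's API:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical MoD-wrapped checkpoint; substitute your own model.
model = AutoModelForCausalLM.from_pretrained("my-mod-llama")

# Option A -- router-only training: freeze everything, then unfreeze any
# parameter whose name contains "router" (adjust the filter to however
# your MoD wrapper names its gating modules).
for name, param in model.named_parameters():
    param.requires_grad = "router" in name

# Option B -- LoRA on the frozen layers plus fully trainable routers, the
# more stable setup described above. Note peft re-freezes the base model,
# so this replaces Option A rather than stacking on top of it.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # typical Llama attention projections
    modules_to_save=["router"],           # keep the gating modules trainable
)
model = get_peft_model(model, lora_config)

# Sanity check: list which tensors will actually receive gradients.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")
```

Either variant can then be trained with a standard `transformers.Trainer`; the final print is a quick check that only the intended tensors will be updated.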

mlinmg commented 6 months ago

Closing due to inactivity.