arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Trying to add Qwen-moe to mixtral_moe.py #117

Open ZhangEnmao opened 8 months ago

ZhangEnmao commented 8 months ago

Hi, I have tried to add Qwen-moe into mixtral_moe.py and made some modifications, but now I am running into a problem. I think the generated config is wrong, because auto_map should not appear in a "MixtralForCausalLM" model. However, when I delete it, the model outputs NaN. Do you know the reason? I am looking forward to your reply.
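To be concrete, by "delete it" I mean removing the auto_map entry from the merged model's config.json, roughly like the sketch below (the output path is just an example):

```python
import json

# "merged-qwen-moe" is only an example name for the merged output directory.
with open("merged-qwen-moe/config.json") as f:
    cfg = json.load(f)

print(cfg.get("architectures"))  # ["MixtralForCausalLM"]
print(cfg.get("auto_map"))       # the entry I think should not be there

# This is the deletion that makes the model start producing NaN outputs.
cfg.pop("auto_map", None)
with open("merged-qwen-moe/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```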

ZhangEnmao commented 8 months ago

Hi, sorry to bother you again. Could you tell me why mixtral_moe only accepts the Llama or Mistral architecture? Why are other models unsuitable?

cg123 commented 8 months ago

No worries, happy to help! The reason the script needs a Llama or Mistral model is because it's written to take advantage of the Mixtral architecture. Because Mixtral is essentially just Mistral with multiple MLP sections and a gate, the tensors from a Mistral model can be used without any training. (Llama works as well because they're almost exactly the same architecture.)
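To make that concrete, the core of the trick looks roughly like this - a simplified sketch using the Hugging Face Mistral/Mixtral classes, not what mixtral_moe.py literally does (it works with more experts, sharded checkpoints, gate initialization options, and so on):

```python
from transformers import MistralForCausalLM, MixtralConfig, MixtralForCausalLM

# Simplified sketch: build a Mixtral-architecture model whose experts are the
# MLPs of existing Mistral models.
base = MistralForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
donors = [base, base]  # in practice: different fine-tunes sharing this base

cfg_dict = base.config.to_dict()
cfg_dict.pop("model_type", None)
moe_cfg = MixtralConfig(**cfg_dict,
                        num_local_experts=len(donors),
                        num_experts_per_tok=2)
moe = MixtralForCausalLM(moe_cfg)

# Embeddings, final norm and LM head come straight from the base model.
moe.model.embed_tokens.load_state_dict(base.model.embed_tokens.state_dict())
moe.model.norm.load_state_dict(base.model.norm.state_dict())
moe.lm_head.load_state_dict(base.lm_head.state_dict())

for i, moe_layer in enumerate(moe.model.layers):
    # Attention and layer norms: reuse the base model's tensors unchanged.
    moe_layer.self_attn.load_state_dict(base.model.layers[i].self_attn.state_dict())
    moe_layer.input_layernorm.load_state_dict(base.model.layers[i].input_layernorm.state_dict())
    moe_layer.post_attention_layernorm.load_state_dict(
        base.model.layers[i].post_attention_layernorm.state_dict())
    # Each expert slot is one donor's MLP: w1 <- gate_proj, w2 <- down_proj, w3 <- up_proj.
    for e, donor in enumerate(donors):
        mlp = donor.model.layers[i].mlp
        expert = moe_layer.block_sparse_moe.experts[e]
        expert.w1.weight.data.copy_(mlp.gate_proj.weight.data)
        expert.w2.weight.data.copy_(mlp.down_proj.weight.data)
        expert.w3.weight.data.copy_(mlp.up_proj.weight.data)
    # moe_layer.block_sparse_moe.gate has no pretrained counterpart; the script
    # has to initialize it separately (e.g. from prompt hidden states or randomly).
```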

It's definitely possible to combine other architectures in a similar fashion, but the result won't be compatible with the Mixtral architecture. There are two basic ways to make it work. You can get creative with how you use the weights of your models, throwing some out and doing a bunch of training afterwards to rehabilitate them in the new architecture (CausalLM is a success story of this approach). Mergekit can't really support this method, as there's no easy way to automatically map the weights of an arbitrary language model architecture onto another - it really needs a human to decide that correspondence.
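For instance, even for a single donor architecture, someone has to sit down and write out that correspondence by hand, along the lines of the sketch below (the non-Mixtral tensor names here are made up in a Qwen-like style, purely for illustration):

```python
# Purely illustrative: the kind of hand-written correspondence a person has to
# decide on. The left-hand names are not taken from any real checkpoint, and
# which donor tensor plays the "gate" vs "up" role is exactly the sort of
# thing a human has to check.
HAND_WRITTEN_MAP = {
    "transformer.h.{layer}.mlp.w1.weight":     "model.layers.{layer}.block_sparse_moe.experts.{expert}.w3.weight",
    "transformer.h.{layer}.mlp.w2.weight":     "model.layers.{layer}.block_sparse_moe.experts.{expert}.w1.weight",
    "transformer.h.{layer}.mlp.c_proj.weight": "model.layers.{layer}.block_sparse_moe.experts.{expert}.w2.weight",
    # Attention, embeddings and norms need their own entries, and tensors with
    # no counterpart (e.g. a fused QKV projection, or extra biases) force a
    # judgment call: split them, drop them, or retrain around them.
}
```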

The other approach is to not use the Mixtral architecture, and instead write your own custom code to inference the resulting model. Maxime Labonne's Phixtral models are examples of this approach. Similarly, this can't really be automated. I can look at integrating new architectures as they are implemented - for example, now that Phixtral is getting some traction I'm considering extending the script to also be able to output Phixtral models. But the actual inference code I can't really help with - I'm only one person, and if I start writing custom MoE architectures for every type of model out there I'd never have time to do anything else. :)
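For a sense of what that custom code involves, the heart of it is the sparse MoE feed-forward block - something along these lines (a bare-bones PyTorch sketch of the general idea, not Phixtral's or Qwen's actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoeBlock(nn.Module):
    """Minimal top-k MoE feed-forward block: a router scores the experts for
    each token, the top-k are run, and their outputs are mixed with the
    renormalized router weights. Real implementations use gated SwiGLU-style
    expert MLPs and batched dispatch; this keeps only the structure."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch, seq_len, hidden = hidden_states.shape
        tokens = hidden_states.reshape(-1, hidden)             # (n_tokens, hidden)

        router_logits = self.gate(tokens)                      # (n_tokens, n_experts)
        weights = F.softmax(router_logits, dim=-1)
        weights, chosen = weights.topk(self.top_k, dim=-1)     # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen k

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(tokens[mask])
        return out.reshape(batch, seq_len, hidden)


# Example sizes only: run a dummy batch through the block.
block = SparseMoeBlock(hidden_size=2048, intermediate_size=5632, num_experts=4)
y = block(torch.randn(1, 8, 2048))
```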

ZhangEnmao commented 8 months ago

Oh, you are truly amazing! Your answer has been a great help to me, and I feel I have gained a deeper understanding of mergekit and MoE. If you do extend the script for the Phixtral architecture, I believe it will require some special code tied to Phixtral's model features (which is also what I would need for Qwen-moe). Currently I have only made some very simple modifications to mixtral_moe.py, but they don't give me a working Mixtral-MoE architecture, probably because they are too simplistic. I will think further about how to incorporate Qwen. Thank you for your response, and I'm looking forward to the Phixtral extension!

ZhangEnmao commented 8 months ago

Hey, bro. Good morning! I have an idea now: a dedicated Qwen-moe.py file may be necessary, just like the Qwen model has its own Qwen.py file to help load the pretrained model correctly. Do you think my idea is right?