PhilipMay opened this issue 7 months ago
Some customisation would be required; I don't think phi is currently supported.
mlabonne said he used mergekit for https://huggingface.co/mlabonne/phixtral-4x2_8, but I am running into the same error even when using the fork on his GitHub repo.
I think this is from an old mixtral version. When I try his exact config I get this error:
ERROR:root:Your positive and negative prompts are identical for all experts. This will not produce a functioning MoE.
ERROR:root:For each expert, `positive_prompts` must contain one or more example prompt reflecting what should be routed to that expert.
I found a fix for this:
Changing this line: https://github.com/arcee-ai/mergekit/blob/d0f5ad466ea9caaf3c997f27e1695a32d68e147f/mergekit/scripts/mixtral_moe.py#L324
to this:
MISTRAL_INFO = mergekit.architecture.PHI2_INFO_AGAIN_BUT_DIFFERENT
This solves the problem. It seems the architecture is hard-wired.
But what would a clean fix look like? @cg123
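For illustration, one possible shape of a cleaner fix (purely a sketch, not an actual patch) would be to pick the architecture info from the base model's config instead of hard-coding MISTRAL_INFO:

import mergekit.architecture

def select_arch_info(base_cfg):
    # base_cfg: the transformers config of the base model; treating
    # model_type == "phi" as the Phi-2 marker is an assumption here.
    if base_cfg.model_type == "phi":
        return mergekit.architecture.PHI2_INFO_AGAIN_BUT_DIFFERENT
    return mergekit.architecture.MISTRAL_INFO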
@NickWithBotronics this fix is even better: #150
Thank you so much! Now I can work on a project idea I had and use phi like I intended, instead of MiniChat 3B. Gratitude +100.
Also, you are remarkably quick; from my research, getting this to work has basically been an open issue/discussion since phi was released.
Well, I am not 100% sure if this really works. It just does not raise an exception anymore. If you have first results of a Phi-2 MoE model please let me know. :-)
Will do! As far as I know, mlabonne used a "custom mergekit (mixtral)". I was able to find a fork for it on his GitHub, but it didn't work when I tried it, so I believe he made a fix and possibly kept it private?
Unfortunately, I have some bad news. I tested it out and got a non-MoE safetensors file; I know because it's the same size as regular phi. I did get a config file, a merge config file, and a merges file, though. No code-stopping errors.
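In case it helps anyone reproduce this, a quick way to check whether an output safetensors file actually contains per-expert weights is to list its keys (the file name and key patterns below are assumptions; adjust them for your output):

from safetensors import safe_open

with safe_open("model-00001-of-00001.safetensors", framework="pt") as f:
    expert_keys = [k for k in f.keys() if "expert" in k or ".moe." in k]
print(len(expert_keys))  # 0 means no MoE/expert tensors were written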
So you say the model was not really merged?
Exactly. Does it matter which phi code is being used? I can test it out tomorrow. I used microsoft/phi-2 and I have two snapshots installed; no clue which one it used, or what changes were made between those snapshots.
@NickWithBotronics a bit offtopic: For research you could use tiny-llama instead of phi-2. That should work 100%.
I saw MiniChat got better benchmarks, though I haven't tested it in the real world. Still, I could cut my dataset to 1/10 the size for small tests with TinyLlama before going ham with the full dataset.
@NickWithBotronics so did you test the merged model now? Does the merge-moe from the mixtral branch work for the phi-2 architecture?
@mlabonne is your Phixtral fork public?
@fakerybakery Not yet, unfortunately; I need to find some time to work on it.
@NickWithBotronics I think you might need to tweak these lines. https://github.com/arcee-ai/mergekit/blob/dbb2eebf4ff0c21bda6069cf6de0e3e3b249f82e/mergekit/scripts/mixtral_moe.py#L337-L363
This code checks whether ".mlp." is in tensor_name and, if it is, adds a new tensor from each expert model under a rewritten key (expert_name). However, the Phi-2 model's MLP weight names, such as "model.layers.0.mlp.fc1.bias", "model.layers.0.mlp.fc1.weight", "model.layers.0.mlp.fc2.bias", and "model.layers.0.mlp.fc2.weight", don't match the patterns being replaced, so expert_name is never rewritten and remains identical to the original tensor_name. This discrepancy likely means the expert weights are not added as new tensors but instead overwrite the existing ones, which is why the final model size remains the same as the original.
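Roughly, the logic in that range looks something like the following paraphrase (a simplified sketch; names such as experts, out_tensors, and load_expert_tensor are placeholders, not the actual mergekit code):

if ".mlp." in tensor_name:
    for expert_idx, expert in enumerate(experts):
        # Mistral/Llama-style MLP names are rewritten to Mixtral expert names.
        expert_name = tensor_name.replace(
            ".mlp.gate_proj", f".block_sparse_moe.experts.{expert_idx}.w1"
        )
        expert_name = expert_name.replace(
            ".mlp.down_proj", f".block_sparse_moe.experts.{expert_idx}.w2"
        )
        expert_name = expert_name.replace(
            ".mlp.up_proj", f".block_sparse_moe.experts.{expert_idx}.w3"
        )
        # Phi-2 uses .mlp.fc1 / .mlp.fc2, so none of these patterns match:
        # expert_name stays equal to tensor_name and every expert writes to
        # the same key, which is why the output stays the size of one model.
        out_tensors[expert_name] = load_expert_tensor(expert, tensor_name)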
@Aratako so change line 338 by looking into PHI2_INFO_AGAIN_BUT_DIFFERENT instead of MISTRAL_INFO?
@v-prgmr Yes, I think that would be the correct approach. Note, though, that the repository's structure has recently changed, and the architecture definitions are now located in this JSON file: mergekit/_data/architectures/phi2.json. You would likely need to adjust the replace statements to align with the naming conventions of the Phi-2 model weights. (I'm not entirely sure whether this is the correct approach, or how exactly to rewrite them, though.)
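That said, if the custom modeling code routes the MLP through a module such as moe.mlp.<i> (similar to the Qwen2 class I describe later in this thread), the adjusted replace statements for Phi-2 might look roughly like this; the target names here are assumptions, not a tested change:

expert_name = tensor_name.replace(".mlp.fc1", f".moe.mlp.{expert_idx}.fc1")
expert_name = expert_name.replace(".mlp.fc2", f".moe.mlp.{expert_idx}.fc2")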
Has anyone been successful in merging any model with Phi-2, or even Phi-2 with itself? I've tried LaMini and Qwen 1.5 and both show the error:
KeyError: 'model.embed_tokens.weight'
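(A quick way to see which embedding-related keys a checkpoint actually exposes, since the merge expects 'model.embed_tokens.weight'; this is just a sketch, adjust the model id:)

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)
print([k for k in model.state_dict().keys() if "emb" in k])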
@cg123 maybe you can give me a hint or two on what needs to be done for phi-2; I can then try to work on fixing it and submit a PR. At the moment, I can't make head or tail of the mergekit-moe code.
Hello, I have not worked with the MoE model for Phi-2, but I have created an MoE model for Qwen2. The code for merging is available in this repository: https://github.com/Aratako/mergekit-qwen2. The script for merging Qwen2 can be found at mergekit/scripts/qwen2_moe.py (I have also made adjustments to pyproject.toml so it can be executed from the command line).
The MoE models created using this script are published here:
To use the MoE models outputted by this script, customized modeling and configuration files are required. In my case, I've added MoE-related classes to the modeling_qwen2.py and configuration_qwen2.py from the transformers library. My implementation was inspired by what's done in https://huggingface.co/mlabonne/phixtral-4x2_8. Specifically, I have added the following class:
class Qwen2MoE(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_local_experts = config.num_local_experts
        self.num_experts_per_tok = config.num_experts_per_tok
        # One Qwen2MLP per expert, plus a linear router ("gate") that scores
        # every token against every expert.
        self.mlp = nn.ModuleList(
            [Qwen2MLP(config) for i in range(self.num_local_experts)]
        )
        self.gate = nn.Linear(self.hidden_size, self.num_local_experts, bias=False)

    def forward(self, x):
        orig_shape = x.shape
        x = x.view(-1, x.shape[-1])
        # Route each token to its top-k experts and softmax the router scores.
        scores = self.gate(x)
        expert_weights, expert_indices = torch.topk(
            scores, self.num_experts_per_tok, dim=-1
        )
        expert_weights = expert_weights.softmax(dim=-1)
        flat_expert_indices = expert_indices.view(-1)
        # Duplicate each token once per selected expert, run the matching
        # expert MLP on its share of tokens, then take the weighted sum.
        x = x.repeat_interleave(self.num_experts_per_tok, dim=0)
        y = torch.empty_like(x)
        for i, expert in enumerate(self.mlp):
            y[flat_expert_indices == i] = expert(x[flat_expert_indices == i])
        y = (y.view(*expert_weights.shape, -1) * expert_weights.unsqueeze(-1)).sum(
            dim=1
        )
        return y.view(*orig_shape)
This is used in Qwen2DecoderLayer in place of the original Qwen2MLP as follows.
Original code:
self.mlp = Qwen2MLP(config) # original code
...
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
Modified code:
self.moe = Qwen2MoE(config)
...
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.moe(hidden_states)
hidden_states = residual + hidden_states
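The class above also relies on two extra config fields. On the configuration side, the addition is roughly something like this (a sketch with illustrative defaults, not my exact code):

from transformers import Qwen2Config  # inside a modified configuration_qwen2.py this class is already in the same file

class Qwen2MoEConfig(Qwen2Config):
    def __init__(self, num_local_experts=4, num_experts_per_tok=2, **kwargs):
        # Number of expert MLPs per layer, and how many are active per token.
        self.num_local_experts = num_local_experts
        self.num_experts_per_tok = num_experts_per_tok
        super().__init__(**kwargs)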
I am not an expert, so I cannot guarantee the correctness of this code, but at least the output seems to be working fine.
I believe similar modifications could enable MoE for Phi-2 as well. However, since Qwen2 has similar tensor names to llama2 and mistral, but Phi-2 does not, some additional adjustments to the hardcoded tensor names might be necessary. I hope this comment will help you.
Hey @cg123 - can you perhaps help to implement this?
@PhilipMay did you get through this?
No, but I had success with the llamafied phi-3 version. See here: https://huggingface.co/PhilipMay/Phi-3-MoE-mini-4k-instruct-raw
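For reference, the kind of mergekit-moe config I mean looks roughly like this (the model names and prompts below are placeholders, not my exact config):

base_model: <llamafied-phi-3-base>
gate_mode: hidden
dtype: bfloat16
experts:
  - source_model: <llamafied-phi-3-expert-1>
    positive_prompts:
      - "Write a Python function that ..."
  - source_model: <llamafied-phi-3-expert-2>
    positive_prompts:
      - "Summarize the following text ..."

Each expert needs its own positive_prompts, otherwise mergekit-moe raises the "positive and negative prompts are identical for all experts" error quoted earlier in this thread.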
When I merge phi-3 I get this error btw:
mergekit-moe --trust-remote-code ./phi3-merge-2.yml phi3-merge-2
configuration_phi3.py: 100%|██████████| 10.4k/10.4k [00:00<00:00, 14.7MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
ERROR:root:No output architecture found that is compatible with the given models.
ERROR:root:All supported output architectures:
ERROR:root: * Mixtral
ERROR:root: * DeepSeek MoE
ERROR:root: * Qwen MoE
@PhilipMay @316usman @fakerybakery @GiacomoLeoneMaria I was finally able to merge two phi-2 experts. If you are still looking to use mergekit for this, check out the phi2xtral branch here: https://github.com/v-prgmr/mergekit/tree/phi2xtral
Special thanks to @Aratako; I used his fork as a reference and tweaked it a little to make the Phi2MoE work.
Thanks @v-prgmr. Please also mention what your two models were fine-tuned for and what the results were.
@316usman I used two fine-tuned phi-2 models. Any phi-2 model with layer names matching the "Llama layers naming convention" works with my fork.
When I try to merge phi-2 into a MoE I get:
My config:
I am on the mixtral branch with the most recent commit.
Can you please help?