arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Error merging phi-2 to MoE #145

Open PhilipMay opened 7 months ago

PhilipMay commented 7 months ago

When I try to merge phi-2 into a MoE I get:

$ mergekit-moe phi2_moe2.yml out_phi2_moe
Fetching 10 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11278.04it/s]
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 8683.86it/s]
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 7469.82it/s]
Warm up loaders: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.55it/s]
Traceback (most recent call last):
  File "/users/philip/miniconda3/envs/mergekit/bin/mergekit-moe", line 8, in <module>
    sys.exit(main())
  File "/users/philip/miniconda3/envs/mergekit/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/users/philip/miniconda3/envs/mergekit/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/users/philip/miniconda3/envs/mergekit/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/users/philip/miniconda3/envs/mergekit/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/users/philip/code/git/mergekit/mergekit/options.py", line 76, in wrapper
    f(*args, **kwargs)
  File "/users/philip/code/git/mergekit/mergekit/scripts/mixtral_moe.py", line 453, in main
    build(
  File "/users/philip/code/git/mergekit/mergekit/scripts/mixtral_moe.py", line 326, in build
    tensor = base_loader.get_tensor(tensor_name)
  File "/users/philip/code/git/mergekit/mergekit/io/lazy_tensor_loader.py", line 127, in get_tensor
    raise KeyError(key)
KeyError: 'model.norm.weight'

My config:

#$ less phi2_moe2.yml 
base_model: microsoft/phi-2
gate_mode: random # one of "hidden", "cheap_embed", or "random"
#dtype: float16 # output dtype (float32, float16, or bfloat16)
experts:
  - source_model: microsoft/phi-2
    positive_prompts: []
  - source_model: microsoft/phi-2
    positive_prompts: []

I am on the mixtral branch with the most recent commit.

Can you please help?

GiacomoLeoneMaria commented 7 months ago

Customisation would be required; I don't think Phi is currently supported.

NicolasMejiaPetit commented 7 months ago

https://huggingface.co/mlabonne/phixtral-4x2_8 mlabonne said he used mergekit, but I am running into the same error, even when using the fork on his GitHub repo.

PhilipMay commented 6 months ago

https://huggingface.co/mlabonne/phixtral-4x2_8 mlabonne said he used mergekit, but I am running into the same error, even when using the fork on his GitHub repo.

I think this is from an old mixtral version. When I try his exact config I get this error:

ERROR:root:Your positive and negative prompts are identical for all experts. This will not produce a functioning MoE.
ERROR:root:For each expert, `positive_prompts` must contain one or more example prompt reflecting what should be routed to that expert.

PhilipMay commented 6 months ago

I found a fix for this:

Changing this line: https://github.com/arcee-ai/mergekit/blob/d0f5ad466ea9caaf3c997f27e1695a32d68e147f/mergekit/scripts/mixtral_moe.py#L324

to this:

    MISTRAL_INFO = mergekit.architecture.PHI2_INFO_AGAIN_BUT_DIFFERENT

This solves the problem. It seems like the architecture is hard-wired.

But what would a clean fix look like? @cg123

PhilipMay commented 6 months ago

@NickWithBotronics this fix is even better: #150

NicolasMejiaPetit commented 6 months ago

@NickWithBotronics this fix is even better: #150

Thank you so much! Now I can work on a project idea I had and use Phi like I intended, instead of MiniChat 3B. Gratitude +100.

Also, you are remarkably quick. From my research, getting this to work has basically been an open issue/discussion since Phi was released.

PhilipMay commented 6 months ago

Well, I am not 100% sure if this really works. It just does not raise an exception anymore. If you have first results of a Phi-2 MoE model please let me know. :-)

NicolasMejiaPetit commented 6 months ago

Well, I am not 100% sure if this really works. It just does not raise an exception anymore. If you have first results of a Phi-2 MoE model please let me know. :-)

Will do! As far as I know, mlabonne used a "custom mergekit (mixtral)". I was able to find a fork for it on his GitHub, but it didn't work when I tried it, so I believe he made a fix and possibly kept it private?

NicolasMejiaPetit commented 6 months ago

Well, I am not 100% sure if this really works. It just does not raise an exception anymore. If you have first results of a Phi-2 MoE model please let me know. :-)

Unfortunately, I have some bad news. I tested it out and got a non-MoE safetensors file; I know because it is the same size as regular phi-2. I did get a config file, a merge config file, and a merges file, though, and no errors that stopped the code.

PhilipMay commented 6 months ago

Well, I am not 100% sure if this really works. It just does not raise an exception anymore. If you have first results of a Phi-2 MoE model please let me know. :-)

Unfortunately, I have some bad news. I tested it out and got a non-MoE safetensors file; I know because it is the same size as regular phi-2. I did get a config file, a merge config file, and a merges file, though, and no errors that stopped the code.

So you say the model was not really merged?

NicolasMejiaPetit commented 6 months ago

(image) Exactly. Does it matter what code Phi is using? I can test it out tomorrow. I used microsoft/phi-2, and I have two snapshots installed; no clue which one it used or what changes were made in those snapshots.

PhilipMay commented 6 months ago

@NickWithBotronics a bit offtopic: For research you could use tiny-llama instead of phi-2. That should work 100%.

NicolasMejiaPetit commented 6 months ago

@NickWithBotronics a bit offtopic: For research you could use tiny-llama instead of phi-2. That should work 100%.

I saw that MiniChat got better benchmarks, though I haven't tested it in the real world. Still, I could split my dataset down to 1/10 the size for small-scale testing with TinyLlama before going ham with the full dataset.

cariad-v commented 6 months ago

(image) Exactly. Does it matter what code Phi is using? I can test it out tomorrow. I used microsoft/phi-2, and I have two snapshots installed; no clue which one it used or what changes were made in those snapshots.

@NickWithBotronics So did you test the merged model now? Does mergekit-moe from the mixtral branch work for the phi-2 architecture?

fakerybakery commented 6 months ago

@mlabonne is your Phixtral fork public?

mlabonne commented 6 months ago

@fakerybakery Still not, unfortunately; I need to find some time to work on it.

Aratako commented 6 months ago

Well, I am not 100% sure if this really works. It just does not raise an exception anymore. If you have first results of a Phi-2 MoE model please let me know. :-)

Unfortunately, I have some bad news. I tested it out and got a non-MoE safetensors file; I know because it is the same size as regular phi-2. I did get a config file, a merge config file, and a merges file, though, and no errors that stopped the code.

@NickWithBotronics I think you might need to tweak these lines. https://github.com/arcee-ai/mergekit/blob/dbb2eebf4ff0c21bda6069cf6de0e3e3b249f82e/mergekit/scripts/mixtral_moe.py#L337-L363

This code checks whether ".mlp." is in tensor_name, and if it is, it adds a new tensor from each expert model by replacing the tensor name. However, the weight names of the Phi-2 model's MLP layers, such as "model.layers.0.mlp.fc1.bias", "model.layers.0.mlp.fc1.weight", "model.layers.0.mlp.fc2.bias", and "model.layers.0.mlp.fc2.weight", mean that expert_name is never actually rewritten and remains the same as the original tensor_name. This likely means the new weights are not added as intended but instead overwrite the existing weights, so the final model size remains the same as the original.
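
To illustrate the mismatch, here is a paraphrased sketch, not the exact mergekit code; the Phi-2-aware naming scheme in the second helper is hypothetical and only loosely mirrors phixtral's moe.mlp.<i> layout:

def mixtral_style_expert_name(tensor_name: str, expert_idx: int) -> str:
    # Mixtral-style renaming assumes Llama/Mistral MLP names (gate_proj/up_proj/down_proj).
    name = tensor_name.replace(".mlp.gate_proj", f".block_sparse_moe.experts.{expert_idx}.w1")
    name = name.replace(".mlp.down_proj", f".block_sparse_moe.experts.{expert_idx}.w2")
    name = name.replace(".mlp.up_proj", f".block_sparse_moe.experts.{expert_idx}.w3")
    return name

# Phi-2 MLP weights contain none of those substrings, so the name comes back
# unchanged and every expert writes to the same key:
print(mixtral_style_expert_name("model.layers.0.mlp.fc1.weight", 1))
# -> model.layers.0.mlp.fc1.weight

def phi2_style_expert_name(tensor_name: str, expert_idx: int) -> str:
    # Hypothetical Phi-2-aware renaming: give each expert its own fc1/fc2 keys.
    name = tensor_name.replace(".mlp.fc1", f".moe.mlp.{expert_idx}.fc1")
    name = name.replace(".mlp.fc2", f".moe.mlp.{expert_idx}.fc2")
    return name

print(phi2_style_expert_name("model.layers.0.mlp.fc1.weight", 1))
# -> model.layers.0.moe.mlp.1.fc1.weight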

v-prgmr commented 6 months ago

@Aratako So change line 338 to look at PHI2_INFO_AGAIN_BUT_DIFFERENT instead of MISTRAL_INFO?

Aratako commented 6 months ago

@v-prgmr Yes, I think that would be the correct approach. Though, the repository's structure has recently undergone changes, and the architecture definitions are now located in this JSON file: mergekit/_data/architectures/phi2.json

You would likely need to adjust the replace statements to align with the naming conventions of the Phi-2 model weights. (I'm not entirely sure if this is the correct approach, or how exactly to rewrite them, though.)

316usman commented 5 months ago

Has anyone been successful in merging any model with Phi-2, or even with itself? I've tried LaMini and Qwen 1.5, and both show the error KeyError: 'model.embed_tokens.weight'.

v-prgmr commented 5 months ago

@cg123 Can you maybe give me a hint or two on what needs to be done for phi-2? I can then try to work on fixing this and submit a PR. At the moment, I can't make head or tail of mergekit-moe.

Aratako commented 5 months ago

Hello, I have not worked with the MoE model for Phi-2, but I have created an MoE model for Qwen2. The code for merging is available in this repository: https://github.com/Aratako/mergekit-qwen2. The script for merging Qwen2 can be found at mergekit/scripts/qwen2_moe.py (I have also made adjustments to pyproject.toml so it can be executed from the command line).

The MoE models created using this script are published here:

To use the MoE models outputted by this script, customized modeling and configuration files are required. In my case, I've added MoE-related classes to the modeling_qwen2.py and configuration_qwen2.py from the transformers library. My implementation was inspired by what's done in https://huggingface.co/mlabonne/phixtral-4x2_8. Specifically, I have added the following class:

class Qwen2MoE(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_local_experts = config.num_local_experts
        self.num_experts_per_tok = config.num_experts_per_tok
        # One Qwen2MLP per expert, plus a linear gate that scores experts per token.
        self.mlp = nn.ModuleList(
            [Qwen2MLP(config) for i in range(self.num_local_experts)]
        )
        self.gate = nn.Linear(self.hidden_size, self.num_local_experts, bias=False)

    def forward(self, x):
        orig_shape = x.shape
        x = x.view(-1, x.shape[-1])

        # Select the top-k experts per token and normalize their routing weights.
        scores = self.gate(x)
        expert_weights, expert_indices = torch.topk(
            scores, self.num_experts_per_tok, dim=-1
        )
        expert_weights = expert_weights.softmax(dim=-1)
        flat_expert_indices = expert_indices.view(-1)

        # Run each token through its selected experts and sum the weighted outputs.
        x = x.repeat_interleave(self.num_experts_per_tok, dim=0)
        y = torch.empty_like(x)
        for i, expert in enumerate(self.mlp):
            y[flat_expert_indices == i] = expert(x[flat_expert_indices == i])
        y = (y.view(*expert_weights.shape, -1) * expert_weights.unsqueeze(-1)).sum(
            dim=1
        )
        return y.view(*orig_shape)

This is used in Qwen2DecoderLayer in place of the original Qwen2MLP as follows.

Original code:

self.mlp = Qwen2MLP(config) # original code
...
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states

Modified code:

self.moe = Qwen2MoE(config)
...
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.moe(hidden_states)
hidden_states = residual + hidden_states

I am not an expert, so I cannot guarantee the correctness of this code, but at least the output seems to be working fine.

I believe similar modifications could enable MoE for Phi-2 as well. However, since Qwen2 has tensor names similar to Llama 2 and Mistral but Phi-2 does not, some additional adjustments to the hardcoded tensor names might be necessary (a sketch follows). I hope this comment helps you.
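
For illustration, a Phi-2 analogue of the Qwen2MoE class above might look like the untested sketch below. It assumes transformers' PhiMLP from modeling_phi.py and that num_local_experts / num_experts_per_tok have been added to the Phi-2 config; it is not the phixtral implementation itself.

import torch
import torch.nn as nn
from transformers.models.phi.modeling_phi import PhiMLP  # assumes a recent transformers with the Phi port


class PhiMoE(nn.Module):
    # Untested sketch: same routing logic as the Qwen2MoE class above, with PhiMLP experts.
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_local_experts = config.num_local_experts  # assumed custom config field
        self.num_experts_per_tok = config.num_experts_per_tok  # assumed custom config field
        self.mlp = nn.ModuleList(
            [PhiMLP(config) for _ in range(self.num_local_experts)]
        )
        self.gate = nn.Linear(self.hidden_size, self.num_local_experts, bias=False)

    def forward(self, x):
        orig_shape = x.shape
        x = x.view(-1, x.shape[-1])

        # Top-k routing, identical to the Qwen2 version above.
        scores = self.gate(x)
        expert_weights, expert_indices = torch.topk(
            scores, self.num_experts_per_tok, dim=-1
        )
        expert_weights = expert_weights.softmax(dim=-1)
        flat_expert_indices = expert_indices.view(-1)

        x = x.repeat_interleave(self.num_experts_per_tok, dim=0)
        y = torch.empty_like(x)
        for i, expert in enumerate(self.mlp):
            y[flat_expert_indices == i] = expert(x[flat_expert_indices == i])
        y = (y.view(*expert_weights.shape, -1) * expert_weights.unsqueeze(-1)).sum(
            dim=1
        )
        return y.view(*orig_shape)

It would then replace self.mlp = PhiMLP(config) inside PhiDecoderLayer, in the same way the Qwen2 snippet above swaps out Qwen2MLP.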

PhilipMay commented 5 months ago

Hey @cg123 - can you perhaps help to implement this?

v-prgmr commented 4 months ago

@PhilipMay did you get through this?

PhilipMay commented 4 months ago

@PhilipMay did you get through this?

No, but I had success with the llamafied Phi-3 version. See here: https://huggingface.co/PhilipMay/Phi-3-MoE-mini-4k-instruct-raw

PhilipMay commented 3 months ago

When I merge phi-3 I get this error btw:

mergekit-moe --trust-remote-code ./phi3-merge-2.yml phi3-merge-2
configuration_phi3.py: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.4k/10.4k [00:00<00:00, 14.7MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
ERROR:root:No output architecture found that is compatible with the given models.
ERROR:root:All supported output architectures:
ERROR:root:  * Mixtral
ERROR:root:  * DeepSeek MoE
ERROR:root:  * Qwen MoE

v-prgmr commented 2 months ago

@PhilipMay @316usman @fakerybakery @GiacomoLeoneMaria I was finally able to merge two phi-2 experts. If you are still looking to use mergekit for this, check out the phi2xtral branch here: https://github.com/v-prgmr/mergekit/tree/phi2xtral

Special thanks to @Aratako; I used his fork as a reference and tweaked it a little to make the Phi2MoE work.

316usman commented 1 month ago

Thanks @v-prgmr. Please also mention what your two models were fine-tuned for and what the results were.

v-prgmr commented 1 month ago

@316usman I used two fine-tuned phi-2 models. Any phi-2 model with layer names matching the "Llama layers naming convention" works with my fork.
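
For anyone unsure whether their checkpoint follows that convention, one quick way to inspect the tensor names (just a convenience snippet, not part of mergekit):

# List a checkpoint's layer-0 and top-level tensor names to see whether they
# follow the Llama-style layout (model.embed_tokens.weight,
# model.layers.0.mlp.gate_proj.weight, model.norm.weight, ...).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)
for name in model.state_dict():
    if ".layers.0." in name or ".layers." not in name:
        print(name)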