arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Merging Two Models of Different Architectures #42

Open fakerybakery opened 10 months ago

fakerybakery commented 10 months ago

Hi! Might it be possible to merge Mistral and Llama? Thank you!

brucethemoose commented 10 months ago

If the models are precisely the same size, this is theoretically what git-rebasin can do. There is an older branch for it.

cg123 commented 10 months ago

In principle it's possible to merge a Mistral model and a Llama model if they are the same size (hidden dimensions, number of layers, attention/KV heads). Unfortunately, Mistral 7B has a different configuration from Llama.
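For reference, one quick way to check whether two checkpoints line up on those dimensions is to diff their configs. This is a minimal sketch using `transformers.AutoConfig`; the checkpoint names are only examples (the Llama repo is gated and needs an access token):

```python
from transformers import AutoConfig

# Illustrative checkpoints; substitute the two models you want to compare.
cfg_a = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
cfg_b = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Fields that have to agree for a naive parameter-wise merge to be well-defined.
fields = [
    "hidden_size",
    "intermediate_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "vocab_size",
]

for f in fields:
    va, vb = getattr(cfg_a, f, None), getattr(cfg_b, f, None)
    print(f"{f}: {va} vs {vb} -> {'OK' if va == vb else 'MISMATCH'}")
```

Any `MISMATCH` line means the two checkpoints can't be merged tensor-by-tensor without extra machinery.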

There are more advanced techniques out there that have had success merging models of different sizes - see otfusion for example. I'm looking into adding the infrastructure necessary to implement techniques along these lines but that's a ways off in the future.

hasan9090 commented 8 months ago

@cg123 How is the example TIES config possible then? AFAIK it merges Orca Mini, which is Llama-2-based, with WizardMath (Mistral-based). Similarly, I don't get why the mixing of architectures in this article (Mistral-7B, WizardMath-7B and CodeLlama-7B with TIES) is even possible: https://slgero.medium.com/merge-large-language-models-29897aeb1d1a

Either there was an update in the meantime or I am missing something? I am trying to merge Mistral-7B-Instruct with ToolLLaMA but have had no success yet with linear or TIES. Linear is clear, because it checks that the tensor shapes match, but why would TIES not work? I get `RuntimeError: CUDA error: uncorrectable ECC error encountered`.
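For context, the example TIES config being discussed has roughly this shape. This is only a sketch following mergekit's YAML config format; the model names, weights, and densities are illustrative, not a recommendation:

```yaml
# config.yml -- illustrative TIES merge of two Mistral-based fine-tunes
models:
  - model: WizardLM/WizardMath-7B-V1.1          # example fine-tune
    parameters:
      density: 0.5
      weight: 0.5
  - model: mistralai/Mistral-7B-Instruct-v0.2   # example fine-tune
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
dtype: float16
```

A config like this is run with something along the lines of `mergekit-yaml config.yml ./merged-model`. Note that all three models here share the Mistral architecture; the question in this thread is what happens when they don't.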

brucethemoose commented 8 months ago

That article is probably wrong then?

cg123 is working on integrating the infrastructure for git-rebasin merging, which should theoretically enable what you want, @hasan9090. This method has already proven itself, to some degree, with Stable Diffusion.

hasan9090 commented 8 months ago

I don't think the articles are all wrong. I found that the linear merge method has a hard restriction requiring identical tensor sizes, so the architectures have to be a 1:1 match. With other methods like TIES, however, I was able to merge models of different architectures, just like the authors of those articles. The script does print warnings while running, telling which parameters from which specific model were skipped during merging.

So it seems to be possible; the question is what really happens when the tensor sizes don't match, e.g. for a Llama-2/Mistral merge. Llama-2's MLP intermediate size is 11008 while Mistral's is 14336. My assumption is that the mismatched parameters from the Mistral model are simply skipped during merging, but it would be great to get a more detailed explanation of this from @cg123 if possible.
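To see concretely which tensors line up and which would have to be skipped in a cross-architecture merge, one can diff the parameter shapes of the two checkpoints without downloading any weights. This is a sketch assuming `transformers` and `accelerate` are installed; the checkpoint names are illustrative (the Llama repo is gated):

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM


def shape_map(repo_id: str) -> dict:
    """Return {parameter name: shape} for a checkpoint, using only its config."""
    cfg = AutoConfig.from_pretrained(repo_id)
    with init_empty_weights():  # build on the meta device, no weights downloaded
        model = AutoModelForCausalLM.from_config(cfg)
    return {name: tuple(p.shape) for name, p in model.named_parameters()}


# Illustrative checkpoints; substitute the two models you are trying to merge.
shapes_a = shape_map("meta-llama/Llama-2-7b-hf")
shapes_b = shape_map("mistralai/Mistral-7B-v0.1")

for name, shape_a in shapes_a.items():
    shape_b = shapes_b.get(name)
    if shape_b is None:
        print(f"{name}: missing in the second model")
    elif shape_a != shape_b:
        print(f"{name}: shape mismatch {shape_a} vs {shape_b}")
```

For this particular pair the mismatches show up in the MLP projections (intermediate size 11008 vs 14336) and the K/V projections (32 vs 8 key/value heads), which is consistent with the "skipped parameter" warnings described above.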