arcee-ai / mergekit

Tools for merging pretrained large language models.
GNU Lesser General Public License v3.0

Merging two mistral based models with different architectures. Looking for some guidance. #401

Open AshD opened 2 months ago

AshD commented 2 months ago

I want to merge Mistral Large with https://huggingface.co/softwareweaver/Twilight-Miqu-146B by adding some layers from Twilight Miqu to Mistral Large using the passthrough method. Is there a better way to do this?

The merge succeeds when using --allow-crimes, but the resulting GGUF model fails to run, and so does loading it with transformers.

GGUF runtime error: RuntimeError: shape '[96, 2, 42, 8192]' is invalid for input of size 67108864

Transformers loading error: size mismatch for model.layers.151.mlp.gate_proj.weight: copying a param with shape torch.Size([28672, 12288]) from checkpoint, the shape in current model is torch.Size([28672, 8192])

Merge config:

dtype: bfloat16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 43]
    model: mistralai/Mistral-Large-Instruct-2407
- sources:
  - layer_range: [5, 35]
    model: softwareweaver/Twilight-Miqu-146B
- sources:
  - layer_range: [80, 120]
    model: softwareweaver/Twilight-Miqu-146B
- sources:
  - layer_range: [44, 87]
    model: mistralai/Mistral-Large-Instruct-2407
cg123 commented 1 month ago

This is an expected failure. Miqu and Mistral Large have different hidden state sizes, so their layers can't be used interchangeably. In general, models need to be of the same architecture and family to produce a valid result.
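As a quick sanity check before attempting a merge like this, here is a minimal sketch (assuming the transformers library is installed and both repos are accessible, which may require accepting the Mistral Large license and logging in) that compares the two configs without downloading any weights:

from transformers import AutoConfig

# Compare the dimensions that must match for passthrough layer stacking.
# Only config.json is fetched, so no weights are downloaded.
for name in [
    "mistralai/Mistral-Large-Instruct-2407",
    "softwareweaver/Twilight-Miqu-146B",
]:
    cfg = AutoConfig.from_pretrained(name)
    print(
        f"{name}: hidden_size={cfg.hidden_size}, "
        f"intermediate_size={cfg.intermediate_size}, "
        f"num_hidden_layers={cfg.num_hidden_layers}"
    )

The size mismatch reported above (12288 vs. 8192 on gate_proj) reflects exactly this: the two families use different hidden sizes, so a passthrough merge can only interleave layers from models whose configs report the same hidden_size.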