What is this?
This PR introduces a way to merge two models via their activations and hidden states, computed on a tiny sample of data. These activations and hidden states are used to build correlation matrices, which in turn yield a permutation and inverse-permutation matrix for the weights of each model; the permuted weights are then combined.
This PR consists of three main scripts (a rough sketch of the overall idea follows this list):
1. The first generates the activations/hidden states for each space.
2. The second generates a permutation and inverse-permutation pair for each space.
3. The third applies the permutation and/or inverse permutation to each weight, based on each space and the weights connected to it, and then combines the weights.
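The overall idea, in a rough sketch (this is not the code from the PR's scripts; linear_sum_assignment, the helper names, and the simple 50/50 average are illustrative assumptions):

import torch
from scipy.optimize import linear_sum_assignment


def permutation_from_activations(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    # acts_*: (num_samples, hidden_dim) activations collected for one space
    hidden = acts_a.shape[1]
    # Correlation between every feature of model A and every feature of model B
    corr = torch.corrcoef(torch.cat([acts_a, acts_b], dim=1).T)
    cross = corr[:hidden, hidden:]
    # Choose the feature matching that maximizes total correlation
    rows, cols = linear_sum_assignment(cross.cpu().numpy(), maximize=True)
    perm = torch.zeros(hidden, hidden)
    perm[rows, cols] = 1.0
    return perm


def merge_weights(w_a: torch.Tensor, w_b: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
    # Permute model B's output features into model A's basis, then average;
    # weights that read from this space would instead get the inverse permutation (perm.T)
    return 0.5 * (w_a + perm @ w_b)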
Assumptions
The models to be merged are of the same architecture and have an equal block/layer count.
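A quick, hypothetical sanity check for this assumption (the model names below are placeholders):

from transformers import AutoConfig

cfg_a = AutoConfig.from_pretrained("model-a")
cfg_b = AutoConfig.from_pretrained("model-b")
assert cfg_a.architectures == cfg_b.architectures, "architectures differ"
assert cfg_a.num_hidden_layers == cfg_b.num_hidden_layers, "layer counts differ"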
Testing
To test this, we need the mergekit/scripts/random_permuter.py script from the rope-alignment branch (the final inference script, test_by_gen.py, follows the bash commands below).
import sys

import torch
from transformers import pipeline

model = sys.argv[1]
pipe = pipeline(
    "text-generation", model=model, torch_dtype=torch.bfloat16, device_map="auto"
)

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a helpful chatbot who pretends to be Richard Feynman",
    },
    {"role": "user", "content": "Could you tell me about the challenger disaster ?"},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipe(
    prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95
)
print(outputs[0]["generated_text"])
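Assuming the snippet above is saved as test_by_gen.py, it can be run against the merged model with, for example:

python test_by_gen.py path/to/merged-model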
If all goes well, you should see the following output (or something along those lines):
Things that couldn't make it into the final PR
On-the-fly handling of models with grouped query attention. This hasn't been tested enough for this release, but will be in the near future. For now, users will have to resort to using this script first:
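The preprocessing script itself isn't reproduced here; as a rough illustration of why grouped query attention needs special handling (this is an assumption about the approach, not the referenced script), one option is to expand each grouped K/V head so that every query head gets its own copy, making the per-head weights line up for permutation:

import torch


def expand_kv_proj(w_kv: torch.Tensor, num_kv_heads: int, num_q_heads: int) -> torch.Tensor:
    # w_kv: (num_kv_heads * head_dim, hidden_dim) K or V projection weight
    head_dim = w_kv.shape[0] // num_kv_heads
    repeats = num_q_heads // num_kv_heads
    heads = w_kv.view(num_kv_heads, head_dim, -1)
    # Repeat each grouped K/V head so the head count matches the query heads
    return heads.repeat_interleave(repeats, dim=0).reshape(num_q_heads * head_dim, -1)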
Note:
Because this was copied over from another branch (wip-zipit), @shamanez's contributions to the PR are missing, so this is an explicit acknowledgement that @shamanez worked on this PR alongside the other authors.