Tools for merging pretrained large language models.
Need some help in merging same architectures, but with different tokens in their tokenizers #342

Open choprahetarth opened 3 months ago

choprahetarth commented 3 months ago

Hello! I have two models, CodeLlama-13b-Python and CodeLlama-13b, that I need to merge. The overall goal is to merge a model trained on Python with one trained on other languages. However, the biggest problem I am facing is this -

Traceback (most recent call last):
  File "/u/choprahetarth/all_files/model_merging/", line 22, in <module>
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/", line 92, in run_merge
    for _task, value in
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/", line 197, in run
    res = task.execute(**arguments)
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/merge_methods/", line 85, in execute
    expanded = torch.stack(expanded, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [32000, 5120] at entry 0 and [32016, 5120] at entry 1

Now the YAML file I have used to merge looks like this -

models:
  - model: meta-llama/CodeLlama-13b-Python-hf
    parameters:
      density: 0.5 # density gradient
      weight:
        - filter: embed_tokens
          value: 0
        - value: 1
  - model: meta-llama/CodeLlama-13b-hf
    parameters:
      density: 0.5 # density gradient
      weight:
        - filter: embed_tokens
          value: 0
        - value: 1
tokenizer_source: union
merge_method: dare_ties
base_model: meta-llama/CodeLlama-13b-hf
parameters:
  density: 0.5 # density gradient
  weight:
    - filter: embed_tokens
      value: 0
    - value: 1
  normalize: true
  int8_mask: true
dtype: float32

As far as I can see, the problem arises because the two models have a different number of tokens, so their embedding layers have different sizes even though all the other layers are the same.

Model 1 (python) - 
Layer: embed_tokens.weight | Size: torch.Size([32016, 5120])

Base Model - 
Layer: embed_tokens.weight | Size: torch.Size([32000, 5120])

How do I make sure that I IGNORE this layer, and keep everything else as is?

Also, the model_stock method somehow does work. I am just not sure why.

Other than the slight difference in the embedding layer size (which likely corresponds to a different vocabulary size), all the other layer dimensions are exactly the same between the two models.
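
A quick way to confirm that only the vocabulary-sized tensors differ is to diff the parameter shapes of the two checkpoints. The sketch below is a minimal, framework-free illustration: `diff_shapes` is a hypothetical helper (not part of mergekit), and the shape dictionaries are hand-written stand-ins for what `{name: tuple(p.shape) for name, p in model.named_parameters()}` would produce, using the sizes reported in this thread.

```python
def diff_shapes(shapes_a, shapes_b):
    """Return {name: (shape_a, shape_b)} for tensors whose shapes differ.

    A tensor missing from one side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(set(shapes_a) | set(shapes_b)):
        a, b = shapes_a.get(name), shapes_b.get(name)
        if a != b:
            mismatches[name] = (a, b)
    return mismatches


# Hand-written stand-ins; the sizes mirror the ones reported in this thread.
base_shapes = {
    "model.embed_tokens.weight": (32000, 5120),
    "model.layers.0.self_attn.q_proj.weight": (5120, 5120),
    "lm_head.weight": (32000, 5120),
}
python_shapes = {
    "model.embed_tokens.weight": (32016, 5120),
    "model.layers.0.self_attn.q_proj.weight": (5120, 5120),
    "lm_head.weight": (32016, 5120),
}

mismatches = diff_shapes(base_shapes, python_shapes)
for name, (a, b) in mismatches.items():
    print(f"{name}: {a} vs {b}")
```

Note that in a Llama-style model the output head (`lm_head.weight`) has the same vocabulary dimension as `embed_tokens.weight`, so it would show up in the same diff.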

choprahetarth commented 3 months ago

slices:
  - sources:
    - model: meta-llama/CodeLlama-13b-Python-hf
      layer_range: [2, 39]
    - model: meta-llama/CodeLlama-13b-hf
      layer_range: [2, 39]
tokenizer_source: union
merge_method: slerp
base_model: meta-llama/CodeLlama-13b-hf
layer_range: [2, 39]
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5 # fallback for rest of tensors
  normalize: true
  int8_mask: true
dtype: float32

I have also tried this configuration and got the same result.

cg123 commented 3 months ago

Could you please try this merge using the branch from #334? I believe it should fix this.

choprahetarth commented 3 months ago

Thank you so much Charles (and for the amazing library as well!). However, I am getting this error after checking out the tokenizer_again branch -

Executing graph:   0%|          | 0/1820 [00:00<?, ?it/s]WARNING:root:Token '▁<EOT>' present in meta-llama/CodeLlama-13b-Python-hf tokenizer but >= vocab_size
WARNING:root:Token '▁<MID>' present in meta-llama/CodeLlama-13b-Python-hf tokenizer but >= vocab_size
WARNING:root:Token '▁<PRE>' present in meta-llama/CodeLlama-13b-Python-hf tokenizer but >= vocab_size
WARNING:root:Token '▁<SUF>' present in meta-llama/CodeLlama-13b-Python-hf tokenizer but >= vocab_size

Building tokenizer permutations:   0%|          | 0/2 [00:00<?, ?it/s]WARNING:root:meta-llama/CodeLlama-13b-Python-hf token '▁<EOT>' has index 32003>31999 (padding?)
WARNING:root:meta-llama/CodeLlama-13b-Python-hf token '▁<PRE>' has index 32000>31999 (padding?)
WARNING:root:meta-llama/CodeLlama-13b-Python-hf token '▁<SUF>' has index 32002>31999 (padding?)
WARNING:root:meta-llama/CodeLlama-13b-Python-hf token '▁<MID>' has index 32001>31999 (padding?)

Building tokenizer permutations: 100%|██████████| 2/2 [00:00<00:00, 12.86it/s]
Building tokenizer permutations: 100%|██████████| 2/2 [00:00<00:00, 12.85it/s]

Executing graph:   0%|          | 2/1820 [00:01<20:18,  1.49it/s]
Executing graph:   0%|          | 3/1820 [00:05<1:07:27,  2.23s/it]
Executing graph:   0%|          | 4/1820 [00:09<1:20:57,  2.67s/it]
Executing graph:   0%|          | 5/1820 [00:09<55:35,  1.84s/it]  
Traceback (most recent call last):
  File "/u/choprahetarth/all_files/model_merging/", line 22, in <module>
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/", line 95, in run_merge
    for _task, value in
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/", line 197, in run
    res = task.execute(**arguments)
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/tokenizer/", line 63, in execute
    tokens_to_average = self.assign_embedding_sources(
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/tokenizer/", line 127, in assign_embedding_sources
    has_token = [p[token_id] >= 0 for p in permutation_list]
  File "/u/choprahetarth/all_files/model_merging/mergekit/mergekit/tokenizer/", line 127, in <listcomp>
    has_token = [p[token_id] >= 0 for p in permutation_list]
KeyError: 32010
srun: error: gpub002: task 0: Exited with exit code 1
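
One plausible reading of the `KeyError: 32010` is that the embedding matrix has rows with no corresponding tokenizer entry (the warnings above mention padding), so a lookup keyed by token id finds nothing. The following is a minimal sketch of that failure mode, not mergekit's actual code: the dictionaries, the `has_token_strict` and `has_token_tolerant` helpers, and the use of -1 as a "missing" marker are all assumptions made for illustration.

```python
# Hypothetical per-model permutations: merged token id -> row in that model's
# embedding matrix, with -1 meaning "this model does not have the token".
permutation_list = [
    {0: 0, 1: 1, 32000: -1},     # model without the extra special token
    {0: 0, 1: 1, 32000: 32000},  # model with it
]


def has_token_strict(token_id, permutations):
    """Direct indexing, as in the traceback: raises KeyError for unknown ids."""
    return [p[token_id] >= 0 for p in permutations]


def has_token_tolerant(token_id, permutations):
    """Treat ids missing from a permutation as absent instead of failing."""
    return [p.get(token_id, -1) >= 0 for p in permutations]


print(has_token_strict(32000, permutation_list))

try:
    # An id that appears in neither permutation, e.g. a padded embedding row.
    has_token_strict(32010, permutation_list)
except KeyError as err:
    print(f"KeyError: {err}")

print(has_token_tolerant(32010, permutation_list))
```

Whether skipping such ids is the right fix inside mergekit depends on how the branch builds its permutations; the sketch only shows why an id without a tokenizer entry can trip a dictionary lookup.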

The config used is ->

models:
  - model: meta-llama/CodeLlama-13b-Python-hf
    parameters:
      density: 0.5 # density gradient
      weight:
        - filter: embed_tokens
          value: 0
        - value: 1
  - model: meta-llama/CodeLlama-13b-hf
    parameters:
      density: 0.5 # density gradient
      weight:
        - filter: embed_tokens
          value: 0
        - value: 1
tokenizer_source: union
merge_method: dare_ties
base_model: meta-llama/CodeLlama-13b-hf
parameters:
  density: 0.5 # density gradient
  weight:
    - filter: embed_tokens
      value: 0
    - value: 1
  normalize: true
  int8_mask: true
dtype: float32

choprahetarth commented 3 months ago

Okay, so the only workaround that I have (somehow) been able to use is to manually resize the model's embedding layer with Hugging Face transformers ->

from transformers import AutoTokenizer, LlamaForCausalLM

model = "meta-llama/CodeLlama-13b-hf"
print("================================FIRST MODEL STARTS HERE=================================")
tokenizer_normal = AutoTokenizer.from_pretrained(model)
# Load the model
model_normal = LlamaForCausalLM.from_pretrained(model)

# Print the size of all layers in the model
for name, param in model_normal.named_parameters():
    print(f"Layer: {name} | Size: {param.size()}")

print("============================SECOND MODEL STARTS HERE====================================")

model = "meta-llama/CodeLlama-13b-Python-hf"
tokenizer_python = AutoTokenizer.from_pretrained(model)
# Load the model
model_python = LlamaForCausalLM.from_pretrained(model)

# Resize the second model's embeddings to match the first model's vocabulary
# dimension (assumption: the exact resize call is not shown in this snippet)
target_vocab = model_normal.get_input_embeddings().weight.shape[0]
model_python.resize_token_embeddings(target_vocab)

# Print the size of all layers in the model
for name, param in model_python.named_parameters():
    print(f"Layer: {name} | Size: {param.size()}")

# Check that the model's architecture is LlamaForCausalLM before pushing
if model_python.config.architectures[0] == "LlamaForCausalLM":
    # Push the second model to the Hugging Face Hub
    model_python.push_to_hub("codellama-13b-hf-truncated-embeddings")
else:
    print("The model's architecture is not LlamaForCausalLM. Not pushing to hub.")

and then use this ->

  - model : choprahetarth/codellama-13b-hf-truncated-embeddings
      density: 0.5 # density gradient
        - value: 1
  - model: meta-llama/CodeLlama-13b-hf
      density: 0.5 # density gradient
        - value: 1
tokenizer_source: union
merge_method: dare_ties
base_model: meta-llama/CodeLlama-13b-hf
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    # - filter: embed_tokens
    #   value: 0 # use lm_head from 8b_stage2_final
    - value: 0.5
  embed_slerp: true 
  normalize: true
  int8_mask: true
dtype: float16

to merge them together.
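
Conceptually, the resizing step in this workaround either drops trailing embedding rows or appends new ones, so that both checkpoints end up with the same vocabulary dimension. Below is a small framework-free sketch of that idea, using nested lists as a stand-in for the `[vocab, hidden]` weight matrix. `resize_rows` is a hypothetical helper, and filling new rows with the mean of the existing rows is just one possible initialization, assumed here for illustration. With real models, `transformers` exposes `resize_token_embeddings` for this.

```python
def resize_rows(matrix, new_rows):
    """Return a copy of `matrix` with exactly `new_rows` rows.

    Extra trailing rows are dropped; missing rows are filled with the
    column-wise mean of the existing rows (an assumed initialization).
    """
    if new_rows < 0:
        raise ValueError("new_rows must be non-negative")
    if new_rows <= len(matrix):
        return [row[:] for row in matrix[:new_rows]]
    if not matrix:
        raise ValueError("cannot grow an empty matrix")
    width = len(matrix[0])
    mean_row = [sum(row[c] for row in matrix) / len(matrix) for c in range(width)]
    padding = [mean_row[:] for _ in range(new_rows - len(matrix))]
    return [row[:] for row in matrix] + padding


# Tiny stand-in: 4 "tokens" with hidden size 2.
embed = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

truncated = resize_rows(embed, 3)  # drop the last row
grown = resize_rows(embed, 6)      # append two mean-initialized rows

print(len(truncated), len(grown))
print(grown[-1])
```

Truncating is only safe when the dropped rows belong to tokens the merged model will never see, such as padding rows, and the output head has the same vocabulary dimension, so it needs the same treatment.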

@cg123 I was wondering where exactly I should add this in the code (within mergekit, possibly as a PR/branch). The library is wayyy too complex for me to wrap my head around without documentation (as much as I respect you for writing it, I mean, it is amazing!). Just a small pointer on where I could add this as a contribution would be nice!