arcee-ai / mergekit

Tools for merging pretrained large language models.

Why do two different options generate models of different sizes? #49

Closed · DopeorNope-Lee closed this issue 10 months ago

DopeorNope-Lee commented 11 months ago

Option 1

slices:
  - sources:
    - model: AIDC-ai-business/Marcoroni-7B-v3
      layer_range: [0, 24]
  - sources:
    - model: Toten5/Marcoroni-neural-chat-7B-v2
      layer_range: [8, 32]
merge_method: passthrough
dtype: bfloat16

Option 2

slices:
  - sources:
      - model: AIDC-ai-business/Marcoroni-7B-v3
        layer_range: [0, 24]
      - model: Toten5/Marcoroni-neural-chat-7B-v2
        layer_range: [8, 32]
merge_method: slerp
base_model: AIDC-ai-business/Marcoroni-7B-v3
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: float16

I generated two models using two different config options.

The first config produced a 10.7B model; the second produced a 5.5B model.

I used the same base models and the same layer information, yet got different results.

Could anyone explain why?

Also, for the slerp merge, are there other options for the parameters (especially for filter: besides mlp and self_attn, are there others)?

Thanks!

cg123 commented 10 months ago

The difference is that in your first config, you're defining two output slices:

slices:
  - sources: # output slice #1
    - model: AIDC-ai-business/Marcoroni-7B-v3
      layer_range: [0, 24]
  - sources: # output slice #2
    - model: Toten5/Marcoroni-neural-chat-7B-v2
      layer_range: [8, 32]

These simply get stacked on top of each other: 24 layers from the first slice plus 24 from the second, giving you a final model with 48 layers instead of the 32 that a 7B model has (hence the ~10.7B parameter count). In your second config, you're defining a single output slice that combines two input slices:

slices:
  - sources: # output slice #1
      - model: AIDC-ai-business/Marcoroni-7B-v3 # input slice #1
        layer_range: [0, 24]
      - model: Toten5/Marcoroni-neural-chat-7B-v2 # input slice #2
        layer_range: [8, 32]

The two input slices are combined using the merge method you specified (slerp, in this case). That means layer 0 of AIDC-ai-business/Marcoroni-7B-v3 will be SLERP-merged with layer 8 of Toten5/Marcoroni-neural-chat-7B-v2, layer 1 with layer 9, and so on. The end result is the size of a single input slice, so just 24 layers.
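To make the size difference concrete, here is a minimal Python sketch of both behaviors. The slerp helper below is a simplified stand-in for illustration, not mergekit's actual implementation; it only assumes that layer_range is half-open, as in the configs above.

import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Spherical linear interpolation between two weight tensors (illustrative only).
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_dir = a_flat / (a_flat.norm() + eps)
    b_dir = b_flat / (b_flat.norm() + eps)
    omega = torch.arccos(a_dir.dot(b_dir).clamp(-1.0, 1.0))  # angle between the tensors
    so = torch.sin(omega)
    if so.abs() < eps:  # nearly parallel: fall back to linear interpolation
        merged = (1.0 - t) * a_flat + t * b_flat
    else:
        merged = (torch.sin((1.0 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return merged.reshape(a.shape).to(a.dtype)

# Option 1 (passthrough): output slices are concatenated, so layer counts add up.
range_a, range_b = (0, 24), (8, 32)  # half-open layer ranges from the configs
stacked = (range_a[1] - range_a[0]) + (range_b[1] - range_b[0])
print(stacked)  # 48 layers -> the ~10.7B model

# Option 2 (slerp): layers are paired index by index and merged pairwise,
# so the output is only as deep as one input slice.
pairs = list(zip(range(*range_a), range(*range_b)))
print(len(pairs))           # 24 layers -> the ~5.5B model
print(pairs[0], pairs[-1])  # (0, 8) (23, 31)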

As for the filter options: filter works by searching for the substring you specify in each tensor name, so the available filters depend on the architecture you're merging. If you want to know all of the tensor names in a Mistral model, you can see a list on huggingface here.
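For example, here is a rough sketch of how that substring matching could pick a t value per tensor, using standard Mistral tensor names. The rule-matching loop is a simplified approximation for illustration, not mergekit's actual code.

# Simplified illustration of how filter rules could be applied per tensor.
t_rules = [
    ("self_attn", [0, 0.5, 0.3, 0.7, 1]),  # gradient for attention tensors
    ("mlp",       [1, 0.5, 0.7, 0.3, 0]),  # gradient for MLP tensors
    (None,        0.5),                    # default for everything else
]

tensor_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "lm_head.weight",
]

for name in tensor_names:
    for substring, value in t_rules:
        if substring is None or substring in name:
            print(f"{name}: t = {value}")
            break

Tensors that match neither filter (embeddings, layer norms, lm_head) fall through to the default t of 0.5, and a list of values acts as a gradient interpolated across the layers being merged.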

Hope this helps!