arcee-ai / mergekit

Tools for merging pretrained large language models.

What is the density for TIES when using DARE by setting `merge_method: dare_ties`? #130

Open syGOAT opened 9 months ago

syGOAT commented 9 months ago

Congratulations on the breakthrough you've achieved in model merging! I'd like to ask a question. I'm using dare_ties to merge some models. Here is my YAML file:

```yaml
models:
  - model: mistralai/Mistral-7B-v0.1
    # No parameters necessary for base model
  - model: samir-fama/SamirGPT-v1
    parameters:
      density: 0.53
      weight: 0.4
  - model: abacusai/Slerp-CM-mist-dpo
    parameters:
      density: 0.53
      weight: 0.3
  - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.2
    parameters:
      density: 0.53
      weight: 0.3
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  int8_mask: true
dtype: bfloat16
```

I believe the `density` here refers to the fraction of delta parameters randomly retained by DARE. What is the density during the TIES stage after DARE is applied? Is it the same as the DARE density, or is there a specific way to set it?
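For reference, this is what I understand `density: 0.53` to mean for the DARE step (just the arithmetic from the DARE paper, not mergekit's actual code; torch is used only for illustration):

```python
import torch

density = 0.53                      # value from the config above
delta = torch.randn(1_000_000)      # stand-in for a (finetuned - base) delta tensor

# Keep each delta parameter with probability `density`, drop the rest...
mask = torch.bernoulli(torch.full_like(delta, density))
# ...and rescale the survivors by 1/density so the expected delta is unchanged.
sparse_delta = delta * mask / density

print(f"kept fraction: {mask.mean().item():.3f}")   # ~0.53
```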

cg123 commented 9 months ago

Thank you! I'm glad people are finding it so useful.

That's exactly correct - density is the fraction of delta parameters retained. When using DARE with TIES there is no second sparsification step; the sign-consensus step from TIES is simply applied to the sparsified, scaled deltas produced by DARE. So the density is the same throughout.
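Roughly, the flow looks like this. This is only a simplified sketch, not the exact mergekit implementation; the helper names and the plain disjoint mean at the end are illustrative:

```python
import torch

def dare_sparsify(delta: torch.Tensor, density: float) -> torch.Tensor:
    # DARE: drop each delta parameter with probability (1 - density),
    # then rescale the survivors by 1/density to preserve the expectation.
    mask = torch.bernoulli(torch.full_like(delta, density))
    return delta * mask / density

def dare_ties_merge(base: torch.Tensor, deltas: list[torch.Tensor],
                    densities: list[float], weights: list[float]) -> torch.Tensor:
    # 1) DARE sparsification -- the only place density is used.
    sparse = [dare_sparsify(d, rho) * w
              for d, rho, w in zip(deltas, densities, weights)]
    stacked = torch.stack(sparse)

    # 2) TIES sign election: per-parameter majority sign of the summed deltas.
    elected_sign = torch.sign(stacked.sum(dim=0))

    # 3) Disjoint merge: average only the entries that agree with the elected sign.
    #    (mergekit's exact weighting/normalization may differ; this is a sketch.)
    agree = torch.sign(stacked) == elected_sign
    count = agree.sum(dim=0).clamp(min=1)
    merged_delta = (stacked * agree).sum(dim=0) / count

    return base + merged_delta
```

The point is that `density` only ever appears in the DARE step; the Elect and Disjoint Merge steps that follow never re-sparsify anything.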

syGOAT commented 9 months ago

Thanks for your explanation! I think I understand your point now. In the original TIES-Merging paper (https://arxiv.org/abs/2306.01708), TIES consists of three steps: Trim, Elect, and Disjoint Merge. Trim keeps the top-k% of delta parameters by magnitude, Elect builds a per-parameter sign vector, and Disjoint Merge computes a disjoint mean over the values that agree with the elected sign. In mergekit, dare_ties means using DARE in place of Trim, with the other steps (Elect, Disjoint Merge) remaining the same.
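So the only difference is the sparsification step itself, roughly like this (illustrative only, not mergekit's code):

```python
import torch

density = 0.53
delta = torch.randn(4096)            # one flattened delta tensor, for illustration

# TIES "Trim": deterministically keep the top `density` fraction by magnitude.
k = int(density * delta.numel())
threshold = delta.abs().topk(k).values.min()
trimmed = torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))

# DARE: randomly keep each parameter with probability `density`, then rescale.
mask = torch.bernoulli(torch.full_like(delta, density))
dropped = delta * mask / density

# Either result is then fed to the same Elect + Disjoint Merge steps.
```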