arcee-ai / mergekit

Tools for merging pretrained large language models.

Data point for Dare Ties #26

brucethemoose opened this issue 9 months ago

brucethemoose commented 9 months ago

I uploaded three merges to HF that are identical in every way except density, and interestingly the higher-density merges perform significantly better:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
https://huggingface.co/brucethemoose/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties-HighDensity
https://huggingface.co/brucethemoose/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties-ExtremeDensity
https://huggingface.co/brucethemoose/CaPlatTessDolXaBoros-Yi-34B-200K-DARE-Ties

Even the "extreme" density config scores far higher than the modest-density config, and its perplexity is much lower as well:

Very high density:

models:
  - model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
    # no parameters necessary for base model
  - model: /home/alpha/Storage/Models/Raw/migtissera_Tess-34B-v1.4
    parameters:
      weight: 0.19
      density: 0.83
  - model: /home/alpha//Storage/Models/Raw/bhenrym14_airoboros-3_1-yi-34b-200k
    parameters:
      weight: 0.14
      density: 0.6
  - model: /home/alpha/Storage/Models/Raw/Nous-Capybara-34B
    parameters:
      weight: 0.19
      density: 0.83
  - model: /home/alpha/Storage/Models/Raw/kyujinpy_PlatYi-34B-200K-Q
    parameters:
      weight: 0.14
      density: 0.6
  - model: /home/alpha/FastModels/ehartford_dolphin-2.2-yi-34b-200k
    parameters:
      weight: 0.19
      density: 0.83
  - model: /home/alpha/FastModels/fblgit_una-xaberius-34b-v1beta
    parameters:
      weight: 0.15
      density: 0.08
merge_method: dare_ties
base_model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
parameters:
  int8_mask: true
dtype: bfloat16

"Normal" density:

models:
  - model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
    # no parameters necessary for base model
  - model: /home/alpha/Storage/Models/Raw/migtissera_Tess-34B-v1.4
    parameters:
      weight: 0.19
      density: 0.44
  - model: /home/alpha//Storage/Models/Raw/bhenrym14_airoboros-3_1-yi-34b-200k
    parameters:
      weight: 0.14
      density: 0.34
  - model: /home/alpha/Storage/Models/Raw/Nous-Capybara-34B
    parameters:
      weight: 0.19
      density: 0.44
  - model: /home/alpha/Storage/Models/Raw/kyujinpy_PlatYi-34B-200K-Q
    parameters:
      weight: 0.14
      density: 0.34
  - model: /home/alpha/FastModels/ehartford_dolphin-2.2-yi-34b-200k
    parameters:
      weight: 0.19
      density: 0.44
  - model: /home/alpha/FastModels/fblgit_una-xaberius-34b-v1beta
    parameters:
      weight: 0.15
      density: 0.08
merge_method: dare_ties
base_model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
parameters:
  int8_mask: true
dtype: bfloat16

hahuyhoang411 commented 9 months ago

Thank you, these are wonderful findings. Can I ask what hardware you ran this YAML on, and do you use --cuda?

discordianbelle commented 9 months ago

Can confirm: athirdpath/Iambe-RP-DARE-20b-DENSE dramatically outperformed Iambe-RP-DARE-20b, with an average density of ~0.50 instead of ~0.25. Iambe-v2-DARE also gained a lot from going up to 0.66 density for noromaid, so this is reproducible.

brucethemoose commented 9 months ago

Thank you, these are wonderful findings. Can I ask what hardware you ran this YAML on, and do you use --cuda?

This was run on a 32GB RAM desktop with a lot of swap, lol.

I can merge 4x 34B models with --cuda on a 24GB GPU, with about 20GB of VRAM usage.

I think usage largely depends on shard size (smaller shards result in lower usage) and on whether any of the models being merged is saved as .bin files instead of .safetensors.
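
(For anyone trying to reproduce this: the command is just the standard mergekit-yaml entry point with the flag added, e.g. mergekit-yaml your-config.yml ./merged-model --cuda, where the config and output paths are placeholders.)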

cg123 commented 9 months ago

This is great data to have - thanks for running this experiment and sharing your results like this.

Higher density definitely seems like the strategy to use for this kind of merge. I've kicked my go-to values up to match yours and the merges are more consistently good.

One thing I've noticed with experiments on smaller models is that the randomness in DARE can give you a pretty big range of results. I've seen a difference of as much as 15% in evaluation loss when repeatedly running the same low-density merge of 6-layer classifiers. It could be that it's possible (but just unlikely) to get a really good low density merge. Or maybe the kind of stuff we want from a chat model just can't be captured as well in a low density delta.
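
To make the variance point concrete, here is a toy sketch of the DARE drop-and-rescale step in plain PyTorch (not mergekit's actual implementation; dare_sparsify is just an illustrative name). Each delta entry is kept with probability equal to the density and the survivors are rescaled by 1/density, so the expected delta is unchanged but the per-weight spread across repeated runs grows sharply as density shrinks:

import torch

def dare_sparsify(delta: torch.Tensor, density: float) -> torch.Tensor:
    # Keep each delta entry with probability `density`, then rescale the
    # survivors by 1/density so the sparsified delta is unbiased.
    mask = torch.bernoulli(torch.full_like(delta, density))
    return delta * mask / density

# Repeating the same sparsification shows how much more the result varies
# at low density, which lines up with the run-to-run swings in merge quality.
delta = torch.randn(4096)
for density in (0.83, 0.44, 0.08):
    runs = torch.stack([dare_sparsify(delta, density) for _ in range(32)])
    print(density, runs.std(dim=0).mean().item())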

Thanks for the data, and for putting more knowledge on how to make good merges out in the open.

brucethemoose commented 9 months ago

Oh yeah, I didn't even consider the non-determinism of the DARE merges.

Also note that the highest-scoring model actually used densities around 0.5-0.6, but the 0.6-0.83 merge was not far behind in evaluation, so maybe the sweet spot is somewhere around there:

models:
  - model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
    # no parameters necessary for base model
  - model: /home/alpha/Storage/Models/Raw/migtissera_Tess-34B-v1.4
    parameters:
      weight: 0.19
      density: 0.6
  - model: /home/alpha//Storage/Models/Raw/bhenrym14_airoboros-3_1-yi-34b-200k
    parameters:
      weight: 0.14
      density: 0.5
  - model: /home/alpha/Storage/Models/Raw/Nous-Capybara-34B
    parameters:
      weight: 0.19
      density: 0.6
  - model: /home/alpha/Storage/Models/Raw/kyujinpy_PlatYi-34B-200K-Q
    parameters:
      weight: 0.14
      density: 0.5
  - model: /home/alpha/FastModels/ehartford_dolphin-2.2-yi-34b-200k
    parameters:
      weight: 0.19
      density: 0.6
  - model: /home/alpha/FastModels/fblgit_una-xaberius-34b-v1beta
    parameters:
      weight: 0.15
      density: 0.08
merge_method: dare_ties
base_model: /home/alpha/Storage/Models/Raw/chargoddard_Yi-34B-200K-Llama
parameters:
  int8_mask: true
dtype: bfloat16