The DARE-TIES experiment.

David-AU-github commented 2 weeks ago

I just wanted to pass on some "lab" results using dare-ties and mistral nemo.

I created a triple dare-ties merge of 3 pass-through "instruct/fine" models.

Each instruct/fine tune uses the same merge format:

slices:
  - sources:
      - model: g:/11b/Mistral-Nemo-Instruct-2407-12B
        layer_range: [0, 14]
  - sources:
      - model: G:/11B/Rocinante-12B-v1.1
        layer_range: [8, 24]
        parameters:
          scale:
            - filter: o_proj
              value: 1
            - filter: down_proj
              value: 1
            - value: 1
  - sources:
      - model: g:/11b/Mistral-Nemo-Instruct-2407-12B
        layer_range: [14, 22]
        parameters:
          scale:
            - filter: o_proj
              value: .5
            - filter: down_proj
              value: .5
            - value: 1
  - sources:
      - model: g:/11b/Mistral-Nemo-Instruct-2407-12B
        layer_range: [22, 31]
        parameters:
          scale:
            - filter: o_proj
              value: .75
            - filter: down_proj
              value: .75
            - value: 1
  - sources:
      - model: G:/11B/Rocinante-12B-v1.1
        layer_range: [24, 40]
        parameters:
          scale:
            - filter: o_proj
              value: 1
            - filter: down_proj
              value: 1
            - value: 1
merge_method: passthrough
dtype: bfloat16

THE DARE-TIES:

models:
  - model: E:/MN-Rocinante-12B-v1.1-Instruct
  - model: E:/MN-magnum-v2.5-12b-kto-Instruct
    parameters:
      weight: .6
      density: .8
  - model: E:/MN-12B-Celeste-V1.9-Instruct
    parameters:
      weight: .38
      density: .6
merge_method: dare_ties
tokenizer_source: union
base_model: E:/MN-Rocinante-12B-v1.1-Instruct
dtype: bfloat16

What is interesting here is that EACH TIME I run the "dare-ties" merge it creates a slightly different or VERY DIFFERENT model, despite no changes to the models or the settings.

This shows up in both PPL and "real world" tests. PPL ranges from 7.7327 to 7.8024 ... and that is across just 10 generations.

Real-world testing of the "core" changes -> wow. Attribute, scale, word choice, sentence structure... changes across the board.

I am not sure if this is a mistral nemo artifact or not.

I then merged some of these 10 using breadcrumbs; wow is all I can say.

When everything is F32 ... they shine even brighter.

Enough generations, plus merging of the "best DNA", could create truly legendary model(s).

Just saying - job well done and then some!!!

NOTE: Models for "fine/instruct" and "DARE-TIES" supermerges are posted at my repo.

CasualDev242 commented 2 weeks ago

If DARE-Ties gives dramatically different results each time, maybe I don't understand it correctly, but that sounds less like a good thing and more like a bad thing.

David-AU-github commented 2 weeks ago

> If DARE-Ties gives dramatically different results each time, maybe I don't understand it correctly, but that sounds less like a good thing and more like a bad thing.

This all depends... In my first case it was bad, because I deleted the source and found out the hard way, and it was a great version. That being said, by creating 10+ versions, the "DNA" of each model can be mapped, and these can be combined to create stronger models with specific attributes while reducing the negative ones.

One of the open questions is: does this apply to other archs too? Llama2? 3? 3.1? Some of the other mergekit methods also involve this same type of "random pruning". I mapped these out after looking at the code to verify the operations.

A more interesting method or change may be pruning controls for DARE-TIES that limit the range.

cg123 commented 2 weeks ago

Thanks for sharing your results here!

DARE-TIES does have a randomized element, yeah - it's part of the algorithm by design. If you want more reproducible merges you can set a random seed by passing --random-seed <N> on the command line. I usually do when I'm iterating on a recipe that involves DARE.
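To illustrate why the seed matters, here is a minimal sketch of the random drop-and-rescale step that DARE is built around. It is a toy illustration only, not mergekit's actual implementation: the function name `dare_prune`, the toy `delta` values, and the seeds are all assumptions made up for this example. The `density` value of 0.8 mirrors the one used in the config above.

```python
import random


def dare_prune(delta, density, seed):
    """Toy DARE-style pruning (hypothetical helper, not mergekit code).

    Each entry of the delta (fine-tuned weights minus base weights) is kept
    with probability `density` and rescaled by 1/density; the rest are zeroed.
    """
    rng = random.Random(seed)  # all randomness comes from this seeded generator
    return [d / density if rng.random() < density else 0.0 for d in delta]


# Hypothetical stand-in for a flattened tensor of weight differences.
delta = [(i - 500) / 500 for i in range(1000)]

run_a = dare_prune(delta, density=0.8, seed=1)
run_b = dare_prune(delta, density=0.8, seed=2)
run_a_again = dare_prune(delta, density=0.8, seed=1)

# Different seeds drop different entries, so the pruned deltas differ.
print("seed 1 vs seed 2 identical:", run_a == run_b)
# Re-running with the same seed reproduces the same pruned delta.
print("seed 1 vs seed 1 identical:", run_a == run_a_again)
```

With no fixed seed, each run draws a fresh mask, which would be consistent with the run-to-run differences reported above; fixing the seed pins the mask.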

David-AU-github commented 1 week ago

> Thanks for sharing your results here!
>
> DARE-TIES does have a randomized element, yeah - it's part of the algorithm by design. If you want more reproducible merges you can set a random seed by passing --random-seed <N> on the command line. I usually do when I'm iterating on a recipe that involves DARE.

Thank you; that was one of the questions I had. Thanks again. I think there is so much untapped potential in mergekit yet to be discovered.

arcee-ai / mergekit

The DARE-TIES experiment. #411