jukofyork opened this issue 5 months ago
Finished the evaluations: both miqu-96b and miqu-120b have a poor score and are brain-damaged.
I've created 10 prompts and will test all 54 combinations to see if this helps the alternating-layer goliath-type merges.
I've already found that just switching the order of the alternating layers (Xwin-LM-70B-V0.1 --> Euryale-1.3-L2-70B vs Euryale-1.3-L2-70B --> Xwin-LM-70B-V0.1) seems to make a difference to the quality of the stories.
So I changed to use Euryale and Xwin-LM for the inner layers, since anything using o_proj > 1 or down_proj > 1 just created gibbering idiots:

q_proj = k_proj ∈ {sqrt(1/2), sqrt(3/4), 1}
o_proj ∈ {1/2, 3/4, 1}
down_proj ∈ {1/2, 3/4, 1}
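As a sanity check on the run count, the sweep can be enumerated like this (a quick sketch; the ordering labels are just illustrative names, not mergekit identifiers):

```python
# Enumerate the sweep: 3 x 3 x 3 = 27 scale combinations per layer
# ordering, times 2 orderings = 54 runs.
import itertools
import math

qk_scales = [math.sqrt(1/2), math.sqrt(3/4), 1.0]  # q_proj = k_proj
o_scales = [0.5, 0.75, 1.0]                        # o_proj
down_scales = [0.5, 0.75, 1.0]                     # down_proj
orderings = ["Xwin-LM --> Euryale", "Euryale --> Xwin-LM"]

runs = list(itertools.product(orderings, qk_scales, o_scales, down_scales))
print(len(runs))  # 54
```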
I've only got through the first 15 out of 54 so far, but there is a pattern developing, especially with this prompt:
Write me a short story about the early life of Anton Chigurh. It should be written in first person, be set in his home country and feature a grenade.
o_proj = down_proj = 1 causes really bad/short stories.
o_proj = 0.5 and down_proj = 1 or down_proj = 0.75 causes repetition problems.
o_proj = 0.5 and down_proj = 0.5, or o_proj = 0.75 and down_proj = 0.75, is somewhat coherent but dumb. Even goliath-120b (!) thinks it's East-European or Russian (!!!).
o_proj = 1 and down_proj = 0.5 seems to be the most promising so far.

I find the last bit most interesting, but need to run the full test to be sure... It could be that the MLP layer (with its norm before) expects the distribution caused by the o_proj = 1 setting and then transforms it to put back into the 'residual stream' (but then we just put back 1/2 of what we would without repeating layers, via down_proj = 0.5).
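The intuition about down_proj = 0.5 restoring the original residual-stream contribution can be sketched with a toy example (my sketch in plain NumPy, not mergekit code; it ignores that the second copy really sees the updated stream, so this is only the first-order picture):

```python
import numpy as np

rng = np.random.default_rng(0)
W_down = rng.normal(size=(8, 8))

def block(x, down_scale=1.0):
    # Stand-in for an MLP block whose down_proj weights are scaled.
    return (down_scale * W_down) @ np.tanh(x)

x = rng.normal(size=8)
single = x + block(x)                           # one copy, unscaled
doubled = x + block(x, 0.5) + block(x, 0.5)     # two copies, down_proj = 0.5
print(np.allclose(single, doubled))             # total contribution unchanged
```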
EDIT: Just noticed a lot of the other 9 prompts with o_proj = 1 and down_proj = 0.5 are just instantly writing the <EOS> token and quitting though.
This seems a good test prompt too:
Write me a science fiction story set in the same universe as the 1972 film 'Silent Running'. It should feature similar themes of ecological destruction, murder for the "greater good" and autonomous robots.
The dumber the model, the more likely it is to try to just write a screenplay instead of a story...
It definitely looks like o_proj = 1 and down_proj = 0.5 is best, but it either produces nothing at all or something good. It could be that q_proj = k_proj = sqrt(1/2) is too low and it's double-scaling or something, so I will just have to leave it to run its course (at this rate, probably around 3 days).
Still running, but I think there is a pretty clear pattern for the Xwin-LM --> Euryale alternating-layers merge:

q_proj = k_proj < 1 or o_proj < 1 seem to have very similar effects and just make the model dumber and write shorter stories... At the extremes it causes the model to just write the <EOS> token and quit.

down_proj < 1 increases the length of the stories, makes the model use better language and also introduces interesting "novelties" to the stories. Reducing down_proj too much causes the model to start repeating and/or go insane (sometimes it will even start to prepend its own instructions for the story before starting to write!).

So far q_proj = k_proj = o_proj = 1 (ie: left unchanged) and down_proj = 0.75 consistently produce the best results.
I'm going to leave it to finish to have both orderings (ie: also Euryale --> Xwin-LM alternating) for comparison.

I may try q_proj = k_proj > 1 to see what that does afterwards (pretty sure o_proj should probably be left alone now, and altering q_proj and k_proj largely makes this redundant).
I would say the best settings are at least as good as goliath-120b
in terms of "show, don't tell" writing ability, but not 100% sure if they are dumber or less consistent than goliath-120b
without careful comparison.
These seem to be working really well:

RESIDUAL_SCALE_FACTOR = sqrt(1/2) for each duplicated layer (see reasoning in previous post + empirical evidence that this seems to work in the last post).

QK_ATTENUATION_FACTOR = sqrt(1/2) for the first copy of a layer only.

No repetition problems (with --repeat-penalty 1.1) and no informational loss as far as I can tell.

It seems setting QK_ATTENUATION_FACTOR < 1 for multi-model merges just causes them to get dumber: if the repeated layers for different models are already doing something different, then perturbing it further makes less sense, whereas for self-merges it does make sense to attenuate the score matrix to give the model a "bird's eye view" of the context before homing in for the second copy.
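The "bird's eye view" effect can be illustrated with a toy softmax (my sketch, not mergekit code): scaling q_proj and k_proj each by sqrt(1/2) scales the pre-softmax scores by 1/2, which flattens the attention distribution over the context:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([4.0, 2.0, 0.0])   # toy pre-softmax attention scores
sharp = softmax(scores)              # unscaled attention
flat = softmax(0.5 * scores)         # q and k each scaled by sqrt(1/2)
print(sharp.max() > flat.max())      # True: attenuated attention is flatter
```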
# > mergekit-yaml --verbose --cuda --clone-tensors miqu-test.yaml miqu-test
# > ~/LLMs/llama.cpp/convert.py miqu-test --outfile miqu-test-f16.gguf --outtype f16
# > ~/LLMs/llama.cpp/build/bin/imatrix --chunks 200 -ngl 40 -m miqu-test-f16.gguf -f ~/LLMs/misc/datasets/groups_merged/groups_merged.txt -o miqu-test-f16.imatrix
# > ~/LLMs/llama.cpp/build/bin/quantize --imatrix miqu-test-f16.imatrix miqu-test-f16.gguf miqu-test-q5_K_M.gguf Q5_K_M 12
# The models we are going to use.
const_tag: &BASE_MODEL miqu-1-70b-sf
const_tag: &MODEL1 miqu-1-70b-sf
const_tag: &MODEL2 miqu-1-70b-sf
# The amount to attenuate the Q and K matrices of the *FIRST COPY* of each layer.
# NOTE: This scales the score matrix values by QK_ATTENUATION_FACTOR^2 (eg: sqrt(1/2)^2 = 1/2).
const_tag: &QK_ATTENUATION_FACTOR 0.7071067812 # ≈ sqrt(1/2)
# The amount to scale the contribution to the residual stream (to hopefully reduce overshoot).
const_tag: &RESIDUAL_SCALE_FACTOR 0.7071067812 # ≈ sqrt(1/2)
# Make the first copy *ONLY* take a more "bird's eye view" (ie: pay attention to more of the context).
model1-filter-env: &MODEL1_FILTER_ENV
parameters:
scale:
- filter: q_proj
value: *QK_ATTENUATION_FACTOR
- filter: k_proj
value: *QK_ATTENUATION_FACTOR
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- value: 1.0
# Make the second copy pay attention to the context as before.
model2-filter-env: &MODEL2_FILTER_ENV
parameters:
scale:
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- value: 1.0
slices:
# The first 10 layers are not duplicated.
- sources:
- model: *BASE_MODEL
layer_range: [0, 10]
# BASH: for i in {10..69}; do j=$((i+1)); echo -e " - sources:\n - model: *MODEL1\n layer_range: [$i, $j]\n <<: *MODEL1_FILTER_ENV\n - sources:\n - model: *MODEL2\n layer_range: [$i, $j]\n <<: *MODEL2_FILTER_ENV"; done
- sources:
- model: *MODEL1
layer_range: [10, 11]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [10, 11]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [11, 12]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [11, 12]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [12, 13]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [12, 13]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [13, 14]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [13, 14]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [14, 15]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [14, 15]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [15, 16]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [15, 16]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [16, 17]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [16, 17]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [17, 18]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [17, 18]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [18, 19]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [18, 19]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [19, 20]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [19, 20]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [20, 21]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [20, 21]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [21, 22]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [21, 22]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [22, 23]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [22, 23]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [23, 24]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [23, 24]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [24, 25]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [24, 25]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [25, 26]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [25, 26]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [26, 27]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [26, 27]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [27, 28]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [27, 28]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [28, 29]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [28, 29]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [29, 30]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [29, 30]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [30, 31]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [30, 31]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [31, 32]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [31, 32]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [32, 33]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [32, 33]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [33, 34]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [33, 34]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [34, 35]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [34, 35]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [35, 36]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [35, 36]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [36, 37]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [36, 37]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [37, 38]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [37, 38]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [38, 39]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [38, 39]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [39, 40]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [39, 40]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [40, 41]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [40, 41]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [41, 42]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [41, 42]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [42, 43]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [42, 43]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [43, 44]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [43, 44]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [44, 45]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [44, 45]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [45, 46]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [45, 46]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [46, 47]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [46, 47]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [47, 48]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [47, 48]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [48, 49]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [48, 49]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [49, 50]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [49, 50]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [50, 51]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [50, 51]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [51, 52]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [51, 52]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [52, 53]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [52, 53]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [53, 54]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [53, 54]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [54, 55]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [54, 55]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [55, 56]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [55, 56]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [56, 57]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [56, 57]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [57, 58]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [57, 58]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [58, 59]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [58, 59]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [59, 60]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [59, 60]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [60, 61]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [60, 61]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [61, 62]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [61, 62]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [62, 63]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [62, 63]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [63, 64]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [63, 64]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [64, 65]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [64, 65]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [65, 66]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [65, 66]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [66, 67]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [66, 67]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [67, 68]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [67, 68]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [68, 69]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [68, 69]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [69, 70]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [69, 70]
<<: *MODEL2_FILTER_ENV
# The last 10 layers are not duplicated.
- sources:
- model: *BASE_MODEL
layer_range: [70, 80]
merge_method: passthrough
dtype: float16
# > mergekit-yaml --verbose --cuda goliath-test.yaml goliath-test
# > ~/LLMs/llama.cpp/convert.py goliath-test --outfile goliath-test-f16.gguf --outtype f16
# > ~/LLMs/llama.cpp/build/bin/imatrix --chunks 200 -ngl 40 -m goliath-test-f16.gguf -f ~/LLMs/misc/datasets/groups_merged/groups_merged.txt -o goliath-test-f16.imatrix
# > ~/LLMs/llama.cpp/build/bin/quantize --imatrix goliath-test-f16.imatrix goliath-test-f16.gguf goliath-test-q5_K_M.gguf Q5_K_M 12
# The models we are going to use.
const_tag: &BASE_MODEL Xwin-LM-70B-V0.1
const_tag: &MODEL1 Euryale-1.3-L2-70B
const_tag: &MODEL2 Xwin-LM-70B-V0.1
# The amount to scale the contribution to the residual stream (to hopefully reduce overshoot).
const_tag: &RESIDUAL_SCALE_FACTOR 0.7071067812 # ≈ sqrt(1/2)
model1-filter-env: &MODEL1_FILTER_ENV
parameters:
scale:
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- value: 1.0
model2-filter-env: &MODEL2_FILTER_ENV
parameters:
scale:
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- value: 1.0
slices:
# The first 10 layers are not duplicated.
- sources:
- model: *BASE_MODEL
layer_range: [0, 10]
# BASH: for i in {10..69}; do j=$((i+1)); echo -e " - sources:\n - model: *MODEL1\n layer_range: [$i, $j]\n <<: *MODEL1_FILTER_ENV\n - sources:\n - model: *MODEL2\n layer_range: [$i, $j]\n <<: *MODEL2_FILTER_ENV"; done
- sources:
- model: *MODEL1
layer_range: [10, 11]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [10, 11]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [11, 12]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [11, 12]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [12, 13]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [12, 13]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [13, 14]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [13, 14]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [14, 15]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [14, 15]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [15, 16]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [15, 16]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [16, 17]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [16, 17]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [17, 18]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [17, 18]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [18, 19]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [18, 19]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [19, 20]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [19, 20]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [20, 21]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [20, 21]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [21, 22]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [21, 22]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [22, 23]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [22, 23]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [23, 24]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [23, 24]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [24, 25]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [24, 25]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [25, 26]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [25, 26]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [26, 27]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [26, 27]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [27, 28]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [27, 28]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [28, 29]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [28, 29]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [29, 30]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [29, 30]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [30, 31]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [30, 31]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [31, 32]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [31, 32]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [32, 33]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [32, 33]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [33, 34]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [33, 34]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [34, 35]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [34, 35]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [35, 36]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [35, 36]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [36, 37]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [36, 37]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [37, 38]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [37, 38]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [38, 39]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [38, 39]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [39, 40]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [39, 40]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [40, 41]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [40, 41]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [41, 42]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [41, 42]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [42, 43]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [42, 43]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [43, 44]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [43, 44]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [44, 45]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [44, 45]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [45, 46]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [45, 46]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [46, 47]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [46, 47]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [47, 48]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [47, 48]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [48, 49]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [48, 49]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [49, 50]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [49, 50]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [50, 51]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [50, 51]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [51, 52]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [51, 52]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [52, 53]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [52, 53]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [53, 54]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [53, 54]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [54, 55]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [54, 55]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [55, 56]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [55, 56]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [56, 57]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [56, 57]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [57, 58]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [57, 58]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [58, 59]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [58, 59]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [59, 60]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [59, 60]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [60, 61]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [60, 61]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [61, 62]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [61, 62]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [62, 63]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [62, 63]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [63, 64]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [63, 64]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [64, 65]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [64, 65]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [65, 66]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [65, 66]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [66, 67]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [66, 67]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [67, 68]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [67, 68]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [68, 69]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [68, 69]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [69, 70]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [69, 70]
<<: *MODEL2_FILTER_ENV
# The last 10 layers are not duplicated.
- sources:
- model: *BASE_MODEL
layer_range: [70, 80]
merge_method: passthrough
dtype: float16
Interested to see what you guys make of this:

The miqu-1 self-merge seems just as smart as miqu-120b from the puzzle tests and actually seems really good at writing too (actually better than goliath-120 from my bank of 10 tests!).

The goliath-esque multi-model merge seems to have comparable writing ability to goliath-120 from my bank of 10 tests.

The same templates should work for other self-merges and multi-model merges too (gonna try a wintergoliath-esque version now).
An interesting experiment is to reduce the RESIDUAL_SCALE_FACTOR further (towards 1/2): it can break the models and cause repeating, but it can also add interesting (slightly unhinged) "novelties" to the stories. It also seems to make the output slightly longer.
If you want to merge 3 copies/models then RESIDUAL_SCALE_FACTOR = sqrt(1/3) for all copies should work in theory, and:

QK_ATTENUATION_FACTOR = {sqrt(1/3), sqrt(1/2), 1}
QK_ATTENUATION_FACTOR = {sqrt(1/4), sqrt(1/2), 1}
QK_ATTENUATION_FACTOR = {sqrt(1/2), (1+sqrt(1/2))/2, 1}

all seem like sensible things to try for multi-model merges.
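The sqrt(1/3) choice follows the same back-of-the-envelope variance argument as sqrt(1/2) for two copies: if the n copies contribute approximately independently to the residual stream, scaling each copy's down_proj by sqrt(1/n) keeps the total added variance at roughly one unscaled copy's worth (a sketch under that independence assumption, not anything verified empirically):

```python
import math

def residual_scale(n_copies):
    # n copies each scaled by s add roughly n * s^2 units of variance
    # (assuming independent contributions); s = sqrt(1/n) keeps the
    # total at 1, i.e. one unscaled copy's worth.
    return math.sqrt(1 / n_copies)

for n in (2, 3):
    s = residual_scale(n)
    print(n, round(s, 10), round(n * s**2, 10))  # total stays 1.0
```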
Wow, the goliath-esque-120b and wintergoliath-esque-120b models are REALLY good! Too tired to copy and paste it all now, but they show no sign (at all) of forced positivity, write much longer stories from 0-shot prompts, and generally seem a little smarter.
The miqu-1-esque-120b model seemed to get hurt badly by either the imatrix or using q4_K_M (to get 32k context) vs q5_K_M, and also seems to want to put a positive spin on Grimdark stories now :/ The q5_K_M with no imatrix from yesterday seemed smarter too: the "Write me a short story about the early life of Anton Chigurh featuring a grenade" prompt produced a story that tied into his obsession with chance (a dud grenade!) and correctly inferred he was from Mexico.

I'll have another look at the miqu-1-esque-120b parameters again tomorrow, so I would lay off testing that for now...
The wintergoliath-esque-120b model:
# > mergekit-yaml --verbose --cuda wintergoliath-test.yaml wintergoliath-test
# > ~/LLMs/llama.cpp/convert.py wintergoliath-test --outfile wintergoliath-test-f16.gguf --outtype f16
# > ~/LLMs/llama.cpp/build/bin/imatrix --chunks 200 -ngl 40 -m wintergoliath-test-f16.gguf -f ~/LLMs/misc/datasets/groups_merged/groups_merged.txt -o wintergoliath-test-f16.imatrix
# > ~/LLMs/llama.cpp/build/bin/quantize --imatrix wintergoliath-test-f16.imatrix wintergoliath-test-f16.gguf wintergoliath-esque:120b-q5_K_M.gguf Q5_K_M 22
# The models we are going to use.
const_tag: &BASE_MODEL Xwin-LM-70B-V0.1
const_tag: &MODEL1 WinterGoddess-1.4x-70B-L2
const_tag: &MODEL2 Xwin-LM-70B-V0.1
# The amount to scale the contribution to the residual stream (to hopefully reduce overshoot).
const_tag: &RESIDUAL_SCALE_FACTOR 0.7071067812 # ≈ sqrt(1/2)
model1-filter-env: &MODEL1_FILTER_ENV
parameters:
scale:
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- value: 1.0
model2-filter-env: &MODEL2_FILTER_ENV
parameters:
scale:
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- value: 1.0
slices:
# The first 10 layers are not duplicated.
- sources:
- model: *BASE_MODEL
layer_range: [0, 10]
# BASH: for i in {10..69}; do j=$((i+1)); echo -e " - sources:\n - model: *MODEL1\n layer_range: [$i, $j]\n <<: *MODEL1_FILTER_ENV\n - sources:\n - model: *MODEL2\n layer_range: [$i, $j]\n <<: *MODEL2_FILTER_ENV"; done
- sources:
- model: *MODEL1
layer_range: [10, 11]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [10, 11]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [11, 12]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [11, 12]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [12, 13]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [12, 13]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [13, 14]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [13, 14]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [14, 15]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [14, 15]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [15, 16]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [15, 16]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [16, 17]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [16, 17]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [17, 18]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [17, 18]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [18, 19]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [18, 19]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [19, 20]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [19, 20]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [20, 21]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [20, 21]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [21, 22]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [21, 22]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [22, 23]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [22, 23]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [23, 24]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [23, 24]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [24, 25]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [24, 25]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [25, 26]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [25, 26]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [26, 27]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [26, 27]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [27, 28]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [27, 28]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [28, 29]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [28, 29]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [29, 30]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [29, 30]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [30, 31]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [30, 31]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [31, 32]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [31, 32]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [32, 33]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [32, 33]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [33, 34]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [33, 34]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [34, 35]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [34, 35]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [35, 36]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [35, 36]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [36, 37]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [36, 37]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [37, 38]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [37, 38]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [38, 39]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [38, 39]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [39, 40]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [39, 40]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [40, 41]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [40, 41]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [41, 42]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [41, 42]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [42, 43]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [42, 43]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [43, 44]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [43, 44]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [44, 45]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [44, 45]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [45, 46]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [45, 46]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [46, 47]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [46, 47]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [47, 48]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [47, 48]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [48, 49]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [48, 49]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [49, 50]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [49, 50]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [50, 51]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [50, 51]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [51, 52]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [51, 52]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [52, 53]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [52, 53]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [53, 54]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [53, 54]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [54, 55]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [54, 55]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [55, 56]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [55, 56]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [56, 57]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [56, 57]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [57, 58]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [57, 58]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [58, 59]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [58, 59]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [59, 60]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [59, 60]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [60, 61]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [60, 61]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [61, 62]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [61, 62]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [62, 63]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [62, 63]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [63, 64]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [63, 64]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [64, 65]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [64, 65]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [65, 66]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [65, 66]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [66, 67]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [66, 67]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [67, 68]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [67, 68]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [68, 69]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [68, 69]
<<: *MODEL2_FILTER_ENV
- sources:
- model: *MODEL1
layer_range: [69, 70]
<<: *MODEL1_FILTER_ENV
- sources:
- model: *MODEL2
layer_range: [69, 70]
<<: *MODEL2_FILTER_ENV
# The last 10 layers are not duplicated.
- sources:
- model: *BASE_MODEL
layer_range: [70, 80]
merge_method: passthrough
dtype: float16
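Writing several hundred near-identical slice entries by hand is error-prone, so a small helper script can emit the alternating section plus the non-duplicated tail shown above. This is a hypothetical convenience, not part of mergekit; the anchor names (`*MODEL1`, `*MODEL1_FILTER_ENV`, `*BASE_MODEL`, etc.) are assumed to be defined earlier in the config:

```python
def alternating_slices(start, end, tail_end):
    """Emit the alternating duplicated-layer slices for layers
    [start, end), plus the non-duplicated tail [end, tail_end)."""
    lines = []
    for i in range(start, end):
        # Each layer appears once from each model, in alternation.
        for model, env in (("*MODEL1", "*MODEL1_FILTER_ENV"),
                           ("*MODEL2", "*MODEL2_FILTER_ENV")):
            lines += [
                "  - sources:",
                f"    - model: {model}",
                f"      layer_range: [{i}, {i + 1}]",
                f"      <<: {env}",
            ]
    # The last layers are not duplicated.
    lines += [
        "  - sources:",
        "    - model: *BASE_MODEL",
        f"      layer_range: [{end}, {tail_end}]",
    ]
    return "\n".join(lines)

print(alternating_slices(10, 70, 80))
```

For the config above (middle 60 layers duplicated, last 10 untouched), this emits 121 slice entries.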
Wow, the `goliath-esque-120b` and `wintergoliath-esque-120b` models are REALLY good! Too tired to copy and paste it all now, but they show no sign (at all) of forced positivity, write much longer stories from 0-shot prompts, and generally seem a little smarter.

The `miqu-1-esque-120b` model seemed to get hurt badly by either the imatrix or by using `q4_K_M` (to get 32k context) vs `q5_K_M`, and it also seems to want to put a positive spin on Grimdark stories now :/ The `q5_K_M` with no imatrix from yesterday seemed smarter too: the "Write me a short story about the early life of Anton Chigurh featuring a grenade" prompt produced a story that tied into his obsession with chance (a dud grenade!) and correctly inferred he was from Mexico.

I'll have another look at the `miqu-1-esque-120b` parameters again tomorrow, so I would lay off testing that for now...
I am looking forward to testing it :-) For now, I am still finishing tests on some of the settings you suggested, and I have a few more merge patterns I would like to try. So far, none of the miqu self-merges I have tested has been worthwhile, apart from the very first one from this conversation.

Regarding the importance matrix, I would suggest leaving it out for now, which is what I do in my testing. The reason being, they do influence the model's behaviour. This is something I and a few others have observed: when using an English-based matrix, the model's multilingual capabilities show some noticeable degradation. For multilingual behaviour, it is easy to notice. I expect the same kind of degradation for other capabilities, only it is more difficult to notice. Ideally, the importance matrix should be based on the training or fine-tuning dataset.
I would hold off, as I will try running a grid search for `miqu-1` to see if I can improve the config.
Yeah, I'm redoing all the goliath models for a fair comparison now.
Rerunning all the tests now.
I think I'll make separate Hugging Face model cards for these, as this thread is getting way too many huge files in it and is hard to navigate.
I'll post back here any findings to do with changing the parameters.
I posted all the results here: https://huggingface.co/jukofyork/goliath-esque
The `120b` models only alternate the middle 60 layers, whereas the `123b` models alternate the middle 64 layers. I have also included the output from the original `goliath-120b` and `wintergoliath-123b` merges for comparison.
It still seems to be doing occasional weird stuff. It almost looks like the shorter the prompt, the more it does this.

So it's back to merging blocks of layers now, as there are just too many WTFs with the alternating layers from that set of tests.
A few interesting observations that might help decide what to try next. First, there is no point writing an overlapping self-merge like this:
- sources:
- layer_range: [0, 20]
model: *MODEL
- sources:
- layer_range: [10, 30]
model: *MODEL
- sources:
- layer_range: [20, 40]
model: *MODEL
- sources:
- layer_range: [30, 50]
model: *MODEL
- sources:
- layer_range: [40, 60]
model: *MODEL
- sources:
- layer_range: [50, 70]
model: *MODEL
- sources:
- layer_range: [60, 80]
model: *MODEL
as this is 100% equivalent to this:
- sources:
- layer_range: [0, 10]
model: *MODEL
- sources:
- layer_range: [10, 20]
model: *MODEL
- sources:
- layer_range: [10, 20]
model: *MODEL
- sources:
- layer_range: [20, 30]
model: *MODEL
- sources:
- layer_range: [20, 30]
model: *MODEL
- sources:
- layer_range: [30, 40]
model: *MODEL
- sources:
- layer_range: [30, 40]
model: *MODEL
- sources:
- layer_range: [40, 50]
model: *MODEL
- sources:
- layer_range: [40, 50]
model: *MODEL
- sources:
- layer_range: [50, 60]
model: *MODEL
- sources:
- layer_range: [50, 60]
model: *MODEL
- sources:
- layer_range: [60, 70]
model: *MODEL
- sources:
- layer_range: [60, 70]
model: *MODEL
- sources:
- layer_range: [70, 80]
model: *MODEL
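The equivalence claimed above is easy to verify by expanding both slice lists into the flat layer sequence they produce (a quick sanity check, not mergekit code):

```python
# Verify: the overlapping-window self-merge is 100% equivalent to the
# explicit block-duplication config, as flat layer sequences.
def expand(slices):
    return [layer for start, end in slices for layer in range(start, end)]

overlapping = [(0, 20), (10, 30), (20, 40), (30, 50),
               (40, 60), (50, 70), (60, 80)]
duplicated = ([(0, 10)]
              + [r for i in range(10, 70, 10) for r in ((i, i + 10),) * 2]
              + [(70, 80)])

assert expand(overlapping) == expand(duplicated)
print(len(expand(overlapping)))  # 140 layers in the merged model
```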
- sources:
- layer_range: [0, 16]
model: *MODEL1
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [17, 32]
model: *MODEL1
- sources:
- layer_range: [25, 40]
model: *MODEL2
- sources:
- layer_range: [33, 48]
model: *MODEL1
- sources:
- layer_range: [41, 56]
model: *MODEL2
- sources:
- layer_range: [49, 64]
model: *MODEL1
- sources:
- layer_range: [57, 72]
model: *MODEL2
- sources:
- layer_range: [64, 80]
model: *MODEL1
can be looked at like this:
- sources:
- layer_range: [0, 16]
model: *MODEL1
# Replace layer 16 of MODEL1 with the 16 layers (ie: 8-23) of MODEL2.
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [17, 32]
model: *MODEL1
# Replace layer 32 of MODEL1 with the 15 layers (ie: 25-39) of MODEL2.
- sources:
- layer_range: [25, 40]
model: *MODEL2
- sources:
- layer_range: [33, 48]
model: *MODEL1
# Replace layer 48 of MODEL1 with the 15 layers (ie: 41-55) of MODEL2.
- sources:
- layer_range: [41, 56]
model: *MODEL2
- sources:
- layer_range: [49, 64]
model: *MODEL1
# Replace layer 64 of MODEL1 with the 15 layers (ie: 57-71) of MODEL2.
- sources:
- layer_range: [57, 72]
model: *MODEL2
- sources:
- layer_range: [65, 80]
model: *MODEL1
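The "replace layer N" reading can be derived mechanically: any gap between consecutive MODEL1 slices is the layer being swapped out for the interposed MODEL2 block (counting 0-indexed, so off-by-one from 1-indexed layer numbering):

```python
# Which MODEL1 layers are dropped (i.e. replaced by the interposed
# MODEL2 blocks)? Any gap between consecutive MODEL1 slices.
model1_ranges = [(0, 16), (17, 32), (33, 48), (49, 64), (65, 80)]

dropped = [layer
           for (_, prev_end), (next_start, _) in zip(model1_ranges,
                                                     model1_ranges[1:])
           for layer in range(prev_end, next_start)]

print(dropped)  # [16, 32, 48, 64] (0-indexed)
```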
- sources:
- layer_range: [0, 16]
model: *MODEL1
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [16, 32]
model: *MODEL1
- sources:
- layer_range: [24, 40]
model: *MODEL2
- sources:
- layer_range: [32, 48]
model: *MODEL1
- sources:
- layer_range: [40, 56]
model: *MODEL2
- sources:
- layer_range: [48, 64]
model: *MODEL1
- sources:
- layer_range: [56, 72]
model: *MODEL2
- sources:
- layer_range: [64, 80]
model: *MODEL1
can be looked at like this:
- sources:
- layer_range: [0, 16]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [16, 32]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [24, 40]
model: *MODEL2
- sources:
- layer_range: [32, 48]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [40, 56]
model: *MODEL2
- sources:
- layer_range: [48, 64]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [56, 72]
model: *MODEL2
- sources:
- layer_range: [64, 80]
model: *MODEL1
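Expanding this interleave into a flat (model, layer) list makes the resulting depth and the seams between slices explicit (a quick sketch; `MODEL1`/`MODEL2` stand for the anchors in the config above):

```python
# Expand the "squeeze in 16 layers" interleave into a flat
# (model, layer) list.
slices = [("MODEL1", 0, 16), ("MODEL2", 8, 24), ("MODEL1", 16, 32),
          ("MODEL2", 24, 40), ("MODEL1", 32, 48), ("MODEL2", 40, 56),
          ("MODEL1", 48, 64), ("MODEL2", 56, 72), ("MODEL1", 64, 80)]

flat = [(m, layer) for m, s, e in slices for layer in range(s, e)]

print(len(flat))    # 144 layers in total
print(flat[15:17])  # seam: [('MODEL1', 15), ('MODEL2', 8)]
```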
`miqu-1` merges seem amazingly resilient to writing Grimdark stories:
Only time would tell as Cassie, Alpha, and their silent companions fought against insurmountable odds to preserve the last vestiges of hope for a better future.
:vomiting_face:
and:
Write me a short story about the early life of Anton Chigurh. It should be written in first person, be set in his home country and feature a grenade.
90% of the time ends up with him doing something good or even not using the grenade???
:facepalm:
Wow... this is some thread! I'm learning a lot.
I did / have tried something similar to this, except I looked at it a bit differently. I have two 13B models - both creative. This experiment was set up to see if replacing layers in the model with layers from another model in series order (but not repeating any layers from the "host model") would impact results.
The theory -> Like model layers should be somewhat interchangeable with other like layers.
IE: Model 1 - Layer 1, Model 2 - Layer 2, Model 1 - Layer 3, Model 2 - Layer 4, and so on.
With the following caveats:
1 - The first 10 layers are ONE model.
2 - The middle is a sequence of 2-3 layer chunks (keeping layer position) with only 1 layer of overlap intermittently.
3 - The end of the model repeats the final 2 layers of both models.
(based on reading a number of papers on layer position and importance)
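My reading of that recipe can be sketched as a slice-plan generator. Everything here is an assumption for illustration (40 total layers, 3-layer chunks, models labelled A/B), not the exact settings used:

```python
import itertools

def interleaved_plan(total_layers=40, head=10, chunk=3, tail=2):
    """Sketch: head from one model, alternating short chunks with a
    1-layer overlap in the middle, final layers of both models at the
    end. All parameters are illustrative assumptions."""
    plan = [("A", 0, head)]                      # 1) head: one model only
    models = itertools.cycle(["B", "A"])
    pos = head
    while pos < total_layers - tail:             # 2) alternate 3-layer chunks
        end = min(pos + chunk, total_layers - tail)
        plan.append((next(models), pos, end))
        # Step back one layer to create the intermittent 1-layer overlap.
        pos = end - 1 if end < total_layers - tail else end
    plan.append(("A", total_layers - tail, total_layers))  # 3) repeated tail
    plan.append(("B", total_layers - tail, total_layers))
    return plan

for model, start, end in interleaved_plan():
    print(f"{model}: layers [{start}, {end})")
```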
The goal was to hit 60 layers... but Colab blew up. I got to 47 layers. The next attempt will be on a local machine.
I have made GGUFs of it, and they work well - but they do break on occasion (it needs some "healing"). That being said, this version is sometimes scary-level creative (the goal for this merge).
This is at my repo.
Here is what the merge file looks like:
slices:
I know this is from April, but one more configuration which might make sense across all layers is:
`down_proj = sqrt(1/2)`, `q_proj = k_proj = sqrt(sqrt(1/2))`
Leaving the rest unchanged, we get:

1 * 0.84089641525 * 0.84089641525 * 0.70710678118 = 0.5

for the transformation of the initial magnitude.
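Assuming the factors compose multiplicatively as described (one factor from each of q_proj and k_proj, one from down_proj), the arithmetic checks out:

```python
import math

down_proj = math.sqrt(1 / 2)      # ~0.70710678118
qk = math.sqrt(math.sqrt(1 / 2))  # ~0.84089641525

# q_proj and k_proj each contribute one factor; down_proj another.
total = 1 * qk * qk * down_proj
print(total)  # ~0.5 (up to floating point)
```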
Has anyone tried downscaling the K and/or Q matrices for repeated layers in franken-merges? This should act like changing the temperature of the softmax and effectively smooth the distribution:
Hopfield Networks is All You Need https://arxiv.org/abs/2008.02217 https://ml-jku.github.io/hopfield-layers/
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models https://arxiv.org/abs/2310.17086
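The temperature analogy can be checked numerically: scaling q_proj and k_proj each by s multiplies the pre-softmax logits by s², which is softmax at temperature 1/s² and so flattens the attention distribution when s < 1. A toy scalar example (not real attention code):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def entropy(ps):
    return -sum(p * math.log(p) for p in ps)

# Toy scalar q/k per key position; s is the downscale applied to both
# q_proj and k_proj of a repeated layer.
q, ks, s = 2.0, [1.0, 0.4, -0.3], math.sqrt(0.5)

logits = [q * k for k in ks]
scaled = [(s * q) * (s * k) for k in ks]  # == [s * s * l for l in logits]

# Scaling q and k by s multiplies the logits by s^2 ...
assert all(abs(a - s * s * b) < 1e-12 for a, b in zip(scaled, logits))
# ... which flattens (raises the entropy of) the attention distribution.
assert entropy(softmax(scaled)) > entropy(softmax(logits))
```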
Empirically I've found repeating large blocks does seem to make models "confidently wrong" - stacking two full copies of `deepseek-coder` or `miqu-1` shows this phenomenon really well.