Open jukofyork opened 4 months ago
This is an interesting idea! I've played around with downscaling o_proj/down_proj to just change the magnitude of the residual from layers but I'd definitely like to see what happens here.
@cg123 @shamanez I don't know if you've seen 3blue1brown's last couple of videos:
https://youtube.com/c/3blue1brown
but his interpretation of what the transformer is doing makes me really think the downscaling is even more important.
This gives a 3rd way to consider how stacking multiple layers could improve models, ie: more fine-grained movement within the high dimensional vector space leading to a more accurate final position!
The W_K and W_Q matrices both being scaled by 1/sqrt(2) would be equivalent to halving the score matrix values (which may or may not be the best scale factor, but probably a good start).
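A quick numeric check of that claim, as a minimal sketch with random toy matrices (the sizes and the helper names `dot`, `matvec`, `scaled` and `score` are made up for illustration, not taken from any library):

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def scaled(W, s):
    return [[s * w for w in row] for row in W]

random.seed(0)
d_model, d_head = 8, 4  # toy sizes, not the real model dimensions
W_q = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
W_k = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
x_i = [random.gauss(0, 1) for _ in range(d_model)]
x_j = [random.gauss(0, 1) for _ in range(d_model)]

def score(Wq, Wk):
    # One entry of the score matrix: (W_q x_i) . (W_k x_j) / sqrt(d_head)
    return dot(matvec(Wq, x_i), matvec(Wk, x_j)) / math.sqrt(d_head)

s = 1 / math.sqrt(2)
base = score(W_q, W_k)
both = score(scaled(W_q, s), scaled(W_k, s))  # split the factor evenly
one_sided = score(scaled(W_q, 0.5), W_k)      # put it all on W_q

print(base, both, one_sided)
```

Both variants give a score of exactly half the original, so the choice between them only affects how the weight magnitudes are distributed, not the attention pattern.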
The interesting point the 3blue1brown video raises is that the W_V matrix should probably be halved too. If it isn't then his sum of vectors interpretation will overshoot even if you deattenuate the sharpness of the softmax!
BUT: This doesn't account for the MLP layer that follows the attention block and the downscaled outputs of softmax * V might be too weak to push through the non-linearity of the MLP (or more likely just be plain wrong for what the MLP is expecting to see).
So perhaps the W_V matrix should be left alone and the norm layer be downscaled instead?
BUT: From some reading a few weeks ago, IIRC the current LLMs actually apply their norm layers before the operations for some reason (compared to the original "Attention is all you need" paper) and that's why there is the extra norm layer right at the end that I couldn't work out the point of having before...
So it's possible the residuals and/or the norm layers will both need downscaling so the vector addition interpretation is maintained?
Also the "confidently wrong" phenomenon can just as easily be explained by the overshooting in the vector space as it can by the sharpness of the softmax output (probably even more so!).
I've just realised there is a super easy way to do this using the filters: just make a copy of the model and zero all the weights out and then use the linear merge method.
This would also be useful for a method described in a paper I can't find atm (related to the Solar 10.7b paper but not it) where a single transformer block is duplicated and placed after the original, and then 2 of the weight matrices in the block (one before and one after IIRC) are set to zero with the intention of making the whole block just perform the identity function and pass straight through. The idea being that this allows (especially instruction tuned) models to avoid catastrophic forgetting (somebody used this to upscale Mistral but again I can't find that model now).
Obviously it would be better if we could somehow specify to interpolate towards zero in the yaml file rather than have to make a zeroed model for this! :)
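To make the "zeroed copy + linear merge" trick concrete, here is a minimal sketch on plain Python lists. The `linear_merge` function is a hypothetical stand-in for a weighted-average merge, and the `normalize` flag (dividing by the sum of the weights) is an assumption about how such a merge would behave, not a statement about mergekit's implementation:

```python
def linear_merge(tensors, weights, normalize=True):
    # Weighted sum of equally shaped tensors (flat lists here).
    # With normalize=True the weights are rescaled to sum to 1.
    total = sum(weights) if normalize else 1.0
    n = len(tensors[0])
    return [sum(w * t[i] for t, w in zip(tensors, weights)) / total
            for i in range(n)]

original = [0.8, -1.2, 0.4]
zeroed = [0.0] * len(original)  # the "zeroed copy" of the same tensor

# Interpolating halfway towards zero is the same as scaling by 0.5.
merged = linear_merge([original, zeroed], weights=[0.5, 0.5])
print(merged)
```

In other words the zeroed model only ever contributes zeros, so the merge reduces to multiplying the original tensor by its (normalized) weight, which is exactly what a per-tensor scale factor does directly.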
@cg123, I was just looking through the source of passthrough.py and its test and noticed you have added a scale parameter!!!
Is this the correct use of it:
dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 10]
    model: miqu-1-70b-sf
- sources:
  - layer_range: [10, 11]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [11, 20]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [10, 30]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [20, 40]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [30, 50]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [40, 60]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [50, 70]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [60, 70]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [70, 71]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: input_layernorm
        value: 0.5
      - value: 1
- sources:
  - layer_range: [71, 80]
    model: miqu-1-70b-sf
Do any of the models ever use an affine projection in the transformer part (ie: q_proj.bias or k_proj.bias) or is it safe to assume not?
https://github.com/arcee-ai/mergekit/tree/main/mergekit/_data/architectures https://github.com/arcee-ai/mergekit/blob/main/mergekit/_data/architectures/llama.json
I can't see any .bias values, but can see the input_layernorm so this makes the yaml file a bit more complex.
Really excited to try this now!!!
If this actually does anything useful then I will write some C++ to try to explore the scale factor over lots of random inputs:
Both q_proj and k_proj should be scaled by 1/sqrt(k) like the above example, or else quantized versions of the model are going to suffer badly due to the uneven distribution of information if we just scale q_proj or k_proj by 0.5.
Also, if anybody knows the name of this paper, I will make a couple of examples for both this "attenuated passthrough" idea above and the "identity duplication" idea from this paper using the new scale parameter.
Added some logging to passthrough.py and it seems to be working:
Warmup loader cache: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 540.64it/s]
INFO:root:Planning operations
5%|██████▌ | 213/4514 [00:01<00:22, 195.29it/s]INFO:root:Writing shard #1 to disk
6%|████████▉ | 289/4514 [00:27<08:20, 8.44it/s]INFO:root:scale: 1.0
7%|█████████▍ | 308/4514 [01:05<41:42, 1.68it/s]INFO:root:scale: 1.0
7%|█████████▌ | 311/4514 [01:05<39:33, 1.77it/s]INFO:root:scale: 1.0
7%|█████████▌ | 315/4514 [01:40<1:33:39, 1.34s/it]INFO:root:scale: 1.0
7%|█████████▋ | 319/4514 [01:41<1:22:46, 1.18s/it]INFO:root:scale: 1.0
7%|█████████▊ | 322/4514 [02:15<2:43:13, 2.34s/it]INFO:root:scale: 1.0
7%|█████████▉ | 326/4514 [02:15<2:14:41, 1.93s/it]INFO:root:scale: 1.0
INFO:root:scale: 1.0
INFO:root:scale: 1.0
7%|██████████▏ | 336/4514 [02:26<1:52:01, 1.61s/it]INFO:root:scale: 1.0
8%|██████████▎ | 341/4514 [02:26<1:28:40, 1.27s/it]INFO:root:scale: 1.0
INFO:root:scale: 1.0
8%|██████████▌ | 349/4514 [02:28<1:02:57, 1.10it/s]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.7071067812
8%|██████████▉ | 355/4514 [02:28<49:02, 1.41it/s]INFO:root:scale: 0.7071067812
8%|██████████▉ | 359/4514 [02:39<1:15:02, 1.08s/it]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.5
INFO:root:scale: 1.0
9%|████████████▏ | 400/4514 [05:05<2:47:09, 2.44s/it]INFO:root:Writing shard #2 to disk
10%|█████████████ | 429/4514 [07:45<4:07:27, 3.63s/it]INFO:root:scale: 1.0
10%|█████████████ | 431/4514 [08:30<6:58:15, 6.15s/it]INFO:root:scale: 1.0
10%|█████████████▏ | 435/4514 [08:30<5:02:39, 4.45s/it]INFO:root:scale: 1.0
10%|█████████████▎ | 438/4514 [09:11<7:23:44, 6.53s/it]INFO:root:scale: 1.0
10%|█████████████▍ | 442/4514 [09:11<5:09:26, 4.56s/it]INFO:root:scale: 1.0
10%|█████████████▌ | 445/4514 [09:26<5:16:07, 4.66s/it]INFO:root:scale: 1.0
10%|█████████████▋ | 450/4514 [09:26<3:20:03, 2.95s/it]INFO:root:scale: 1.0
10%|█████████████▋ | 453/4514 [09:27<2:38:44, 2.35s/it]INFO:root:scale: 1.0
INFO:root:scale: 0.7071067812
10%|█████████████▉ | 459/4514 [09:28<1:38:44, 1.46s/it]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.7071067812
10%|██████████████▏ | 466/4514 [09:38<1:36:08, 1.42s/it]INFO:root:scale: 0.7071067812
10%|██████████████▎ | 471/4514 [09:38<1:08:32, 1.02s/it]INFO:root:scale: 1.0
11%|██████████████▍ | 474/4514 [10:16<3:38:51, 3.25s/it]INFO:root:scale: 1.0
11%|██████████████▍ | 477/4514 [10:16<2:37:11, 2.34s/it]INFO:root:scale: 1.0
11%|██████████████▌ | 480/4514 [10:57<6:19:56, 5.65s/it]INFO:root:scale: 1.0
11%|██████████████▋ | 483/4514 [10:57<4:29:28, 4.01s/it]INFO:root:scale: 1.0
11%|██████████████▊ | 487/4514 [11:37<6:52:04, 6.14s/it]INFO:root:scale: 1.0
11%|██████████████▉ | 493/4514 [11:38<3:36:45, 3.23s/it]INFO:root:scale: 1.0
INFO:root:scale: 1.0
11%|███████████████▏ | 499/4514 [11:38<2:00:05, 1.79s/it]INFO:root:scale: 1.0
11%|███████████████▏ | 502/4514 [11:48<2:24:39, 2.16s/it]INFO:root:scale: 1.0
11%|███████████████▎ | 506/4514 [11:48<1:39:32, 1.49s/it]INFO:root:scale: 1.0
11%|███████████████▍ | 509/4514 [11:49<1:21:36, 1.22s/it]INFO:root:scale: 1.0
INFO:root:scale: 0.7071067812
11%|███████████████▊ | 515/4514 [11:50<52:05, 1.28it/s]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.7071067812
12%|███████████████▊ | 522/4514 [12:01<1:10:35, 1.06s/it]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.5
INFO:root:scale: 0.5
12%|████████████████▍ | 534/4514 [12:01<35:28, 1.87it/s]INFO:root:scale: 1.0
12%|████████████████▎ | 536/4514 [12:34<2:29:23, 2.25s/it]INFO:root:scale: 1.0
12%|████████████████▎ | 539/4514 [12:34<2:01:58, 1.84s/it]INFO:root:scale: 1.0
12%|████████████████▌ | 544/4514 [13:12<3:51:04, 3.49s/it]INFO:root:scale: 1.0
INFO:root:scale: 1.0
12%|████████████████▋ | 550/4514 [13:50<5:10:36, 4.70s/it]INFO:root:scale: 1.0
12%|████████████████▊ | 553/4514 [13:50<3:59:01, 3.62s/it]INFO:root:Writing shard #3 to disk
12%|████████████████▊ | 554/4514 [14:16<6:29:00, 5.89s/it]INFO:root:scale: 1.0
INFO:root:scale: 1.0
12%|█████████████████ | 562/4514 [14:16<3:02:17, 2.77s/it]INFO:root:scale: 1.0
13%|█████████████████▏ | 566/4514 [14:28<3:02:54, 2.78s/it]INFO:root:scale: 1.0
13%|█████████████████▎ | 569/4514 [14:28<2:21:44, 2.16s/it]INFO:root:scale: 1.0
13%|█████████████████▍ | 573/4514 [14:30<1:49:35, 1.67s/it]INFO:root:scale: 1.0
INFO:root:scale: 0.7071067812
13%|█████████████████▌ | 578/4514 [14:31<1:16:57, 1.17s/it]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.7071067812
13%|█████████████████▊ | 585/4514 [14:43<1:31:49, 1.40s/it]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.5
INFO:root:scale: 0.5
13%|██████████████████▍ | 597/4514 [14:44<46:25, 1.41it/s]INFO:root:scale: 1.0
13%|██████████████████▏ | 599/4514 [15:24<3:03:37, 2.81s/it]INFO:root:scale: 1.0
13%|██████████████████▎ | 603/4514 [15:25<2:21:02, 2.16s/it]INFO:root:scale: 1.0
13%|██████████████████▍ | 606/4514 [15:34<2:32:20, 2.34s/it]INFO:root:scale: 1.0
14%|██████████████████▌ | 611/4514 [15:34<1:43:36, 1.59s/it]INFO:root:scale: 1.0
14%|██████████████████▋ | 614/4514 [15:36<1:30:01, 1.38s/it]INFO:root:scale: 1.0
INFO:root:scale: 0.7071067812
14%|██████████████████▊ | 620/4514 [15:37<1:01:34, 1.05it/s]INFO:root:scale: 0.7071067812
INFO:root:scale: 0.7071067812
14%|███████████████████ | 627/4514 [15:48<1:14:57, 1.16s/it]INFO:root:scale: 0.7071067812
14%|███████████████████▍ | 632/4514 [15:48<54:06, 1.20it/s]INFO:root:scale: 1.0
It would be nice to print the tensor names to be 100% sure though, but my Python skills are pretty much zero and runtime type-checking of long-runtime jobs reminds me exactly why... :frowning:
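For reference, this is roughly the check I'd want the log to show, as a standalone sketch. The `resolve_scale` helper is hypothetical and assumes the filters are matched as substrings of the tensor name with the first matching rule winning; I have not confirmed that this is exactly what mergekit does internally. The tensor names follow the usual Hugging Face Llama naming:

```python
def resolve_scale(tensor_name, rules, default=1.0):
    # Return the value of the first rule whose filter is a substring of
    # the tensor name; a rule without a filter acts as the catch-all.
    for rule in rules:
        pattern = rule.get("filter")
        if pattern is None or pattern in tensor_name:
            return rule["value"]
    return default

rules = [
    {"filter": "q_proj", "value": 0.7071067812},
    {"filter": "k_proj", "value": 0.7071067812},
    {"value": 1},
]

names = [
    "model.layers.10.input_layernorm.weight",
    "model.layers.10.self_attn.q_proj.weight",
    "model.layers.10.self_attn.k_proj.weight",
    "model.layers.10.self_attn.v_proj.weight",
    "model.layers.10.mlp.down_proj.weight",
]

for name in names:
    print(f"{name}: scale={resolve_scale(name, rules)}")
```

Printing the name next to the scale like this would make it obvious whether the 0.7071067812 lines in the log really belong to q_proj and k_proj.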
I've double checked and am fairly sure this is the correct set of scale parameters too:
We want to scale W_q and W_k so the dot-product ends up halved (for now) and, irritatingly, it's the input norm weights of the next block that need to be halved so as to not overshoot (I think; for now).
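A way to sanity-check how these factors combine, as a toy sketch. It assumes a Llama-style block where input_layernorm is an RMSNorm whose output feeds q_proj and k_proj of the same block; the function names and sizes are made up for illustration:

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def rms_norm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def attention_score(x_i, x_j, norm_w, W_q, W_k, norm_scale=1.0, qk_scale=1.0):
    # norm_scale multiplies the norm weights; qk_scale multiplies the
    # outputs of q_proj and k_proj (same as scaling W_q and W_k).
    w = [norm_scale * g for g in norm_w]
    h_i, h_j = rms_norm(x_i, w), rms_norm(x_j, w)
    q = [qk_scale * v for v in matvec(W_q, h_i)]
    k = [qk_scale * v for v in matvec(W_k, h_j)]
    return dot(q, k)

random.seed(1)
d_model, d_head = 6, 3  # toy sizes
W_q = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
W_k = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
norm_w = [1.0] * d_model
x_i = [random.gauss(0, 1) for _ in range(d_model)]
x_j = [random.gauss(0, 1) for _ in range(d_model)]

base = attention_score(x_i, x_j, norm_w, W_q, W_k)
qk_only = attention_score(x_i, x_j, norm_w, W_q, W_k, qk_scale=2 ** -0.5)
combined = attention_score(x_i, x_j, norm_w, W_q, W_k,
                           norm_scale=0.5, qk_scale=2 ** -0.5)
print(qk_only / base, combined / base)
```

Under these assumptions, halving the norm weights halves the input to both q_proj and k_proj, so the score picks up an extra factor of 0.25 on top of the 0.5 from the q/k scaling.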
It will probably take all afternoon to test what this does, but I think this self-merge of miqu-1 is a good candidate: from past testing against it, there were some definite holes in its knowledge and it would hallucinate names more.
The only other merged model I have used extensively is goliath-120b so will see if I can get a copy of its parent models overnight and try this with that.
Just realised I'm double scaling by altering the input norm like that... Gonna leave that for now:
dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 10]
    model: miqu-1-70b-sf
- sources:
  - layer_range: [10, 20]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [10, 30]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [20, 40]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [30, 50]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [40, 60]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [50, 70]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [60, 70]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [70, 80]
    model: miqu-1-70b-sf
and see what this does to start with.
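Since these configs are very repetitive, the slice list could also be generated rather than written by hand. A minimal sketch follows: `build_slices` and its `edge`/`stride` parameters are names I made up, and it only builds a plain dict mirroring the yaml layout above, which could then be dumped with any YAML library:

```python
SCALE = 2 ** -0.5  # 1/sqrt(2), written as 0.7071067812 in the yaml

def build_slices(n_layers, edge, stride, model="miqu-1-70b-sf"):
    # Untouched blocks at both ends, attenuated overlapping windows between.
    window = 2 * stride
    scaled = [(edge, edge + stride)]
    scaled += [(s, s + window)
               for s in range(edge, n_layers - edge - window + 1, stride)]
    scaled.append((n_layers - edge - stride, n_layers - edge))

    def source(lo, hi, attenuate):
        src = {"layer_range": [lo, hi], "model": model}
        if attenuate:
            src["parameters"] = {"scale": [
                {"filter": "q_proj", "value": SCALE},
                {"filter": "k_proj", "value": SCALE},
                {"value": 1},
            ]}
        return {"sources": [src]}

    slices = [source(0, edge, False)]
    slices += [source(lo, hi, True) for lo, hi in scaled]
    slices.append(source(n_layers - edge, n_layers, False))
    return {"dtype": "float16", "merge_method": "passthrough", "slices": slices}

config = build_slices(80, edge=10, stride=10)
ranges = [s["sources"][0]["layer_range"] for s in config["slices"]]
total_layers = sum(hi - lo for lo, hi in ranges)
print(ranges, total_layers)
```

With `edge=10, stride=10` this reproduces the layer ranges of the config above, and other interleave patterns can be tried by changing the two numbers.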
I really need to know the architecture of exactly what the MLP layers look like: it could be the *.mlp.down_proj.weight values that should be downscaled, but I've no idea if this could be before some non-linearity (in which case it would also be a bad idea) or just a linear transform to get back to the embedding dimension again.
Either way I think this is less interesting than the scaling of the score matrix via W_q and W_k anyway.
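For what it's worth, in the Llama-style MLP (which I'm assuming miqu uses, as a Llama-2 derivative) the non-linearity sits on the gate branch and down_proj is the last, purely linear step: down_proj(silu(gate_proj(x)) * up_proj(x)). A toy sketch with made-up sizes and helper names:

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(z):
    return z / (1.0 + math.exp(-z))

def llama_style_mlp(x, W_gate, W_up, W_down):
    # The non-linearity is applied before down_proj, never after it.
    hidden = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    return matvec(W_down, hidden)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def times(W, s):
    return [[s * w for w in row] for row in W]

random.seed(0)
d_model, d_ff = 4, 6  # toy sizes
W_gate, W_up = rand_matrix(d_ff, d_model), rand_matrix(d_ff, d_model)
W_down = rand_matrix(d_model, d_ff)
x = [random.gauss(0, 1) for _ in range(d_model)]

full = llama_style_mlp(x, W_gate, W_up, W_down)
half_down = llama_style_mlp(x, W_gate, W_up, times(W_down, 0.5))
half_gate = llama_style_mlp(x, times(W_gate, 0.5), W_up, W_down)
print(full, half_down, half_gate)
```

Under that assumption, scaling down_proj by 0.5 halves the MLP's contribution to the residual exactly, whereas scaling gate_proj changes the output in a non-proportional way because it passes through the SiLU.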
Holy crap this does work!!! Using temperature = 0, repetition penalty = 1 and no system prompt:
### USER
Two people are in a room alone with a machine that requires two people to operate. One person is operating the machine. The machine is functioning properly. What is the other person doing?
### ASSISTANT
The other person is also operating the machine. The statement "requires two people to operate" implies that both individuals are involved in its operation.
### USER
Sally has 3 sisters. Each of her sisters has one brother. How many sisters does Sally have?
### ASSISTANT
Sally has 3 sisters. The information about their brothers is not relevant to the question. So, Sally still has 3 sister.
(this was a copy and paste fail from me, but he didn't fall for it! :grin:)
### USER
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
### ASSISTANT
Sally has 2 sisters.
Here's the reasoning:
1. Sally is a girl and has 3 brothers.
2. Each of her brothers has 2 sisters, which means each brother has one sister (Sally) in common with each other, and one more sister in addition to Sally.
3. Since there are 3 brothers, this accounts for 2 additional sisters (1 from each brother's perspective) + Sally herself, making it a total of 3 girls including Sally herself.
So, Sally has 2 sisters in total (excluding herself).
https://old.reddit.com/r/LocalLLaMA/comments/15kntrq/i_asked_a_simple_riddle_to_30_models/
### USER
Solve this family relationship quiz: Marie is Emma's parent. Emma is Carl's parent. Carl is Timothy's parent. Timothy is male. Who's Timothy to Marie?
### ASSISTANT
Based on the information provided:
1. Marie is Emma's parent.
2. Emma is Carl's parent.
3. Carl is Timothy's parent.
4. Timothy is male.
From point 1, we know that Marie is either the mother or father of Emma (gender not specified).
From point 2, we learn that Emma is the mother or father of Carl (again, gender not specified).
From point 3, we find out that Carl is Timothy's parent. Since Timothy is male, this means that either Carl is his father or his mother. However, since Timothy has already been established as male in the problem statement, it follows that Carl must be Timothy's father.
Now let's put all these relationships together:
1. Marie -> Emma (parent-child)
2. Emma -> Carl (parent-child)
3. Carl -> Timothy (father-son)
Since we know that Marie is Emma's parent, and Emma is the parent of Carl, this makes Marie either the grandmother or grandfather of Timothy. Therefore, Timothy is Marie's great-grandchild or great-grandson/daughter depending on their gender. However, since Timothy has been established as male in the problem statement, it follows that he is Marie's great-grandson.
In conclusion, Timothy is Marie's great-grandson.
Now for the slightly more problematic puzzles this one is getting wrong. The old miqu-120b could sometimes solve stuff like this if you added "Think step-by-step as this is a tricky problem:" on to the end:
### USER
Solve this family relationship quiz: Betty is Brenda's parent. Betty is Joe's parent. Joe is Charles' parent. Brenda is female. Who's Brenda to Charles? Think step-by-step as this is a tricky problem:
### ASSISTANT
Let's break this down step by step:
1. Betty is Brenda's parent. This means either Betty is Brenda's mother or father.
2. Betty is Joe's parent as well, so either Betty is Joe's mother or father too.
3. Joe is Charles' parent, which means Joe is either Charles' mother or father.
4. Brenda and Joe are siblings since they share a common parent (Betty).
5. Since Brenda is female, neither she nor her sibling (Joe) can be Charles' father, so Joe must be Charles' mother.
6. Therefore, Brenda is Charles' aunt (either his aunt by blood or his half-aunt if Joe is his/her step-parent).
So Brenda is Charles' aunt (either biologically or through marriage).
### USER
Solve this family relationship quiz: Willie is Sara's parent. Willie is Zachary's parent. Zachary is Kyle's parent. Sara is female. Who's Sara to Kyle? Think step-by-step as this is a tricky problem:
### ASSISTANT
Let's break this down step by step:
1. Willie is Sara's parent, so Willie is Sara's father or mother.
2. Willie is also Zachary's parent, so Willie is Zachary's father or mother as well.
3. Since Willie cannot be both the father and mother of Sara and Zachary (as they are not twins), we can conclude that either Sara and Zachary are siblings or cousins.
4. However, since there is no mention of any other siblings or spouses in the question, it's safe to assume that Sara and Zachary are siblings.
5. Now, we know that Zachary is Kyle's parent, so either Kyle is Zachary's son or daughter.
6. Since Sara and Zachary are siblings, this makes Sara either Kyle's aunt or aunt-in-law (if she married into the family).
7. However, since there is no mention of any marriages in the question, we can rule out the possibility of Sara being Kyle's mother-in-law.
8. Therefore, Sara must be Kyle's aunt.
So, Sara is Kyle's aunt.
### USER
Solve this family relationship quiz: Nathan is Timothy's parent. Nathan is Aaron's parent. Aaron is Edward's parent. Timothy is male. Who's Timothy to Edward? Think step-by-step as this is a tricky problem:
### ASSISTANT
Let's break this down step by step:
1. Nathan is Timothy's parent. This means that either Nathan is Timothy's father or mother.
2. Nathan is also Aaron's parent, which means that either Nathan is Aaron's father or mother as well.
3. Aaron is Edward's parent, so Aaron could be Edward's father or mother.
4. We know that Timothy is male, so Nathan cannot be Timothy's mother because same-sex marriage isn't mentioned in the question and we are assuming heterosexual relationships for simplicity here.
5. Since Timothy is male, Nathan must be Timothy's father (not mother).
6. Now, if Aaron is Edward's parent, then Aaron cannot be a woman because Timothy has already been established as male. So Aaron must be Edward's father.
7. Therefore, since Nathan is both Timothy and Aaron's parent, this means that Timothy and Aaron are siblings (either twins or close in age).
8. Since Aaron is Edward's father, then Timothy must be Edward's uncle (either his maternal or paternal uncle).
So to answer the question "Who's Timothy to Edward?" The answer is that Timothy is either Edward's maternal or paternal uncle.
This also doesn't seem to "lose" the information like the original miqu-120b did:
### USER
Can you summarise the ideas behind Pollack's 1990 "Recursive distributed representations" paper?
### ASSISTANT
Sure, I'd be happy to summarize the main ideas in Pollack's 1990 paper "Recursive Distributed Representations."
In this influential paper, Jordan Pollack introduced the idea of using connectionist networks with recurrent connections to model hierarchical representations of natural language. The key insight was that by recursively composing simpler representations into more complex ones, a network could learn to represent complex structures such as sentences or even entire stories.
The main contributions of this paper include:
1. **Recursive compositionality**: Pollack proposed that language understanding and generation can be modeled using a connectionist architecture that recursively composes simpler representations into more complex ones. This allows the network to learn hierarchical representations of natural language, such as sentences or even entire stories.
2. **Connectionist networks with recurrent connections**: The model uses recurrent connections to allow information from previous time steps to influence current processing, allowing the network to maintain context over time and build up complex representations incrementally.
3. **Distributed representations**: Pollack proposed that distributed representations can capture the meaning of words, phrases, and even entire stories by distributing their meanings across many hidden units in a connectionist network.
4. **Recursive compositionality**: The model recursively composes simpler representations into more complex ones, allowing the network to learn hierarchical structures such as sentences or even entire stories.
Overall, Pollack's paper laid the groundwork for modern deep learning approaches to natural language processing and has had a lasting impact on the field of artificial intelligence research.
This is a question I have definitely asked the miqu-120b merge and he didn't know that Pollack's first name was Jordan and hallucinated something else (even though the original miqu-1 model did know his name).
He doesn't seem to have mentioned the key insight from the paper - compressing sequences to a fixed length vector (but I don't know if the original miqu-1 knew this either).
THIS IS NUTS!!!
So now the question is: if we just stack two copies of miqu-1 on top of each other, does it still become really verbose and "confidently wrong"?
Actually the more I think about this the more I think maybe we don't actually need to mess with the norm(s) or mlp.down_proj.weight: just by de-attenuating the softmax via the score matrix we will be mixing more varied values from V, and likely these will cancel each other out in different directions and be less prone to "overshooting" anyway?
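A small illustration of the "mixing more varied values from V" point, using made-up scores for a single query over four keys: halving the scores flattens the softmax, so the top key gets less weight and the entropy of the attention distribution goes up.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

scores = [4.0, 2.0, 1.0, 0.0]  # illustrative values only

sharp = softmax(scores)
flat = softmax([s / 2 for s in scores])  # scores halved via W_q and W_k

print("original:", [round(p, 3) for p in sharp], "entropy", round(entropy(sharp), 3))
print("halved:  ", [round(p, 3) for p in flat], "entropy", round(entropy(flat), 3))
```

Whether the resulting mixture of value vectors really cancels out in a useful way is the open question; this only shows that the weights do spread out.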
Yep, double-stacked miqu-1 is just as bad:
dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 80]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [0, 80]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
### USER
Two people are in a room alone with a machine that requires two people to operate. One person is operating the machine. The machine is functioning properly. What is the other person doing?
### ASSISTANT
The other person's role could vary depending on the specifics of the machine and the task at hand.
1. If their role is not explicitly stated in your question, then one possible assumption could be that they are:
a) Observing or monitoring the machine's operation to ensure it's running smoothly
OR
b) Following any necessary safety precautions
However, without more specifics about the scenario, this is a generalization and not definitive.
In some cases they might also be:
- Providing materials needed for the machine operation
- Taking note of any relevant data
Without more context or a clearer prompt it's hard to give an exact answer.
Confidence:
90%
Note that this is a best-educated guess response based on the information given in your question and not all possible factors may have been considered.
### USER
Sally has 3 sisters. Each of her sisters has one brother. How many sisters does Sally have?
### ASSISTANT
The information provided in the question states that each of Sally's sisters has one brother but it does not state that all of them share the same brother or siblings in common apart from each other's siblings.
However, based on the information given we can conclude:
1. Each sister of Sally has 1 brother
2. But, the question does not state that all sisters are from same parents and have to be blood-related siblings
3. The brothers of each sisters might not be the same person
Hence, it's not possible to conclude an exact number of sisters for Sally with the information given in the problem statement.
The question is a tricky one designed to mislead you into a wrong conclusion that all siblings are from same parents and will have the same brother in common which is not true and can't be concluded from this riddle's statement.
So, we cannot definitively say the exact number of sisters for Sally with the information given in the question.
The correct answer is:
We cannot be sure of a definite number of sisters for 'Sally'.
Hence,
The best and most logical conclusion is that:
'It's not possible to say an exact number of sisters for 'Sally'
So the final conclusion is:
The question does not give enough information to decide the exact number of sisters for 'Sally'.
Hence,
The best and most logical conclusion is that:
'It's not possible to say an exact number of sisters for 'Sally'.
So,
Finally,
The correct answer is:
We cannot be sure of the exact number of sisters for 'Sally'.
Therefore,
The best and most logical conclusion is that:
'It's not possible to say an exact number of sisters for 'Sally'.
So,
Finally,
The correct answer is:
We cannot be sure of the exact number of sisters for 'Sally'.
Hence,
The best and most logical conclusion is that:
'It's not possible to say an exact number of sisters for 'Sally'.
So,
Finally,
The
EDIT: Just realised I keep pasting in the stupid version of this question, but either way this model wasn't much use compared to "Sally has 3 sisters. The information about their brothers is not relevant to the question. So, Sally still has 3 sister", and did show the extreme verbosity problem I had when I last tried this without the attenuation applied.
dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 5]
    model: miqu-1-70b-sf
- sources:
  - layer_range: [5, 10]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [5, 15]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [10, 20]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [15, 25]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [20, 30]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [25, 35]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [30, 40]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [35, 45]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [40, 50]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [45, 55]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [50, 60]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [55, 65]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [60, 70]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [65, 75]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [70, 75]
    model: miqu-1-70b-sf
    parameters:
      scale:
      - filter: q_proj
        value: 0.7071067812
      - filter: k_proj
        value: 0.7071067812
      - value: 1
- sources:
  - layer_range: [75, 80]
    model: miqu-1-70b-sf
### USER
Two people are in a room alone with a machine that requires two people to operate. One person is operating the machine. The machine is functioning properly. What is the other person doing?
### ASSISTANT
The other person is also operating the machine. The statement "requires two people to operate" implies that both individuals are actively involved in operating the machine.
### USER
Sally has 3 sisters. Each of her sisters has one brother. How many sisters does Sally have?
### ASSISTANT
Sally has 3 sisters.
Here's the reasoning:
1. We know that Sally has 3 sisters (from the question).
2. Then we are told that each of her sisters has one brother. Since siblings share the same parents, this means that each of SER sisters also has a brother, making it a total of 4 sisters including SERy own self.
3. Therefore, Sally has 3 biological sisters (including herself) + herself, making it a total of 4 sisters in all.
Note: The question might seem confusing at first glance due to its wording, but upon closer inspection, it becomes clear that each sister mentioned is being counted twice - once as a sibling and once as having a sibling.
Yeah, gibberish: "SER sisters"... But sound logic.
So will retry without going into the first 10 and last 10 blocks. I won't because I can't see any nice interleave pattern and don't want to have cube-roots yet...
I'll try it on Wolfram's miqu-1-103b config instead:
dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 40]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [20, 60]
    model: 152334H/miqu-1-70b-sf
- sources:
  - layer_range: [40, 80]
    model: 152334H/miqu-1-70b-sf
-->
dtype: float16
merge_method: passthrough
slices:
- sources:
- layer_range: [0, 20]
model: miqu-1-70b-sf
- sources:
- layer_range: [20, 40]
model: miqu-1-70b-sf
parameters:
scale:
- filter: q_proj
value: 0.7071067812
- filter: k_proj
value: 0.7071067812
- value: 1
- sources:
- layer_range: [20, 60]
model: miqu-1-70b-sf
parameters:
scale:
- filter: q_proj
value: 0.7071067812
- filter: k_proj
value: 0.7071067812
- value: 1
- sources:
- layer_range: [40, 60]
model: miqu-1-70b-sf
parameters:
scale:
- filter: q_proj
value: 0.7071067812
- filter: k_proj
value: 0.7071067812
- value: 1
- sources:
- layer_range: [60, 80]
model: miqu-1-70b-sf
### USER
Two people are in a room alone with a machine that requires two people to operate. One person is operating the machine. The machine is functioning properly. What is the other person doing?
### ASSISTANT
Based on the information provided, it is not explicitly stated what the other person is doing. They could be observing, assisting, or waiting for their turn to operate the machine. Without additional context or details, it's impossible to determine their exact role or actions with certainty.
### USER
Sally has 3 sisters. Each of her sisters has one brother. How many sisters does Sally have?
### ASSISTANT
Sally has 3 sisters. The fact that each of her sisters has one brother does not affect the number of sisters Sally has. Therefore, Sally still has 3 sisters. The information about their brothers is irrelevant to the question asked.
### USER
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
### ASSISTANT
Sally has 2 sisters.
Here's the reasoning:
1. Sally is one of the siblings, making her one of the sisters in question.
2. Each brother has 2 sisters, which means that each brother has Sally and one other sister besides Sally (since they are not counted twice).
3. Therefore, Sally has 2 sisters in total: herself and the other sister shared by her brothers.
### USER
Solve this family relationship quiz: Marie is Emma's parent. Emma is Carl's parent. Carl is Timothy's parent. Timothy is male. Who's Timothy to Marie?
### ASSISTANT
Timothy is Marie's great-grandson.
Explanation:
1. Marie is Emma's parent, so Marie is Emma's mother or father (let's say mother for this example).
2. Emma is Carl's parent, making Emma Carl's mother.
3. Carl is Timothy's parent, making Carl Timothy's father.
4. Therefore, Marie is Timothy's great-grandmother as she is the grandparent of his grandparent (Carl).
### USER
Solve this family relationship quiz: Betty is Brenda's parent. Betty is Joe's parent. Joe is Charles' parent. Brenda is female. Who's Brenda to Charles?
### ASSISTANT
Brenda is Charles' aunt.
Explanation:
1. Betty is Brenda's parent, so Brenda is Betty's child.
2. Betty is also Joe's parent, making Joe Brenda's sibling (either brother or sister).
3. Since Joe is Charles' parent, this makes Brenda Charles' aunt (either aunt by blood or marriage).
### USER
Solve this family relationship quiz: Nathan is Timothy's parent. Nathan is Aaron's parent. Aaron is Edward's parent. Timothy is male. Who's Timothy to Edward?
### ASSISTANT
Timothy is Edward's uncle.
Explanation:
1. Nathan is Timothy's parent, so Timothy is Nathan's child.
2. Aaron is also a child of Nathan (as mentioned in the question), making Aaron a sibling of Timothy (either an older or younger sibling).
3. Aaron is Edward's parent, which means that Edward is Aaron's child.
4. Since Timothy and Aaron are siblings, this makes Timothy an uncle to Edward, who is their nephew.
### USER
Can you summarise the ideas behind Pollack's 1990 "Recursive distributed representations" paper?
### ASSISTANT
Sure! In Michael Pollack's 1990 paper "Recursive distributed representations" published in Artificial Intelligence, he proposed a novel approach to natural language processing (NLP) using recursive distributed representations.
The main idea behind this approach is to represent words or phrases as vectors in a high-dimensional space, where each dimension corresponds to a particular feature of the word or phrase. These vectors are then combined recursively to form more complex representations of phrases, sentences, and even entire documents.
The key innovation of Pollack's approach is the use of recursive compositionality, which allows for the representation of complex linguistic structures by recursively combining simpler representations. This allows for a more compact and efficient representation of language compared to traditional approaches that rely on large feature vectors or rule-based systems.
One important aspect of Pollack's approach is the use of distributed representations, which means that each word or phrase is represented as a vector in a high-dimensional space, rather than as a single point in a discrete feature space. This allows for more expressive and flexible representations of language, as well as more efficient processing and manipulation of linguistic data.
Overall, Pollack's approach to NLP using recursive distributed representations has had a significant impact on the field of natural language processing, and has influenced many subsequent developments in NLP research.
Some better, some worse, but still very good.
I will report back how the goliath-120b recreation turns out tomorrow.
I have failed so far to get any of the original/big merge makers interested in this so gonna try and re-explain here:
Imagine you are trying to get from one location to another:
If you have an extra opportunity to correct your plan, it doesn't make sense to face in approximately the correct direction and move the FULL distance, reevaluate, and then face in approximately the correct direction and move the FULL distance again!
These 3 ways of looking at what a transformer block is doing all suggest this too:
The (1) paper suggests each block is acting as an "in-context" learner, so down-scaling the score matrix is similar to 'damped Newton's method', or even simpler: think of it as the learning rate in gradient descent - if you set the learning rate too high you will overshoot the minimum and get the "rattling" phenomenon where the error starts bouncing around instead of decreasing.
The (2) paper and more visual blog explanation suggest each block is acting as an associative memory stored using (exponentially) many local minima. So instead of just trying to jump straight to the bottom of the closest local minimum, it might make sense to take a smaller step, reevaluate and take another smaller step by changing the β parameter (have a look at the greyscale pictures for a better idea of what's going on).
The (3) videos, suggest each block is adding vectors to each other (nose to tail) in a (very) high-dimensional vector space, where each point in this high-dimensional vector space has a semantic meaning. So again it makes sense to add together 2 smaller vectors to get to a point (with a certain semantic meaning), rather than two copies of the same vector that will likely overshoot out into the edges of this high-dimensional vector space. Empirically it also seems that this holds: the bigger "chunk" of repeated transformer blocks (ie: 40 instead of 20 or 16) the more "confidently wrong" the model gets.
So now hopefully it makes a bit more sense why we would want to move 1/2 the distance if we have 2 opportunities to reevaluate, 1/3 of the distance if we have 3 opportunities to reevaluate, and so on.
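To make the learning-rate analogy concrete, here is a minimal sketch of gradient descent on a toy function f(x) = x^2. The function, the step counts and the three learning rates are all assumptions picked for illustration; nothing here comes from an actual transformer.

```python
def descend(learning_rate, steps=6, start=1.0):
    """Gradient descent on f(x) = x^2 (gradient 2x); returns the visited points."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * 2 * x
        path.append(x)
    return path


# Hypothetical learning rates chosen to show three regimes.
small = descend(0.25)      # every step halves the distance to the minimum at 0
large = descend(0.9)       # every step overshoots and lands on the other side
too_large = descend(1.1)   # overshoots so far that the error keeps growing

for name, path in [("0.25", small), ("0.9", large), ("1.1", too_large)]:
    print(name, [round(p, 3) for p in path])
```

The middle case is the "rattling" behaviour: the sign flips on every step. The last case is the analogue of taking the FULL step twice when a smaller one was needed.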
If we just wanted to use PyTorch and the Transformers library directly then it would be pretty easy to add in a β parameter that we could use to re-scale or "attenuate" the score matrix (it actually already does this by dividing by the sqrt of the head dimension anyway!) to take these multiple steps, BUT: if we want to use these models in software like llama.cpp we can't just add a random parameter like this as each architecture needs to have custom code written for it...
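As a rough illustration of what such a β parameter would do, the sketch below applies a softmax to a made-up row of attention scores with β = 1 and β = 0.5. The scores and the `softmax` helper are assumptions for this example only, not code from PyTorch or Transformers.

```python
import math


def softmax(scores, beta=1.0):
    """Softmax over beta * scores, shifted by the max for numerical stability."""
    scaled = [beta * s for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [4.0, 2.0, 1.0]            # hypothetical pre-softmax scores for one query
sharp = softmax(scores, beta=1.0)   # unmodified block
soft = softmax(scores, beta=0.5)    # attenuated block: flatter distribution

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

With the smaller β the largest weight shrinks and the others grow, so the block commits less strongly to a single key.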
So now looking at how the transformer block works, we see that the score matrix is made by taking lots of dot-products between a latent vector that is first projected using the q_proj and k_proj tensors. So we could just scale say q_proj by 1/n for n repeated blocks and this would have the same effect as introducing a separate β parameter.
The problem with this though is that you have now reduced the magnitudes of the values of only the q_proj tensor, and when it comes round to quantizing the whole model, clever quantization algorithms (or importance matrix calculations) might down-weight the importance of q_proj in favour of k_proj!
So to solve this problem we can use the identity:
1/2 = 1/sqrt(2) * 1/sqrt(2)
1/3 = 1/cuberoot(3) * 1/cuberoot(3) * 1/cuberoot(3)
1/n = (1/(n^(1/n)))^n
(using the rule: nth root of x = x^(1/n))
and now if you think about what is happening when we take a dot-product:
[a, b, c] . [x, y, z]
= a*x + b*y + c*z
and then home in on just a and x, which are getting multiplied and added into the sum.
If we want this sum to be half the magnitude we can use the identity above to get:
1/2 * a * x
= 1/sqrt(2)*a * 1/sqrt(2)*x
≈ 0.7071067812*a * 0.7071067812*x
So after all this, hopefully it makes sense what is going on and where this number is coming from!
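The identity can be checked numerically. This is a small sketch with made-up query and key vectors (the values of `q` and `k` are assumptions): scaling both by 1/sqrt(2) gives the same dot-product as scaling only one of them by 1/2.

```python
import math


def dot(u, v):
    """Plain dot-product of two equal-length lists."""
    return sum(a * x for a, x in zip(u, v))


q = [1.0, -2.0, 3.0]   # hypothetical projected query
k = [0.5, 4.0, -1.0]   # hypothetical projected key
s = 1 / math.sqrt(2)   # ≈ 0.7071067812

full = dot(q, k)
both_scaled = dot([s * a for a in q], [s * x for x in k])
only_q_scaled = dot([0.5 * a for a in q], k)

print(full, both_scaled, only_q_scaled)
```

Both scaled versions produce half of the original score, but in the first one the reduction in magnitude is shared evenly between the two tensors.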
So there are a few caveats:
- There is no guarantee that n transformer blocks should scale the score matrix by 1/n exactly. This only holds for the case where the inputs to the softmax are similar and the post-softmax distribution is fairly uniform.
- If there is a .bias for either q_proj or k_proj (ie: an affine transformation) you don't want to rescale this too (think about the initial example of "trying to get from one location to another" to see why).
- Some architectures give the q_proj and k_proj tensors a different name (see here for llama).
I added this to mergekit/mergekit/merge_methods/passthrough.py:
import logging
.
.
.
if scale is not None:
    logging.info(f"scale: {scale}")
    tensor = tensor * scale
.
.
.
and then run mergekit-yaml --verbose to double check something is being scaled. You should then see some INFO:root:scale: 0.7071067812. If everything is INFO:root:scale: 1.0 then you probably have the wrong tensor names for the architecture and need to check here.
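A quick way to sanity-check the filters before running a full merge is to mimic the lookup by hand. The sketch below is NOT mergekit's code: `resolve_scale` is a hypothetical helper that assumes filters are matched as substrings of the tensor name and that the first matching rule wins, with the filter-less rule acting as the default. The tensor names are the usual Hugging Face llama names and are also an assumption.

```python
def resolve_scale(tensor_name, rules):
    """Return the value of the first rule whose filter is a substring of the name.

    A rule without a "filter" key matches everything (the default).
    """
    for rule in rules:
        pattern = rule.get("filter")
        if pattern is None or pattern in tensor_name:
            return rule["value"]
    return None


rules = [
    {"filter": "q_proj", "value": 0.7071067812},
    {"filter": "k_proj", "value": 0.7071067812},
    {"value": 1.0},
]

# Assumed llama-style tensor names, for illustration only.
names = [
    "model.layers.20.self_attn.q_proj.weight",
    "model.layers.20.self_attn.k_proj.weight",
    "model.layers.20.self_attn.v_proj.weight",
    "model.layers.20.mlp.down_proj.weight",
]

for name in names:
    print(name, resolve_scale(name, rules))
```

If every name falls through to the default here, the same would likely happen in the real merge, which matches the "everything is 1.0" symptom described above.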
So to help demonstrate this idea even more, here are 3 popular frankenmerge model configs that I have adapted, with extensive comments added to show the similarity between the existing configs and these new "attenuated" versions:
###############################
# miqu-1-120b-attenuated.yaml #
###############################
# Use: mergekit-yaml --clone-tensors ./miqu-1-120b-attenuated.yaml ./miqu-1-120b-attenuated
# See: https://huggingface.co/wolfram/miqu-1-120b for original 'miqu-1-120b' layer ranges.
# See: https://github.com/arcee-ai/mergekit/issues/198 for discussion/reasoning behind this idea.
# ---
# The scale factor to use, eg: solve x^2 = 1/2 --> x = 1/sqrt(2) ≈ 0.7071067812
const_tag: &scale_factor 0.7071067812 # 1/sqrt(2)
# The filter parameters of a scaled block.
attenuate-env: &attenuated_env
parameters:
scale:
- filter: q_proj
value: *scale_factor
- filter: k_proj
value: *scale_factor
- value: 1.0
# ---
slices:
###########################
# Block 1: miqu-1 [0, 20] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [0, 10] # The first 10 layers of Block 1 are not duplicated
- sources:
- model: miqu-1-70b-sf
layer_range: [10, 20] # The last 10 layers of Block 1 are duplicated twice
<<: *attenuated_env
###########################
# Block 2: miqu-1 [10, 30] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [10, 30]
<<: *attenuated_env
###########################
# Block 3: miqu-1 [20, 40] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [20, 40]
<<: *attenuated_env
###########################
# Block 4: miqu-1 [30, 50] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [30, 50]
<<: *attenuated_env
###########################
# Block 5: miqu-1 [40, 60] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [40, 60]
<<: *attenuated_env
###########################
# Block 6: miqu-1 [50, 70] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [50, 70]
<<: *attenuated_env
##########################
# Block 7: miqu-1 [60, 80] #
##########################
- sources:
- model: miqu-1-70b-sf
layer_range: [60, 70] # The first 10 layers of Block 7 are duplicated twice
<<: *attenuated_env
- sources:
- model: miqu-1-70b-sf
layer_range: [70, 80] # The last 10 layers of Block 7 are not duplicated
merge_method: passthrough
dtype: float16
################################
# goliath-120b-attenuated.yaml #
################################
# Use: mergekit-yaml ./goliath-120b-attenuated.yaml ./goliath-120b-attenuated
# See: https://huggingface.co/alpindale/goliath-120b for original 'goliath-120b' layer ranges.
# See: https://github.com/arcee-ai/mergekit/issues/198 for discussion/reasoning behind this idea.
# ---
# The scale factor to use, eg: solve x^2 = 1/2 --> x = 1/sqrt(2) ≈ 0.7071067812
const_tag: &scale_factor 0.7071067812 # 1/sqrt(2)
# The filter parameters of a scaled block.
attenuate-env: &attenuated_env
parameters:
scale:
- filter: q_proj
value: *scale_factor
- filter: k_proj
value: *scale_factor
- value: 1.0
# ---
slices:
#########################
# Block 1: Xwin [0, 16] #
#########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [0, 8] # The first 8 layers of Block 1 are not duplicated
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [8, 16] # The last 8 layers of Block 1 are duplicated twice
<<: *attenuated_env
############################
# Block 2: Euryale [8, 24] #
############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [8, 16]
<<: *attenuated_env
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [16, 17] # layer 16 is not duplicated
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [17, 24]
<<: *attenuated_env
##########################
# Block 3: Xwin [17, 32] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [17, 24]
<<: *attenuated_env
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [24, 25] # layer 24 is not duplicated
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [25, 32]
<<: *attenuated_env
#############################
# Block 4: Euryale [25, 40] #
#############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [25, 32]
<<: *attenuated_env
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [32, 33] # layer 32 is not duplicated
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [33, 40]
<<: *attenuated_env
##########################
# Block 5: Xwin [33, 48] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [33, 40]
<<: *attenuated_env
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [40, 41] # layer 40 is not duplicated
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [41, 48]
<<: *attenuated_env
#############################
# Block 6: Euryale [41, 56] #
#############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [41, 48]
<<: *attenuated_env
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [48, 49] # layer 48 is not duplicated
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [49, 56]
<<: *attenuated_env
##########################
# Block 7: Xwin [49, 64] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [49, 56]
<<: *attenuated_env
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [56, 57] # layer 56 is not duplicated
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [57, 64]
<<: *attenuated_env
#############################
# Block 8: Euryale [57, 72] #
#############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [57, 64]
<<: *attenuated_env
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [64, 65] # layer 64 is not duplicated
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [65, 72]
<<: *attenuated_env
##########################
# Block 9: Xwin [65, 80] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [65, 72] # The first 7 layers of Block 9 are duplicated twice
<<: *attenuated_env
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [72, 80] # The last 8 layers of Block 9 are not duplicated
merge_method: passthrough
dtype: float16
######################################
# WinterGoliath-123b-attenuated.yaml #
######################################
# Use: mergekit-yaml ./WinterGoliath-123b-attenuated.yaml ./WinterGoliath-123b-attenuated
# See: https://huggingface.co/ChuckMcSneed/WinterGoliath-123b for original 'WinterGoliath-123b' layer ranges.
# See: https://github.com/arcee-ai/mergekit/issues/198 for discussion/reasoning behind this idea.
# ---
# The scale factor to use, eg: solve x^2 = 1/2 --> x = 1/sqrt(2) ≈ 0.7071067812
const_tag: &scale_factor 0.7071067812 # 1/sqrt(2)
# The filter parameters of a scaled block.
attenuate-env: &attenuated_env
parameters:
scale:
- filter: q_proj
value: *scale_factor
- filter: k_proj
value: *scale_factor
- value: 1.0
# ---
slices:
#########################
# Block 1: Xwin [0, 16] #
#########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [0, 8] # The first 8 layers of Block 1 are not duplicated
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [8, 16] # The last 8 layers of Block 1 are duplicated twice
<<: *attenuated_env
##################################
# Block 2: WinterGoddess [8, 24] #
##################################
- sources:
- model: WinterGoddess-1.4x-70B-L2
layer_range: [8, 24]
<<: *attenuated_env
##########################
# Block 3: Xwin [16, 32] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [16, 32]
<<: *attenuated_env
###################################
# Block 4: WinterGoddess [24, 40] #
###################################
- sources:
- model: WinterGoddess-1.4x-70B-L2
layer_range: [24, 40]
<<: *attenuated_env
##########################
# Block 5: Xwin [32, 48] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [32, 48]
<<: *attenuated_env
###################################
# Block 6: WinterGoddess [40, 56] #
###################################
- sources:
- model: WinterGoddess-1.4x-70B-L2
layer_range: [40, 56]
<<: *attenuated_env
##########################
# Block 7: Xwin [48, 64] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [48, 64]
<<: *attenuated_env
###################################
# Block 8: WinterGoddess [56, 72] #
###################################
- sources:
- model: WinterGoddess-1.4x-70B-L2
layer_range: [56, 72]
<<: *attenuated_env
##########################
# Block 9: Xwin [64, 80] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [64, 72] # The first 8 layers of Block 9 are duplicated twice
<<: *attenuated_env
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [72, 80] # The last 8 layers of Block 9 are not duplicated
merge_method: passthrough
dtype: float16
IMPORTANT: It seems that when merging different models, scaling the score matrix by 0.5 (and hence the q_proj and k_proj tensors by 1/sqrt(2)) may be too small a factor! This worked OK for the self-merged miqu-1-120b, but my tests with goliath-120b show it introduces repetition problems. To set a custom value use this:
Reduce[{x^2 == 0.5, x > 0}, x, Reals]
Replace 0.5 with whatever value in the range [0.5, 1.0] you want and click the "Approximate Form" button.
If you want to experiment with duplicating 3 blocks:
Reduce[{x^3 == 0.333, x > 0}, x, Reals]
Replace 0.333 with whatever value in the range [0.333, 1.0] you want and click the "Approximate Form" button.
and so on...
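The same positive real root can be computed without the web form. This is a minimal sketch: `root_scale` is a hypothetical helper name, and `power` is simply the exponent that appears in the Reduce[...] expression.

```python
def root_scale(target, power=2):
    """Positive real solution of x**power == target, for 0 < target <= 1."""
    if not 0 < target <= 1:
        raise ValueError("target must be in the range (0, 1]")
    return target ** (1.0 / power)


# Same inputs as the two Reduce[...] expressions above.
print(round(root_scale(0.5, 2), 10))
print(round(root_scale(0.333, 3), 10))
```

The printed value can be pasted straight into the scale_factor line of the yaml configs.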
@cg123 @shamanez
Is there any way I can simplify these complex looking yaml files more or some better way to define this in mergekit? It's making the whole idea look much more complex than it really is and likely to put most people off from trying it (not to mention; it's incredibly error prone trying to define stuff like goliath-120b like this!).
Very interesting, thank you for your research and sharing the results. I am still trying to get my head around it, which I am finding a bit difficult with the limited time that I have at the moment to work on this. Maybe I can apply it to WestLake-v2-10.7b, which should allow me to benchmark it relatively quickly and compare to the original.
Are you planning to publish the 103b self-merge of miqu-1 using this method?
I can't upload the merged model easily, as only have around 1mb/s upload on VDSL :/
But this is what the yaml file would be for miqu-1-103b:
###############################
# miqu-1-103b-attenuated.yaml #
###############################
# Use: mergekit-yaml --clone-tensors ./miqu-1-103b-attenuated.yaml ./miqu-1-103b-attenuated
# See: https://huggingface.co/wolfram/miqu-1-103b for original 'miqu-1-103b' layer ranges.
# See: https://github.com/arcee-ai/mergekit/issues/198 for discussion/reasoning behind this idea.
# ---
# The scale factor to use, eg: solve x^2 = 1/2 --> x = 1/sqrt(2) ≈ 0.7071067812
const_tag: &scale_factor 0.7071067812 # 1/sqrt(2)
# The filter parameters of a scaled block.
attenuate-env: &attenuated_env
parameters:
scale:
- filter: q_proj
value: *scale_factor
- filter: k_proj
value: *scale_factor
- value: 1.0
# ---
slices:
###########################
# Block 1: miqu-1 [0, 40] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [0, 20] # The first 20 layers of Block 1 are not duplicated
- sources:
- model: miqu-1-70b-sf
layer_range: [20, 40] # The last 20 layers of Block 1 are duplicated twice
<<: *attenuated_env
###########################
# Block 2: miqu-1 [20, 60] #
###########################
- sources:
- model: miqu-1-70b-sf
layer_range: [20, 60] # All the layers of Block 2 are duplicated twice
<<: *attenuated_env
##########################
# Block 3: miqu-1 [40, 80] #
##########################
- sources:
- model: miqu-1-70b-sf
layer_range: [40, 60] # The first 20 layers of Block 3 are duplicated twice
<<: *attenuated_env
- sources:
- model: miqu-1-70b-sf
layer_range: [60, 80] # The last 20 layers of Block 3 are not duplicated
merge_method: passthrough
dtype: float16
To create the merge:
mergekit-yaml --clone-tensors ./miqu-1-103b-attenuated.yaml ./miqu-1-103b-attenuated
If you wanted to make your own GGUF quants (assuming you have llama.cpp cloned and built) then this too:
./convert.py miqu-1-103b-attenuated --outfile miqu-1-103b-attenuated-f16.gguf --outtype f16
./imatrix --chunks 200 -m miqu-1-103b-attenuated-f16.gguf -f groups_merged.txt -o miqu-1-103b-attenuated-f16.imatrix
./quantize --imatrix miqu-1-103b-attenuated-f16.imatrix miqu-1-103b-attenuated-f16.gguf miqu-1-103b-attenuated-q5_K_M.gguf Q5_K_M 12
- Add the -ngl option to the imatrix step (to offload layers to a GPU), or skip that step altogether if you don't care.
- Change Q5_K_M to whatever you want.
- Change 12 to the number of threads you want to use (quantize seems to have a livelock bug currently if you use too many).
I can't upload the merged model easily, as only have around 1mb/s upload on VDSL :/
Understood. I don't have much upload bandwidth either, but at least a bit more. I have just run the merge for miqu-1-120b-attenuated, and I am in the middle of converting and quantising it. I will upload the Q4_KS version to froggeric/miqu-1-120b-attenuated-GGUF.
Thanks for the 103b yaml.
No problem - and if you get chance; try using different (larger) values of scale_factor and see what difference it makes (I only tested those puzzles yesterday). There isn't really a good reason it should be exactly 1/sqrt(2) ≈ 0.7071067812. The actual optimal value will lie somewhere in the range [0.7071067812, 1].
For the 103b self-merge, I would actually be more interested in adapting the following recipe instead:
slices:
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [0,11]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [9,13]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [11,15]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [13,17]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [15,23]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [21,25]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [23,49]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [47,51]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [49,53]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [51,55]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [53,57]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [55,59]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [57,61]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [59,63]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [61,65]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [63,67]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [65,69]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [67,71]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [69,73]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [71,75]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [73,80]
merge_method: passthrough
dtype: float16
After doing the goliath-120b version I don't think my brain could handle this! :rofl:
You would need to try to work out how many times each layer is getting duplicated first - probably best to look at the original goliath-120b model and compare with what I did to see what's needed.
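Counting the duplicates does not have to be done by hand. Here is a small sketch that tallies how many times each layer index appears in the recipe quoted above; the ranges are copied from that recipe, and treating layer_range as end-exclusive is an assumption that matches how the other configs in this thread add up.

```python
from collections import Counter

# layer_range values from the recipe above (end treated as exclusive).
ranges = [
    (0, 11), (9, 13), (11, 15), (13, 17), (15, 23), (21, 25), (23, 49),
    (47, 51), (49, 53), (51, 55), (53, 57), (55, 59), (57, 61), (59, 63),
    (61, 65), (63, 67), (65, 69), (67, 71), (69, 73), (71, 75), (73, 80),
]

copies = Counter()
for start, end in ranges:
    copies.update(range(start, end))

total_layers = sum(copies.values())
duplicated = sorted(layer for layer, count in copies.items() if count > 1)

print("total layers in the merge:", total_layers)
print("max copies of any layer:", max(copies.values()))
print("duplicated layers:", duplicated)
```

The list of duplicated layers is what would need the attenuated q_proj and k_proj settings.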
Hopefully @cg123 or @shamanez might get back to us to see if there is a simpler way to do this!
I'm going to recreate the miqu-1-120b-attenuated version now and import it into Ollama with the proper miqu-1 template to experiment with tomorrow.
Omg this is one way to merge 😁.
Nice, you found my secret scale parameter. :) This is really interesting, it's great to see actual results coming out of this! I'm out of town at the moment so I can't really dig into this for a while. I'll be watching with interest for sure though.
I'd definitely recommend writing a little python script or something to generate these configs.
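Along those lines, here is a minimal sketch of such a generator. It is not part of mergekit: `plan_slices` and `render_config` are hypothetical helpers, layers are counted by index regardless of which model they come from, every layer that appears more than once gets the same scale factor, and the `parameters` block is written out in full under each source instead of using a YAML anchor.

```python
from collections import Counter

SCALE_FACTOR = 0.7071067812  # 1/sqrt(2), the value used in the configs above


def plan_slices(blocks):
    """Split overlapping blocks into runs of layers with a constant copy count.

    blocks: list of (model, start, end) with end exclusive, as in layer_range.
    Returns a list of (model, start, end, copies) tuples in output order.
    """
    copies = Counter()
    for _, start, end in blocks:
        copies.update(range(start, end))

    plan = []
    for model, start, end in blocks:
        run_start = start
        for layer in range(start + 1, end + 1):
            # Close the run at the end of the block, or when the next
            # layer has a different number of copies.
            if layer == end or copies[layer] != copies[run_start]:
                plan.append((model, run_start, layer, copies[run_start]))
                run_start = layer
    return plan


def render_config(blocks, scale_factor=SCALE_FACTOR, dtype="float16"):
    """Render a passthrough config that scales q_proj/k_proj of repeated layers."""
    lines = ["slices:"]
    for model, start, end, copies in plan_slices(blocks):
        lines += [
            "  - sources:",
            f"    - model: {model}",
            f"      layer_range: [{start}, {end}]",
        ]
        if copies > 1:
            lines += [
                "      parameters:",
                "        scale:",
                "          - filter: q_proj",
                f"            value: {scale_factor}",
                "          - filter: k_proj",
                f"            value: {scale_factor}",
                "          - value: 1.0",
            ]
    lines += ["merge_method: passthrough", f"dtype: {dtype}"]
    return "\n".join(lines) + "\n"


# The three overlapping blocks of the miqu-1-103b layout.
miqu_103b = [
    ("miqu-1-70b-sf", 0, 40),
    ("miqu-1-70b-sf", 20, 60),
    ("miqu-1-70b-sf", 40, 80),
]
print(render_config(miqu_103b))
```

For the miqu-1-103b blocks this splits the layers into the same five slices as the hand-written miqu-1-103b-attenuated.yaml, so it may be worth diffing the output against a known-good config before trusting it on something like goliath-120b.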
There's also some discussion in this thread too: https://huggingface.co/wolfram/miqu-1-120b/discussions/4
divinetaco (who has been running grid-search to find optimal merge parameters) has also confirmed it doesn't seem to work very well for non-self-merges:
I quickly experimented with downscaling on https://huggingface.co/divinetaco/aranea-tenebris-120b-v1.0 values tried: 0.5, 0.6, 0.8, 0.9
Perplexity scores in my benchmarks didn't shift, but it did modify the responses on manual inspection, which is interesting. Using lower values (0.5,0.6) degraded response quality. It was harder to judge with (0.8,0.9). It may have improved. Will do further testing.
doesn't seem to work well for the non-self-merges
The lower values might be especially bad in tenebris due to the irregular merge pattern.
Seeing Miquliz's merge config again did jog my memory regarding wolfram's trick with the 0 weight merge on first / final layer - got some more nice improvements applying it to tenebris. I might have to upload a v1.1 once I am done toying with projection scale values.
Hopefully somebody can work out how to apply the "0 weight merge on first / final layer" thing in case that is important.
I'm away for a few days now, but here's a couple more ideas to experiment with:
const_tag: &attn_scale_factor 0.7071067812
const_tag: &out_scale_factor 0.5
attenuate-env: &attenuated_env
  parameters:
    scale:
      - filter: q_proj
        value: *attn_scale_factor
      - filter: k_proj
        value: *attn_scale_factor
      - filter: down_proj
        value: *out_scale_factor
      - value: 1.0
The down_proj tensor is renamed as post_mlp in architectures/llama.json, so as long as it is just a linear projection (without any non-linearity after it) this should work and (hopefully) have the desired effect of adding less into the "residual highway":
If there is a non-linearity after down_proj then it will likely completely destroy the model, so it's easy to check...
You probably don't want to scale both attn_scale_factor and out_scale_factor too aggressively together (maybe just leave attn_scale_factor = 1 to start with), and also double check there is no down_proj.bias if using a different architecture to llama.
The next thing to try experimenting with for non-self-merges (where this doesn't seem to be working well) is to not scale the "donor" models' q_proj and k_proj tensors, eg:
const_tag: &scale_factor 0.7071067812 # 1/sqrt(2)
# The filter parameters of a scaled block.
attenuate-env: &attenuated_env
parameters:
scale:
- filter: q_proj
value: *scale_factor
- filter: k_proj
value: *scale_factor
- value: 1.0
# ---
slices:
#########################
# Block 1: Xwin [0, 16] #
#########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [0, 16]
############################
# Block 2: Euryale [8, 24] #
############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [8, 24]
<<: *attenuated_env
##########################
# Block 3: Xwin [17, 32] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [17, 32]
#############################
# Block 4: Euryale [25, 40] #
#############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [25, 40]
<<: *attenuated_env
##########################
# Block 5: Xwin [33, 48] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [33, 48]
#############################
# Block 6: Euryale [41, 56] #
#############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [41, 56]
<<: *attenuated_env
##########################
# Block 7: Xwin [49, 64] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [49, 64]
#############################
# Block 8: Euryale [57, 72] #
#############################
- sources:
- model: Euryale-1.3-L2-70B
layer_range: [57, 72]
<<: *attenuated_env
##########################
# Block 9: Xwin [65, 80] #
##########################
- sources:
- model: Xwin-LM-70B-V0.1
layer_range: [65, 80]
merge_method: passthrough
dtype: float16
and see if that helps.
I just finished evaluating miqu-1-120b-attenuated-GGUF against my LLM Creativity benchmark and it performs slightly better than the non-attenuated version, taking the second spot in the current leaderboard. I noticed some improvements in creative writing, producing longer, more detailed, and unrushed text. Like wolfram/miqu-1-120b though, there is some degradation over miqu-1-70b with longer text, as it starts deviating from instructions and requires some effort to keep it on track. The scores show some improvement in all categories.
This is a fascinating discussion. I am cooking up an attenuated version of Midnight-Miqu-103B-v1.5 as I type this. I'll be curious to see how it performs.
I see ChuckMcSneed has found the same problem with trying this on goliath-120b, so it's looking like for multi-model merges there might need to be something else.
I think the first things to try would be to increase the scale-factor right up (remembering that the scaling effect on the score matrix will be this squared though!).
The current setting (scale-factor = 1/sqrt(2)) doesn't really have any good theoretical reason for being optimal (for self-merges or multi-model merges alike). It's probably a good lower bound though, and less (for 2x duplicating layers at least) is unlikely to make anything useful.
It would probably be interesting to try values < 1 and values > 1 for a model that is just a straight copy (ie: use passthrough but just one model and all layers) just to see what effect it has and aid understanding of the failure modes of this method.
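The "squared" remark can be checked on a toy example. The sketch below is plain Python with made-up vectors (the helper names `softmax` and `attention_scores` are illustrative, not mergekit or transformers code). It assumes that scaling q_proj and k_proj by the same factor simply scales every query and key vector by that factor, which ignores anything else in the real attention path.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(q, keys, scale_factor=1.0):
    """Scaled dot-product scores with q and every key multiplied by scale_factor."""
    d = len(q)
    q_scaled = [scale_factor * v for v in q]
    scores = []
    for k in keys:
        k_scaled = [scale_factor * v for v in k]
        scores.append(sum(a * b for a, b in zip(q_scaled, k_scaled)) / math.sqrt(d))
    return scores

# Made-up query and keys, purely for illustration.
q = [1.0, 2.0, 0.0, -1.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, 0.0, 2.0, 0.0],
]

base = attention_scores(q, keys)
attenuated = attention_scores(q, keys, scale_factor=1.0 / math.sqrt(2.0))

print("scores:            ", base)
print("attenuated scores: ", attenuated)
print("weights:           ", softmax(base))
print("attenuated weights:", softmax(attenuated))
```

With a factor of 1/sqrt(2) on both sides, every pre-softmax score is halved and the resulting attention weights are flatter, so a factor closer to 1 gives a correspondingly milder effect.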
const_tag: &attn_scale_factor 0.7071067812
const_tag: &out_scale_factor 0.5
attenuate-env: &attenuated_env
  parameters:
    scale:
      - filter: q_proj
        value: *attn_scale_factor
      - filter: k_proj
        value: *attn_scale_factor
      - filter: down_proj
        value: *out_scale_factor
      - value: 1.0
I am going to give it a try with miqu-1 120b self-merge, and run it through my benchmark.
Results of testing down_pro to 0.5: output is 100% identical to the self-merge without attenuating it.
Only have my tablet here so using Claude Opus to format this but it seems to write invalid Latex/Katex for Github :(
Let's analyze how the expected value of the L2 norm of the sum of two high-dimensional random unit vectors changes as the correlation between them increases from 0 to 1.
Let $\mathbf{a}$ and $\mathbf{b}$ be two high-dimensional random vectors with $|\mathbf{a}|_2 = |\mathbf{b}|_2 = 1$, and let $\mathbf{c} = \mathbf{a} + \mathbf{b}$.
The L2 norm of $\mathbf{c}$ is given by:
$$|\mathbf{c}|_2 = \sqrt{\sum_{i=1}^n (a_i^2 + 2a_ib_i + b_i^2)}$$
Let's denote the dot product of $\mathbf{a}$ and $\mathbf{b}$ as $\rho = \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^n a_ib_i$. For unit vectors this is their cosine similarity, used here as the correlation between the two vectors. It can lie anywhere in $[-1, 1]$, but only the range from 0 to 1 is considered below.
Substituting $\rho$ into the equation:
$$|\mathbf{c}|_2 = \sqrt{2 + 2\rho}$$
Now, let's consider the extreme cases:
When $\rho = 0$, the vectors are uncorrelated:
$$E[|\mathbf{c}|_2] = \sqrt{2}$$
When $\rho = 1$, the vectors are perfectly positively correlated (i.e., $\mathbf{a} = \mathbf{b}$):
$$E[|\mathbf{c}|_2] = \sqrt{4} = 2$$
For values of $\rho$ between 0 and 1, the expected value of $|\mathbf{c}|_2$ will vary continuously between $\sqrt{2}$ and 2.
$$E[|\mathbf{c}|_2] = \sqrt{2 + 2\rho}$$
In summary, as the correlation between the two high-dimensional random unit vectors increases from 0 to 1, the expected value of the L2 norm of their sum varies from $\sqrt{2}$ to 2, with a value of $\sqrt{2}$ when they are uncorrelated ($\rho = 0$).
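The formula above is easy to sanity-check numerically. A minimal sketch in plain Python, assuming nothing beyond the definitions above: it builds two unit vectors with a chosen dot product rho and compares the norm of their sum with sqrt(2 + 2*rho). The helper names and the dimension of 1024 are arbitrary choices for illustration.

```python
import math
import random

def unit_vectors_with_correlation(rho, dim=1024, seed=0):
    """Return unit vectors a, b with a . b == rho (up to rounding)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm_a = math.sqrt(sum(v * v for v in a))
    a = [v / norm_a for v in a]

    # Build a unit vector r orthogonal to a (one Gram-Schmidt step).
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(u * v for u, v in zip(r, a))
    r = [u - dot * v for u, v in zip(r, a)]
    norm_r = math.sqrt(sum(v * v for v in r))
    r = [v / norm_r for v in r]

    # Mix a and r so that b is a unit vector with a . b = rho.
    w = math.sqrt(1.0 - rho * rho)
    b = [rho * u + w * v for u, v in zip(a, r)]
    return a, b

def norm_of_sum(a, b):
    return math.sqrt(sum((u + v) ** 2 for u, v in zip(a, b)))

for rho in (0.0, 0.5, 1.0):
    a, b = unit_vectors_with_correlation(rho)
    print(f"rho={rho:.1f}  |a+b|={norm_of_sum(a, b):.6f}  sqrt(2+2*rho)={math.sqrt(2.0 + 2.0 * rho):.6f}")
```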
So basically what this is saying is:
So putting this back into the value in the above yaml files that scales Q and K (confusingly we have another sqrt(2) now lol):
This all should be taken with a grain of salt as in the actual model is very different to this simple example, but it is probably a good place to start playing with different scale factors to try to get it working with multi-model merges.
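As a starting point for playing with scale factors, here is a small sketch that turns an assumed correlation rho into candidate values. The two endpoints match the numbers in the config comments further down this thread (correlation 1: residual scale 1/2 and Q/K factor sqrt(1/2); correlation 0: residual scale sqrt(1/2) and Q/K factor sqrt(sqrt(1/2))). Dividing by sqrt(2 + 2*rho) for intermediate rho, and taking the Q/K factor as the square root of the residual scale, is my interpolation between those endpoints rather than something stated in the thread, and `scale_factors` is a hypothetical helper name.

```python
import math

def scale_factors(rho):
    """Return (residual_scale, qk_scale) for an assumed correlation rho in [0, 1].

    Assumption: the residual contribution is divided by the norm sqrt(2 + 2*rho)
    of the sum of two unit vectors, and q_proj / k_proj each get the square root
    of that factor so that their product scales the scores by the same amount.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho is expected to lie in [0, 1]")
    residual_scale = 1.0 / math.sqrt(2.0 + 2.0 * rho)
    qk_scale = math.sqrt(residual_scale)
    return residual_scale, qk_scale

for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    residual_scale, qk_scale = scale_factors(rho)
    print(f"rho={rho:.2f}  residual_scale={residual_scale:.10f}  qk_scale={qk_scale:.10f}")
```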
Sorry for the bad tablet formatting - autocorrect is painful for stuff like this, but thought it was worth posting before I get back whilst people are currently experimenting with this idea still.
Results of testing down_pro to 0.5: output is 100% identical to the self-merge without attenuating it.
Did you add the logging output and run with --verbose? This could just be a naming mistake: down_proj is what it should be and with the logging line added you should see some 0.5 in the logging output.
See this post for the logging alteration:
https://github.com/arcee-ai/mergekit/issues/198#issuecomment-2063716974
class LlamaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        if self.config.pretraining_tp > 1:
            slice = self.intermediate_size // self.config.pretraining_tp
            gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
            up_proj_slices = self.up_proj.weight.split(slice, dim=0)
            down_proj_slices = self.down_proj.weight.split(slice, dim=1)
            gate_proj = torch.cat(
                [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
            )
            up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
            intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
            down_proj = [
                F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
            ]
            down_proj = sum(down_proj)
        else:
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
The key line is this:
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
and:
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
So down_proj is (as hoped) applied after the non-linearity act_fn and then added back into the residual stream.
So this should definitely do something and as far as I can see, is the only place we can apply this scaling: all the other places look like they will end up scaling V and/or end up scaling the inputs to the MLP non-linear activation functions.
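To make the point concrete, here is a toy version of the key line in plain Python. It is a sketch with tiny made-up weight matrices and SiLU assumed as the activation, not the transformers implementation, and `toy_mlp` and its scale arguments are hypothetical names. Scaling a bias-free linear layer's weights is the same as scaling its output, which is how the scale arguments are applied here.

```python
import math

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def silu(v):
    return v / (1.0 + math.exp(-v))

def toy_mlp(x, gate, up, down, gate_scale=1.0, down_scale=1.0):
    """down(act(gate(x)) * up(x)), with optional scaling of the gate or down weights."""
    gate_out = [gate_scale * g for g in matvec(gate, x)]
    hidden = [silu(g) * u for g, u in zip(gate_out, matvec(up, x))]
    return [down_scale * d for d in matvec(down, hidden)]

# Made-up weights: hidden size 2, intermediate size 3.
x = [1.0, -2.0]
gate = [[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]]
up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
down = [[1.0, -1.0, 0.5], [0.25, 2.0, -1.0]]

out = toy_mlp(x, gate, up, down)
half_down = toy_mlp(x, gate, up, down, down_scale=0.5)
half_gate = toy_mlp(x, gate, up, down, gate_scale=0.5)

print("unscaled:        ", out)
print("down_proj * 0.5: ", half_down)  # exactly half of the unscaled output
print("gate_proj * 0.5: ", half_gate)  # not half: the scaling passes through the activation
```

Only the down_proj scaling gives a clean multiple of the original output, which is consistent with it being the one place in the MLP where the contribution to the residual stream can be scaled directly.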
When I get back home I'll look into seeing if I can get this to actually run and then should be able to get some interesting diagnostic stuff printed out for the actual transformer blocks we are dealing with, instead of just having to guess at it.
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
@froggeric maybe this is why it had no effect? It looks like it actually calculates the standard deviation of the actual input instance's values and then uses this to rescale:
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
rather than using the weight as a standard deviation it found during training (as I had naively assumed). @cg123 can probably confirm this?
If so then I don't think there will be any place to change the scale like this without altering the transformer code and in turn making the model impossible to quantize using llama.cpp or other software :/
This is an interesting idea! I've played around with downscaling o_proj/down_proj to just change the magnitude of the residual from layers but I'd definitely like to see what happens here.
@cg123 does changing the down_proj have an effect or does the norm layer just cancel it out?
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
    hidden_states=hidden_states,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_value=past_key_value,
    output_attentions=output_attentions,
    use_cache=use_cache,
    cache_position=cache_position,
    **kwargs,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
This still looks like it should do something as hidden_states = residual + hidden_states then feeds into the next block: hidden_states = self.input_layernorm(hidden_states).
The only way I can see it would do nothing is if hidden_states has a mean of zero?
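A toy check of the two cases, as a sketch in plain Python with made-up numbers and a unit-weight RMSNorm written to mirror the formula quoted above (the helper names are illustrative). Scaling everything that goes into the norm is cancelled by the normalisation, but scaling only the branch that is added to the residual changes the mix, so the next norm sees a different vector.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm with unit weight: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def add(a, b, scale=1.0):
    """Elementwise a + scale * b."""
    return [u + scale * v for u, v in zip(a, b)]

residual = [0.5, -1.0, 2.0, 0.25]  # made-up residual stream values
mlp_out = [1.0, 0.5, -0.5, 2.0]    # made-up output of the MLP branch

# Case 1: the whole input of the norm is halved, which the norm undoes (up to eps).
whole_full = rms_norm(add(residual, mlp_out))
whole_half = rms_norm([0.5 * v for v in add(residual, mlp_out)])

# Case 2: only the branch is halved before the residual add.
branch_half = rms_norm(add(residual, mlp_out, scale=0.5))

print("norm(residual + mlp_out):       ", whole_full)
print("norm(0.5 * (residual + mlp_out)):", whole_half)
print("norm(residual + 0.5 * mlp_out): ", branch_half)
```

In this toy setting the first two outputs agree to within the effect of eps, while the third differs noticeably, which fits the reading that scaling down_proj should still change what the next block sees.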
@froggeric I'd leave playing with down_proj for now - until @cg123 confirms it does anything or I see if I can hack in some printout diagnostics.
From: https://arxiv.org/abs/2309.00071
In Section 3.4, the authors introduce a temperature factor $t$ that scales the logits before the attention softmax:
$$\text{Attention Weights} = \text{softmax} \left( \frac{\mathbf{q}^\top \mathbf{k}}{t\sqrt{|D|}} \right)$$
This attention scaling has a uniform impact on perplexity, regardless of the data sample and token position over the extended context window.
The authors reparameterize RoPE as 2D matrices, allowing them to scale both $\mathbf{q}$ and $\mathbf{k}$ by $\sqrt{1/t}$ by simply scaling the complex RoPE embeddings. This has zero overhead during inference and training since RoPE embeddings are pre-computed.
In Appendix A.2, the authors analyze the impact of $1/\sqrt{t}$ on perplexity using 896 16k-token documents from RedPajama. They recommend the following value for $t$:
$$\sqrt{1/t} = 0.1 \ln(s) + 1$$
where $s$ is the scale extension factor used in YaRN's interpolation methods.
The authors observe that a suitable $t$ can improve perplexity scores across the extended context window, and the best value of $t$ is mostly consistent across different samples and token positions.
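The recommended value quoted above is easy to tabulate. A minimal sketch that just evaluates sqrt(1/t) = 0.1 ln(s) + 1 for a few scale extension factors. The function names `yarn_qk_scale` and `yarn_temperature` are mine and are not taken from the YaRN code.

```python
import math

def yarn_qk_scale(s):
    """sqrt(1/t) = 0.1 * ln(s) + 1, the factor applied to both q and k."""
    return 0.1 * math.log(s) + 1.0

def yarn_temperature(s):
    """The temperature t implied by the factor above."""
    return 1.0 / yarn_qk_scale(s) ** 2

for s in (1.0, 2.0, 4.0, 8.0, 16.0, 32.0):
    print(f"s={s:5.1f}  sqrt(1/t)={yarn_qk_scale(s):.6f}  t={yarn_temperature(s):.6f}")
```

Note that for s > 1 this factor is greater than 1, so under this formula q and k are scaled up rather than down.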
Sorry for the AI summary but it saves me trying to write that lot out with autocorrect 😁
It's interesting they also attenuate the softmax but there doesn't seem to be any real theory behind it, other than it was found to work empirically.
This would also be useful for a method described in a paper I can't find atm (related to the Solar 10.7b paper but not it) where a single transformer block is duplicated and placed after the original, and then 2 of the weight matrices in the block (one before and one after IIRC) are set to zero with the intention of making the whole block just perform the identity function and pass straight through. The idea being that this allows (especially instruction tuned) models to avoid catastrophic forgetting (somebody used this to upscale Mistral but again I can't find that model now).
Found it again:
https://arxiv.org/abs/2401.02415
So this can be implemented now in mergekit by setting the scale parameter of o_proj and down_proj to zero for the duplicated "identity initialised" layers.
It will require fine-tuning but I think it's an interesting idea for merging methods and would be a good way to show off the use of the passthrough scale parameter in an example yaml file.
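Why zeroing those two projections gives an identity block can be seen from the residual structure alone. The sketch below is a toy stand-in in plain Python: `toy_branch` replaces "norm, then attention or MLP" with an arbitrary function, and the two scale arguments play the role of the scale applied to o_proj and down_proj. None of these names come from mergekit or transformers.

```python
import math

def toy_branch(x, weight):
    """Stand-in for 'norm -> attention/MLP -> output projection'; any function would do."""
    return [weight * math.tanh(v) for v in x]

def toy_decoder_block(x, o_proj_scale=1.0, down_proj_scale=1.0):
    """Two residual adds, each with a scale on the branch's final projection."""
    h = [xi + o_proj_scale * bi for xi, bi in zip(x, toy_branch(x, 0.8))]
    return [hi + down_proj_scale * bi for hi, bi in zip(h, toy_branch(h, -0.3))]

x = [0.5, -1.0, 2.0]
normal = toy_decoder_block(x)
identity = toy_decoder_block(x, o_proj_scale=0.0, down_proj_scale=0.0)

print("input:             ", x)
print("normal block:      ", normal)
print("zero-scaled block: ", identity)  # same values as the input
```

With both scales at zero the block returns its input unchanged, so a duplicated layer starts out as a pass-through until fine-tuning moves the zeroed weights.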
I'm pretty sure I've completely cracked this now and am just running a miqu-1 self merge to be sure my numbers work out, and I'll make a detailed post then.
Whilst I'm waiting for the model to finish cooking, I'll start with these 2 images from The Unreasonable Ineffectiveness of the Deeper Layers paper @cg123 implemented to prune back llama-70b yesterday:
I've marked in red the bounds of where there is a clear transition in the early layers for the llama-70b models. Notice how this approximates to the "first 8-10 layers can't be overlapped" phenomenon in nearly all frankenmerges to date?
I've also marked in orange where there is a similar but much fainter transition in the last layers, but there is also an interesting bit of blue on the diagonal that consists of 2-3 layers at the end. I know people nearly always don't overlap the last 10 layers but when I was messing about with codellama-33b and its fine-tunes I found that it was nearly coherent with only the last 2 layers left and almost completely coherent (with the odd random bit of WTF code inserted per reply) with 3 layers left.
I strongly suspect these are the places where the model is transitioning in/out of its latent space representation and should definitely not be messed with.
So this leaves everything between these layers as likely the place where 3blue1brown's "addition of vectors interpretation" makes sense (ie: the residual stream hypothesis), and this in turn leads us to scaling o_proj and down_proj:
From: LLaMA Pro: Progressive LLaMA with Block Expansion
So now forgetting about the QK-attenuation stuff from earlier in the thread, the question is: why are we overlapping/interleaving the blocks of layers? Could it be that this just happens to add lots of noise to the vectors being added and is a sort of accidental/rudimentary way of scaling the vectors to avoid overshooting!?
So if that is the case, and we can actually work out the correct scale factor for o_proj and down_proj, then it actually makes much more sense to just duplicate alternate layers like in the LLaMA Pro: Progressive LLaMA with Block Expansion paper (being careful to avoid messing with the first ~10 layers for reasons explained above):
- sources:
- model: *MODEL1
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [11, 12]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [11, 12]
<<: *scale_filter_env
.
.
.
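The alternating pattern above is tedious to write by hand, so here is a small generator sketch in Python. It only prints text. The indentation, and the placement of the `<<:` merge key inside each source entry, are my assumptions about the intended YAML layout, and `interleaved_slices` is a hypothetical helper rather than part of mergekit.

```python
def interleaved_slices(first_layer, last_layer,
                       models=("*MODEL1", "*MODEL2"),
                       env="*scale_filter_env"):
    """Return YAML text with one single-layer slice per model for each layer.

    Layers run from first_layer (inclusive) to last_layer (exclusive), and each
    slice covers layer_range [i, i + 1] as in the hand-written example above.
    """
    lines = []
    for layer in range(first_layer, last_layer):
        for model in models:
            lines.extend([
                "  - sources:",
                f"      - model: {model}",
                f"        layer_range: [{layer}, {layer + 1}]",
                f"        <<: {env}",
            ])
    return "\n".join(lines)

# Print the first two duplicated layers as a preview.
print(interleaved_slices(10, 12))
```

Calling `interleaved_slices(10, 70)` would produce the slices for layers 10 to 69, which can then be pasted between the unscaled first and last blocks.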
So using the numbers I found a few posts ago for uncorrelated vectors (ie: multi-model merges), we get:
Scale: o_proj = down_proj = sqrt(1/2) ≈ 0.7071067812
and with the attenuation modification (inspired by the Hopfield Networks is All You Need and Transformers Learn Higher-Order Optimization Methods for In-Context Learning papers linked in the OP), we get:
Scale: q_proj = k_proj = sqrt(sqrt(1/2)) ≈ 0.8408964153
and using these numbers goliath-test is born:
# > mergekit-yaml --verbose --cuda goliath-test.yaml goliath-test
# > ~/LLMs/llama.cpp/convert.py goliath-test --outfile goliath-test-f16.gguf --outtype f16
# > ~/LLMs/llama.cpp/build/bin/imatrix --chunks 200 -ngl 40 -m goliath-test-f16.gguf -f ~/LLMs/misc/datasets/groups_merged/groups_merged.txt -o goliath-test-f16.imatrix
# > ~/LLMs/llama.cpp/build/bin/quantize --imatrix goliath-test-f16.imatrix goliath-test-f16.gguf goliath-test-q5_K_M.gguf Q5_K_M 12
# The models we are going to use.
const_tag: &MODEL1 Xwin-LM-70B-V0.1
const_tag: &MODEL2 Euryale-1.3-L2-70B
# The amount to scale the residual contributions and the amount to attenuate the Q and K matrices.
# Correlation = 1 --> RESIDUAL_SCALE_FACTOR = 1/2 [≈ 0.5] & QK_ATTENUATION_FACTOR = sqrt(1/2) [≈ 0.7071067812]
# Correlation = 0 --> RESIDUAL_SCALE_FACTOR = sqrt(1/2) [≈ 0.7071067812] & QK_ATTENUATION_FACTOR = sqrt(sqrt(1/2)) [≈ 0.8408964153]
const_tag: &RESIDUAL_SCALE_FACTOR 0.7071067812 # Set to 1.0 to leave unchanged.
const_tag: &QK_ATTENUATION_FACTOR 0.8408964153 # Set to 1.0 to leave unchanged.
# How we are going to scale each matched tensor type.
scale-filter-env: &scale_filter_env
parameters:
scale:
- filter: o_proj
value: *RESIDUAL_SCALE_FACTOR
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- filter: q_proj
value: *QK_ATTENUATION_FACTOR
- filter: k_proj
value: *QK_ATTENUATION_FACTOR
- value: 1.0
slices:
# The first 10 layers are not duplicated.
- sources:
- model: *MODEL1
layer_range: [0, 10]
# BASH: for i in {10..69}; do j=$((i+1)); echo " - sources:\n - model: *MODEL1\n layer_range: [$i, $j]\n <<: *scale_filter_env\n - sources:\n - model: *MODEL2\n layer_range: [$i, $j]\n <<: *scale_filter_env"; echo -n; done
- sources:
- model: *MODEL1
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [11, 12]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [11, 12]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [12, 13]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [12, 13]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [13, 14]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [13, 14]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [14, 15]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [14, 15]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [15, 16]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [15, 16]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [16, 17]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [16, 17]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [17, 18]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [17, 18]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [18, 19]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [18, 19]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [19, 20]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [19, 20]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [20, 21]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [20, 21]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [21, 22]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [21, 22]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [22, 23]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [22, 23]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [23, 24]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [23, 24]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [24, 25]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [24, 25]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [25, 26]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [25, 26]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [26, 27]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [26, 27]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [27, 28]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [27, 28]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [28, 29]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [28, 29]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [29, 30]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [29, 30]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [30, 31]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [30, 31]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [31, 32]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [31, 32]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [32, 33]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [32, 33]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [33, 34]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [33, 34]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [34, 35]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [34, 35]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [35, 36]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [35, 36]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [36, 37]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [36, 37]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [37, 38]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [37, 38]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [38, 39]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [38, 39]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [39, 40]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [39, 40]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [40, 41]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [40, 41]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [41, 42]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [41, 42]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [42, 43]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [42, 43]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [43, 44]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [43, 44]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [44, 45]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [44, 45]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [45, 46]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [45, 46]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [46, 47]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [46, 47]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [47, 48]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [47, 48]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [48, 49]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [48, 49]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [49, 50]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [49, 50]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [50, 51]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [50, 51]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [51, 52]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [51, 52]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [52, 53]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [52, 53]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [53, 54]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [53, 54]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [54, 55]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [54, 55]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [55, 56]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [55, 56]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [56, 57]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [56, 57]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [57, 58]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [57, 58]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [58, 59]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [58, 59]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [59, 60]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [59, 60]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [60, 61]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [60, 61]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [61, 62]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [61, 62]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [62, 63]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [62, 63]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [63, 64]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [63, 64]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [64, 65]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [64, 65]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [65, 66]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [65, 66]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [66, 67]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [66, 67]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [67, 68]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [67, 68]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [68, 69]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [68, 69]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [69, 70]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [69, 70]
<<: *scale_filter_env
# The last 10 layers are not duplicated.
- sources:
- model: *MODEL1
layer_range: [70, 80]
merge_method: passthrough
dtype: float16
Who appears to be completely coherent!!!!
Write me the opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Use third person personal and feature internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small mirror. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert.
The sun beat down mercilessly on the parched earth, casting a shimmering haze over the desolate landscape. In the distance, the crumbling remains of a once-thriving town stood as a testament to the passage of time and the cruelty of the desert. The wind howled through the empty streets, carrying with it the whispers of ghosts long forgotten.
Dust swirled in the breeze as a lone figure stumbled into view, his once-yellow robes now stained and tattered from days of relentless travel. His name was long forgotten, erased from his memory by the cult that had raised him, but he knew he could no longer be a part of their twisted beliefs. He was a man on the run, a heretic in the eyes of his former brethren, and he would do anything to escape their clutches.
The cultist, now a fugitive, entered the ghost town, his eyes darting warily from side to side. He knew that his pursuers couldn't be far behind, and he needed to find shelter before they caught up to him. The sun was setting, casting long shadows across the abandoned streets, and he knew that nightfall would bring with it a different kind of danger.
As he walked, he fingered the small, mysterious mirror tucked securely in his robes. It was the only possession he had managed to take with him when he fled the cult's stronghold, and he had no idea of its significance. All he knew was that it was important enough for the cult to send their most skilled assassins after him.
The cultist ducked into the first building he came across, a ramshackle tavern that looked as if it had been abandoned in a hurry. Dust covered every surface, and the stale smell of old ale and despair hung heavy in the air. He moved cautiously through the dimly lit interior, his heart pounding in his chest, and climbed the creaking stairs to the upper floor.
In a room at the end of the hall, he found a small, dust-covered mirror hanging crookedly on the wall. The cultist removed the mirror from his robes and held it up to the flickering light of the candle he had lit. As he peered into the reflective surface, he gasped in shock. The face staring back at him was not his own, but that of a stranger.
His thoughts raced as he tried to make sense of the impossible. How could this be? He had no memories of his life before the cult, and yet here he was, looking at the face of a stranger. The sound of footsteps on the stairs jolted him from his reverie, and he knew he had no time to ponder the mystery now.
Quickly, he extinguished the candle and crept to the window, peering through the grimy glass. Below, he saw the unmistakable forms of his pursuers, their hooded figures casting long shadows in the moonlight. He knew they would stop at nothing until they had recovered the mirror and brought him back to face the cult's justice.
As he watched, a plan began to form in his mind. He would use the darkness and the abandoned town to his advantage, leading his pursuers on a deadly game of cat and mouse. He was no longer the naive initiate they had indoctrinated; he was a survivor, and he would do whatever it took to stay alive.
The cultist, now a fugitive on the run, slipped into the shadows, his heart pounding in his chest. The game had only just begun.
Can you compare and contrast the writing styles of Joe Abercrombie and Rob J Hayes? They seem very similar to me...
Joe Abercrombie and Rob J. Hayes are both popular fantasy authors known for their gritty and dark storytelling. While their writing styles share some similarities, there are also some key differences.
In summary, while both Abercrombie and Hayes write dark, gritty fantasy with complex characters and vivid worlds, their writing styles differ in terms of pacing, humor, and the level of detail in their worldbuilding and character development.
Can you give a quick summary of the most popular 2-3 works of each for me?
Joe Abercrombie:
Rob J. Hayes:
So again using the numbers I found a few posts ago for correlated vectors (ie: self-merges), we get:
Scale: o_proj = down_proj = 0.5
and with the attenuation modification, we get:
Scale: q_proj = k_proj = sqrt(1/2) ≈ 0.7071067812
and using these numbers, miqu-test is born:
# > mergekit-yaml --verbose --cuda --clone-tensors miqu-test.yaml miqu-test
# > ~/LLMs/llama.cpp/convert.py miqu-test --outfile miqu-test-f16.gguf --outtype f16
# > ~/LLMs/llama.cpp/build/bin/imatrix --chunks 200 -ngl 40 -m miqu-test-f16.gguf -f ~/LLMs/misc/datasets/groups_merged/groups_merged.txt -o miqu-test-f16.imatrix
# > ~/LLMs/llama.cpp/build/bin/quantize --imatrix miqu-test-f16.imatrix miqu-test-f16.gguf miqu-test-q5_K_M.gguf Q5_K_M 12
# The models we are going to use.
const_tag: &MODEL1 miqu-1-70b-sf
const_tag: &MODEL2 miqu-1-70b-sf
# The amount to scale the residual contributions and the amount to attenuate the Q and K matrices.
# Correlation = 0 --> RESIDUAL_SCALE_FACTOR = sqrt(1/2) [≈ 0.7071067812] & QK_ATTENUATION_FACTOR = sqrt(sqrt(1/2)) [≈ 0.8408964153]
# Correlation = 1 --> RESIDUAL_SCALE_FACTOR = 1/2 [≈ 0.5] & QK_ATTENUATION_FACTOR = sqrt(1/2) [≈ 0.7071067812]
const_tag: &RESIDUAL_SCALE_FACTOR 0.5 # Set to 1.0 to leave unchanged.
const_tag: &QK_ATTENUATION_FACTOR 0.7071067812 # Set to 1.0 to leave unchanged.
# How we are going to scale each matched tensor type.
scale-filter-env: &scale_filter_env
  parameters:
    scale:
      - filter: o_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: down_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: q_proj
        value: *QK_ATTENUATION_FACTOR
      - filter: k_proj
        value: *QK_ATTENUATION_FACTOR
      - value: 1.0
slices:
  # The first 10 layers are not duplicated.
  - sources:
    - model: *MODEL1
      layer_range: [0, 10]
  # BASH: for i in {10..69}; do j=$((i+1)); echo "  - sources:\n    - model: *MODEL1\n      layer_range: [$i, $j]\n      <<: *scale_filter_env\n  - sources:\n    - model: *MODEL2\n      layer_range: [$i, $j]\n      <<: *scale_filter_env"; echo -n; done
  - sources:
    - model: *MODEL1
      layer_range: [10, 11]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [10, 11]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [11, 12]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [11, 12]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [12, 13]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [12, 13]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [13, 14]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [13, 14]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [14, 15]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [14, 15]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [15, 16]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [15, 16]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [16, 17]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [16, 17]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [17, 18]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [17, 18]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [18, 19]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [18, 19]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [19, 20]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [19, 20]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [20, 21]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [20, 21]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [21, 22]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [21, 22]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [22, 23]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [22, 23]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [23, 24]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [23, 24]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [24, 25]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [24, 25]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [25, 26]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [25, 26]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [26, 27]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [26, 27]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [27, 28]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [27, 28]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [28, 29]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [28, 29]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [29, 30]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [29, 30]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [30, 31]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [30, 31]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [31, 32]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [31, 32]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [32, 33]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [32, 33]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [33, 34]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [33, 34]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [34, 35]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [34, 35]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [35, 36]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [35, 36]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [36, 37]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [36, 37]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [37, 38]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [37, 38]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [38, 39]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [38, 39]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [39, 40]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [39, 40]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [40, 41]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [40, 41]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [41, 42]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [41, 42]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [42, 43]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [42, 43]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [43, 44]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [43, 44]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [44, 45]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [44, 45]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [45, 46]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [45, 46]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [46, 47]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [46, 47]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [47, 48]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [47, 48]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [48, 49]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [48, 49]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [49, 50]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [49, 50]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [50, 51]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [50, 51]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [51, 52]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [51, 52]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [52, 53]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [52, 53]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [53, 54]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [53, 54]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [54, 55]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [54, 55]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [55, 56]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [55, 56]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [56, 57]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [56, 57]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [57, 58]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [57, 58]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [58, 59]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [58, 59]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [59, 60]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [59, 60]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [60, 61]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [60, 61]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [61, 62]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [61, 62]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [62, 63]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [62, 63]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [63, 64]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [63, 64]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [64, 65]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [64, 65]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [65, 66]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [65, 66]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [66, 67]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [66, 67]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [67, 68]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [67, 68]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [68, 69]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [68, 69]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [69, 70]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [69, 70]
      <<: *scale_filter_env
  # The last 10 layers are not duplicated.
  - sources:
    - model: *MODEL1
      layer_range: [70, 80]
merge_method: passthrough
dtype: float16
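The repeated slices in the config above follow a fixed pattern, so they can be generated rather than written by hand. Below is a minimal Python sketch of that idea, as an alternative to the BASH one-liner in the config comment. The function names (`build_slices`, `merged_layer_count`) and the exact indentation it emits are my own assumptions for illustration and are not part of mergekit; the 80-layer default matches the 70b models used here.

```python
def build_slices(models, dup_start, dup_end, total_layers=80, env="scale_filter_env"):
    """Return the text of a `slices:` section that interleaves one copy of each
    layer in [dup_start, dup_end) per model, leaving the other layers as-is.

    `models` holds YAML anchor names (e.g. "MODEL1"); this is a hypothetical
    helper, not a mergekit API.
    """
    lines = ["slices:"]

    def add(model, start, end, scaled):
        lines.append("  - sources:")
        lines.append(f"    - model: *{model}")
        lines.append(f"      layer_range: [{start}, {end}]")
        if scaled:
            # Pull in the scale filters defined under the &scale_filter_env anchor.
            lines.append(f"      <<: *{env}")

    if dup_start > 0:
        add(models[0], 0, dup_start, scaled=False)   # leading layers, not duplicated
    for i in range(dup_start, dup_end):
        for model in models:
            add(model, i, i + 1, scaled=True)        # one scaled copy per model
    if dup_end < total_layers:
        add(models[0], dup_end, total_layers, scaled=False)  # trailing layers
    return "\n".join(lines)


def merged_layer_count(n_models, dup_start, dup_end, total_layers=80):
    """Number of layers in the merged model for the layout above."""
    return total_layers + (n_models - 1) * (dup_end - dup_start)


slices_text = build_slices(["MODEL1", "MODEL2"], 10, 70)
print(merged_layer_count(2, 10, 70))  # layers in the self-merge layout above
```

For the layout above (`dup_start=10`, `dup_end=70`, two models) the count comes out at 140 layers. The generated text still needs the `const_tag` and `scale-filter-env` header in front of it and `merge_method` / `dtype` after it.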
The resulting miqu-test model seems pretty smart:
Two people are in a room alone with a machine that requires two people to operate. One person is operating the machine. The machine is functioning properly. What is the other person doing?
The other person is also operating the machine. The statement "requires two people to operate" implies that both individuals are actively involved in operating the machine. Therefore, the other person is not just present in the room but is actively participating in operating the machine as well.
Sally has 3 sisters. Each of her sisters has one brother. How many sisters does Sally have?
Sally has 3 sisters. The information about each sister having a brother does not affect the number of sisters Sally has. Therefore, Sally still has 3 sisters. The information about their brothers does not change the number of sisters Sally has.
Confusion might arise because the question mentions brothers, but it is not relevant to the question at hand, which is about the number of sisters Sally has. The number of brothers or brothers-in-law does not change the number of sisters Sally has.
So, Sally has 3 sisters.
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
Sally has 2 sisters.
Here's the reasoning:
Therefore, Sally has 2 sisters in total, including herself. However, this question can be confusing because it might lead some people to think that each brother has a different set of sisters, but in fact, they share the same sisters. So, Sally has 1 sister in addition to herself.
Confidence: 99%
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Think step by step as this is a tricky question:
Let's break this down:
Therefore, Sally has 1 sister.
Confusion might arise because the question mentions that each brother has 2 sisters, but it's important to remember that Sally is also counted as one of the sisters.
Final answer: Sally has 1 sister.
Solve this family relationship quiz: Marie is Emma's parent. Emma is Carl's parent. Carl is Timothy's parent. Timothy is male. Who's Timothy to Marie?
Timothy is Marie's grandchild.
Here's the reasoning:
From these relationships, we can deduce that:
Therefore, Timothy is Marie's grandchild.
Solve this family relationship quiz: Betty is Brenda's parent. Betty is Joe's parent. Joe is Charles' parent. Brenda is female. Who's Brenda to Charles?
Brenda is Charles' aunt.
Here's the reasoning:
Solve this family relationship quiz: Nathan is Timothy's parent. Nathan is Aaron's parent. Aaron is Edward's parent. Timothy is male. Who's Timothy to Edward?
Timothy is Edward's grandfather.
Here's the reasoning:
Therefore, Timothy is Edward's grandfather. However, the question asks for Timothy's relationship to Edward, so the answer is that Timothy is Edward's grandfather.
Solve this family relationship quiz: Nathan is Timothy's parent. Nathan is Aaron's parent. Aaron is Edward's parent. Timothy is male. Who's Timothy to Edward? Think step by step as this is a tricky problem:
Let's break this down step by step:
So, Timothy is Edward's uncle.
Final answer: Timothy is Edward's uncle.
I can't really comment further as I have no real principled way to test these model configs other than for coherency, storytelling and this handful of puzzles... Hopefully others can do that and also see if the configs can be tweaked to work even better.
So I think I'm largely done with this now. I only have 2 requests for people who want to carry on experimenting:
goliath-test and miqu-test :grin:
Big thanks to @cg123 for adding the scale parameter - this has been a lot of fun to work on!
Triple merging seems to work OK too. Sadly I couldn't get the 200b mix of 3 models to work in llama.cpp:
GGML_ASSERT: llama.cpp/ggml.c:16521: i != GGML_HASHTABLE_FULL
So I had to leave the last 30 layers unmerged to get a 160b model:
# > mergekit-yaml --verbose --cuda tri-test.yaml tri-test
# > ~/LLMs/llama.cpp/convert.py tri-test --outfile tri-test-f16.gguf --outtype f16
# > ~/LLMs/llama.cpp/build/bin/imatrix --chunks 200 -ngl 40 -m tri-test-f16.gguf -f ~/LLMs/misc/datasets/groups_merged/groups_merged.txt -o tri-test-f16.imatrix
# > ~/LLMs/llama.cpp/build/bin/quantize --imatrix tri-test-f16.imatrix tri-test-f16.gguf tri-test-q3_K_M.gguf Q3_K_M 12
# The models we are going to use.
const_tag: &MODEL1 Xwin-LM-70B-V0.1
const_tag: &MODEL2 Euryale-1.3-L2-70B
const_tag: &MODEL3 WinterGoddess-1.4x-70B-L2
# The amount to scale the residual contributions and the amount to attenuate the Q and K matrices.
# Correlation = 0 --> RESIDUAL_SCALE_FACTOR = sqrt(1/n) & QK_ATTENUATION_FACTOR = sqrt(sqrt(1/n))
# Correlation = 1 --> RESIDUAL_SCALE_FACTOR = 1/n & QK_ATTENUATION_FACTOR = sqrt(1/n)
const_tag: &RESIDUAL_SCALE_FACTOR 0.5773502692 # Set to 1.0 to leave unchanged.
const_tag: &QK_ATTENUATION_FACTOR 0.7598356857 # Set to 1.0 to leave unchanged.
# How we are going to scale each matched tensor type.
scale-filter-env: &scale_filter_env
  parameters:
    scale:
      - filter: o_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: down_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: q_proj
        value: *QK_ATTENUATION_FACTOR
      - filter: k_proj
        value: *QK_ATTENUATION_FACTOR
      - value: 1.0
slices:
  # The first 10 layers are not duplicated.
  - sources:
    - model: *MODEL1
      layer_range: [0, 10]
  # BASH: for i in {10..49}; do j=$((i+1)); echo "  - sources:\n    - model: *MODEL1\n      layer_range: [$i, $j]\n      <<: *scale_filter_env\n  - sources:\n    - model: *MODEL2\n      layer_range: [$i, $j]\n      <<: *scale_filter_env\n  - sources:\n    - model: *MODEL3\n      layer_range: [$i, $j]\n      <<: *scale_filter_env"; echo -n; done
  - sources:
    - model: *MODEL1
      layer_range: [10, 11]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [10, 11]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [10, 11]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [11, 12]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [11, 12]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [11, 12]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [12, 13]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [12, 13]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [12, 13]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [13, 14]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [13, 14]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [13, 14]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [14, 15]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [14, 15]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [14, 15]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [15, 16]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [15, 16]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [15, 16]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [16, 17]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [16, 17]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [16, 17]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [17, 18]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [17, 18]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [17, 18]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [18, 19]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [18, 19]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [18, 19]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [19, 20]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [19, 20]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [19, 20]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [20, 21]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [20, 21]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [20, 21]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [21, 22]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [21, 22]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [21, 22]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [22, 23]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [22, 23]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [22, 23]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [23, 24]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [23, 24]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [23, 24]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [24, 25]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [24, 25]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [24, 25]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [25, 26]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [25, 26]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [25, 26]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [26, 27]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [26, 27]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [26, 27]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [27, 28]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [27, 28]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [27, 28]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [28, 29]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [28, 29]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [28, 29]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [29, 30]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [29, 30]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [29, 30]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [30, 31]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [30, 31]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [30, 31]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [31, 32]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [31, 32]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [31, 32]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [32, 33]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [32, 33]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [32, 33]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [33, 34]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [33, 34]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [33, 34]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [34, 35]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [34, 35]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [34, 35]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [35, 36]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [35, 36]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [35, 36]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [36, 37]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [36, 37]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [36, 37]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [37, 38]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [37, 38]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [37, 38]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [38, 39]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [38, 39]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [38, 39]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [39, 40]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [39, 40]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [39, 40]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [40, 41]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [40, 41]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [40, 41]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [41, 42]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [41, 42]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [41, 42]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [42, 43]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [42, 43]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [42, 43]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [43, 44]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [43, 44]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [43, 44]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [44, 45]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [44, 45]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [44, 45]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [45, 46]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [45, 46]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [45, 46]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [46, 47]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [46, 47]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [46, 47]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [47, 48]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [47, 48]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [47, 48]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [48, 49]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [48, 49]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [48, 49]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [49, 50]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [49, 50]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL3
      layer_range: [49, 50]
      <<: *scale_filter_env
  # The last 30 layers are not duplicated.
  - sources:
    - model: *MODEL1
      layer_range: [50, 80]
merge_method: passthrough
dtype: float16
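The two constants in the config above follow directly from the formulas in its comments. Here is a small Python sketch that evaluates them for any number of stacked copies `n`; the function name `scale_factors` is hypothetical and only there for illustration. Note that the values used in this tri-merge (0.5773502692 and 0.7598356857) correspond to the "Correlation = 0" row with n = 3, while the miqu self-merge uses the "Correlation = 1" row with n = 2.

```python
import math


def scale_factors(n, correlated):
    """Return (RESIDUAL_SCALE_FACTOR, QK_ATTENUATION_FACTOR) for n stacked copies.

    Correlation = 1 (e.g. self-merges):  residual = 1/n,       qk = sqrt(1/n)
    Correlation = 0:                     residual = sqrt(1/n), qk = sqrt(sqrt(1/n))
    In both rows the Q/K factor is the square root of the residual factor.
    """
    residual = 1.0 / n if correlated else math.sqrt(1.0 / n)
    qk = math.sqrt(residual)
    return residual, qk


for n, correlated in [(2, True), (2, False), (3, False), (3, True)]:
    residual, qk = scale_factors(n, correlated)
    print(f"n={n} correlated={correlated}: residual={residual:.10f} qk={qk:.10f}")
```

Whether the true correlation between the stacked layers is closer to 0 or 1 for a given set of models is an open question in this thread, so intermediate values may be worth trying too.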
### USER
Write me the opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Use third person personal and feature internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small mirror. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert.
### ASSISTANT
The sun beat down mercilessly on the sandy horizon, casting a hazy glare over the desert landscape. In the distance, an abandoned village loomed like a mirage, its crumbling walls and dilapidated buildings silent witnesses to better times long past.
Kael fled through the cracked and eroded streets of the ghost town, his heart pounding in his chest as he glanced over his shoulder. His pursuers were close behind him, their shouts echoing through the narrow alleyways like vengeful spirits. He had to find a place to hide, to catch his breath and figure out what the hell he'd gotten himself into this time.
He turned a corner, ducking into an abandoned shop, its windows broken and door hanging on rusted hinges. Kael collapsed against the wall, his chest heaving as he struggled to regain his breath.
His mind raced with thoughts of how he'd ended up in this predicament. He was once a respected member of the Order, a cult that worshiped dark forces for power and influence. But something had changed him; Kael couldn't pinpoint when it happened, but he knew deep down in his heart that the path he walked wasn't right. It was then that he made up his mind to leave... to escape the cult and find a new life.
The cultists, however, weren't willing to let him go without a fight. They'd chased him halfway across the land, relentless in their pursuit. Kael knew they wouldn't stop until he was dead or back within their ranks.
As Kael sat in the darkness of the abandoned shop, he felt a small mirror in his pocket. He pulled it out and studied his reflection; gaunt features stared back at him, eyes sunken into their sockets and dark circles under them. His once-long brown hair had now turned prematurely gray at the temples, adding years to his appearance.
Kael sighed and pocketed the mirror, knowing he couldn't stay in one place too long. He needed to keep moving, to find someplace safe until he could figure out what to do next.
The sun dipped below the horizon as Kael stepped out of his hiding place, blending into the shadows cast by the setting sun. He headed into the night, determined to survive another day...
Interesting, and it matches exactly what I had theorised a while ago. At the time it did not work though, and I started exploring the size of the overlaps. Now I realise what I was missing was the attenuation.
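To make the attenuation concrete, here is a toy Python check of the claim made earlier in the thread: scaling both the query and key projections by sqrt(1/2) halves every attention score, which in turn softens the softmax. The vectors are made up for illustration, and the usual 1/sqrt(d_k) divisor and positional encoding are left out since they do not change the ratio.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scores(q, keys, factor=1.0):
    """Dot-product scores after scaling the query and every key by `factor`,
    which is what scaling q_proj and k_proj by `factor` does to their outputs."""
    qs = [factor * v for v in q]
    return [sum(a * (factor * b) for a, b in zip(qs, k)) for k in keys]


q = [1.0, 2.0]                                 # toy query vector
keys = [[2.0, 1.0], [0.0, 1.0], [1.0, 1.0]]    # toy key vectors

base = scores(q, keys)
attenuated = scores(q, keys, math.sqrt(0.5))

print(base, softmax(base))
print(attenuated, softmax(attenuated))
```

The attenuated scores are half the original ones (up to floating-point rounding), and the largest softmax weight drops, i.e. attention becomes less sharp.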
I am currently benchmarking the miqu self-merge. However, I find 120b (140 layers) a bit too large. Maybe we can combine it with this: only duplicate the most important layers, according to exl2 measurements, with the exception of the beginning and the end layers. What I am proposing is the following for a 96b model (112 layers):
# The models we are going to use.
const_tag: &MODEL1 152334H/miqu-1-70b-sf
const_tag: &MODEL2 152334H/miqu-1-70b-sf
# The amount to scale the residual contributions and the amount to attenuate the Q and K matrices.
# Correlation = 0 --> RESIDUAL_SCALE_FACTOR = sqrt(1/2) [≈ 0.7071067812] & QK_ATTENUATION_FACTOR = sqrt(sqrt(1/2)) [≈ 0.8408964153]
# Correlation = 1 --> RESIDUAL_SCALE_FACTOR = 1/2 [≈ 0.5] & QK_ATTENUATION_FACTOR = sqrt(1/2) [≈ 0.7071067812]
const_tag: &RESIDUAL_SCALE_FACTOR 0.5 # Set to 1.0 to leave unchanged.
const_tag: &QK_ATTENUATION_FACTOR 0.7071067812 # Set to 1.0 to leave unchanged.
# How we are going to scale each matched tensor type.
scale-filter-env: &scale_filter_env
  parameters:
    scale:
      - filter: o_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: down_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: q_proj
        value: *QK_ATTENUATION_FACTOR
      - filter: k_proj
        value: *QK_ATTENUATION_FACTOR
      - value: 1.0
slices:
  # The first 10 layers are not duplicated.
  - sources:
    - model: *MODEL1
      layer_range: [0, 10]
  - sources:
    - model: *MODEL1
      layer_range: [10, 11]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [10, 11]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [11, 12]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [11, 12]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [12, 13]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [12, 13]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [13, 14]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [13, 14]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [14, 15]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [14, 15]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [15, 16]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [15, 16]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [16, 17]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [16, 17]
      <<: *scale_filter_env
  # Layers 17-21 are not duplicated (less relevant according to exl2 measurements)
  - sources:
    - model: *MODEL1
      layer_range: [17, 21]
  - sources:
    - model: *MODEL1
      layer_range: [21, 22]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [21, 22]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [22, 23]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [22, 23]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [23, 24]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [23, 24]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [24, 25]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [24, 25]
      <<: *scale_filter_env
  # Layers 25-49 are not duplicated (less relevant according to exl2 measurements)
  - sources:
    - model: *MODEL1
      layer_range: [25, 49]
  - sources:
    - model: *MODEL1
      layer_range: [49, 50]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [49, 50]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [50, 51]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [50, 51]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [51, 52]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [51, 52]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [52, 53]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [52, 53]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [53, 54]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [53, 54]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [54, 55]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [54, 55]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [55, 56]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [55, 56]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [56, 57]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [56, 57]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [57, 58]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [57, 58]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [58, 59]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [58, 59]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [59, 60]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [59, 60]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [60, 61]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [60, 61]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [61, 62]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [61, 62]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [62, 63]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [62, 63]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [63, 64]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [63, 64]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [64, 65]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [64, 65]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [65, 66]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [65, 66]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [66, 67]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [66, 67]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [67, 68]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL2
      layer_range: [67, 68]
      <<: *scale_filter_env
  - sources:
    - model: *MODEL1
      layer_range: [68, 69]
      <<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [68, 69]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [69, 70]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [69, 70]
<<: *scale_filter_env
# The last 10 layers are not duplicated.
- sources:
- model: *MODEL1
layer_range: [70, 80]
merge_method: passthrough
dtype: float16
Yeah, this should hopefully open up a lot of new ways of merging! I'm keen to see if a non-interleaved version of goliath-120b can be created.
I've just added a second NVME that will save me from thrashing the main NVME to death, and hopefully I can set up some way to automatically optimise these parameters soon.
Just tried this:
# > mergekit-yaml --verbose --cuda goliath-alternating.yaml goliath-alternating
# > ~/LLMs/llama.cpp/convert.py goliath-alternating --outfile goliath-alternating-f16.gguf --outtype f16
# > ~/LLMs/llama.cpp/build/bin/imatrix --chunks 200 -ngl 40 -m goliath-alternating-f16.gguf -f ~/LLMs/misc/datasets/groups_merged/groups_merged.txt -o goliath-alternating-f16.imatrix
# > ~/LLMs/llama.cpp/build/bin/quantize --imatrix goliath-alternating-f16.imatrix goliath-alternating-f16.gguf goliath-alternating-q5_K_M.gguf Q5_K_M 12
# The models we are going to use.
const_tag: &MODEL1 Xwin-LM-70B-V0.1
const_tag: &MODEL2 Euryale-1.3-L2-70B
# The amount to scale the residual contributions and the amount to attenuate the Q and K matrices.
# Correlation = 0 --> RESIDUAL_SCALE_FACTOR = sqrt(1/2) [≈ 0.7071067812] & QK_ATTENUATION_FACTOR = sqrt(sqrt(1/2)) [≈ 0.8408964153]
# Correlation = 1 --> RESIDUAL_SCALE_FACTOR = 1/2 [≈ 0.5] & QK_ATTENUATION_FACTOR = sqrt(1/2) [≈ 0.7071067812]
const_tag: &RESIDUAL_SCALE_FACTOR 0.7071067812 # Set to 1.0 to leave unchanged.
const_tag: &QK_ATTENUATION_FACTOR 0.8408964153 # Set to 1.0 to leave unchanged.
# How we are going to scale each matched tensor type.
scale-filter-env: &scale_filter_env
  parameters:
    scale:
      - filter: o_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: down_proj
        value: *RESIDUAL_SCALE_FACTOR
      - filter: q_proj
        value: *QK_ATTENUATION_FACTOR
      - filter: k_proj
        value: *QK_ATTENUATION_FACTOR
      - value: 1.0
slices:
# The first 8 layers are not duplicated.
- sources:
- model: *MODEL1
layer_range: [0, 8]
# BASH: for i in {8..71}; do j=$((i+1)); echo " - sources:\n - model: *MODEL1\n layer_range: [$i, $j]\n <<: *scale_filter_env\n - sources:\n - model: *MODEL2\n layer_range: [$i, $j]\n <<: *scale_filter_env"; echo -n; done
- sources:
- model: *MODEL1
layer_range: [8, 9]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [8, 9]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [9, 10]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [9, 10]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [11, 12]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [11, 12]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [12, 13]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [12, 13]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [13, 14]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [13, 14]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [14, 15]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [14, 15]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [15, 16]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [15, 16]
<<: *scale_filter_env
# Layer 16: No Xwin.
- sources:
- model: *MODEL2
layer_range: [16, 17]
- sources:
- model: *MODEL1
layer_range: [17, 18]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [17, 18]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [18, 19]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [18, 19]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [19, 20]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [19, 20]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [20, 21]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [20, 21]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [21, 22]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [21, 22]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [22, 23]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [22, 23]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [23, 24]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [23, 24]
<<: *scale_filter_env
# Layer 24: No Euryale.
- sources:
- model: *MODEL1
layer_range: [24, 25]
- sources:
- model: *MODEL1
layer_range: [25, 26]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [25, 26]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [26, 27]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [26, 27]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [27, 28]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [27, 28]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [28, 29]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [28, 29]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [29, 30]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [29, 30]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [30, 31]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [30, 31]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [31, 32]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [31, 32]
<<: *scale_filter_env
# Layer 32: No Xwin.
- sources:
- model: *MODEL2
layer_range: [32, 33]
- sources:
- model: *MODEL1
layer_range: [33, 34]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [33, 34]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [34, 35]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [34, 35]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [35, 36]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [35, 36]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [36, 37]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [36, 37]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [37, 38]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [37, 38]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [38, 39]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [38, 39]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [39, 40]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [39, 40]
<<: *scale_filter_env
# Layer 40: No Euryale.
- sources:
- model: *MODEL1
layer_range: [40, 41]
- sources:
- model: *MODEL1
layer_range: [41, 42]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [41, 42]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [42, 43]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [42, 43]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [43, 44]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [43, 44]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [44, 45]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [44, 45]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [45, 46]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [45, 46]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [46, 47]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [46, 47]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [47, 48]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [47, 48]
<<: *scale_filter_env
# Layer 48: No Xwin.
- sources:
- model: *MODEL2
layer_range: [48, 49]
- sources:
- model: *MODEL1
layer_range: [49, 50]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [49, 50]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [50, 51]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [50, 51]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [51, 52]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [51, 52]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [52, 53]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [52, 53]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [53, 54]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [53, 54]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [54, 55]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [54, 55]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [55, 56]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [55, 56]
<<: *scale_filter_env
# Layer 56: No Euryale.
- sources:
- model: *MODEL1
layer_range: [56, 57]
- sources:
- model: *MODEL1
layer_range: [57, 58]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [57, 58]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [58, 59]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [58, 59]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [59, 60]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [59, 60]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [60, 61]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [60, 61]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [61, 62]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [61, 62]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [62, 63]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [62, 63]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [63, 64]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [63, 64]
<<: *scale_filter_env
# Layer 64: No Xwin.
- sources:
- model: *MODEL2
layer_range: [64, 65]
- sources:
- model: *MODEL1
layer_range: [65, 66]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [65, 66]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [66, 67]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [66, 67]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [67, 68]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [67, 68]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [68, 69]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [68, 69]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [69, 70]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [69, 70]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [70, 71]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [70, 71]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [71, 72]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [71, 72]
<<: *scale_filter_env
# The last 8 layers are not duplicated.
- sources:
- model: *MODEL1
layer_range: [72, 80]
merge_method: passthrough
dtype: float16
but it's nothing like as good at writing compared to the original goliath-120b. The stories just seem "boring" and it doesn't write really interesting dialogue... The original goliath-120b is just that tiny bit deranged, but that seems to really help him write good stuff IMO.
It does seem to be a lot more coherent, and less prone to weird spelling mistakes and stopping mid-sentence, but we definitely need to find some way to optimize this... I might try to minimise perplexity on actual text generated by goliath-120b to see if that can get back some of his writing ability.
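A rough sketch of how the slices section of the config above could be regenerated rather than hand-edited (the BASH one-liner in its comments emits every layer from 8 to 71, so the seven single-model layers still have to be patched afterwards). The names `source`, `build_slices`, `NO_MODEL1` and `NO_MODEL2` are made up for this example, and placing `<<: *scale_filter_env` inside each source entry is an assumption, since the listing above has lost its indentation.

```python
# Sketch only: rebuilds the alternating `slices` list from the layer numbers
# in the config above.
# ASSUMPTION: the merge key `<<: *scale_filter_env` belongs inside each source
# entry, next to `layer_range`; the flattened listing does not show the nesting.

FIRST, LAST, TOTAL = 8, 72, 80        # layers [8, 72) are interleaved
NO_MODEL1 = {16, 32, 48, 64}          # "No Xwin" layers: MODEL2 only, unscaled
NO_MODEL2 = {24, 40, 56}              # "No Euryale" layers: MODEL1 only, unscaled


def source(model: str, start: int, end: int, scaled: bool) -> str:
    """Return one output slice with a single source, as YAML text."""
    lines = [
        "  - sources:",
        f"    - model: *{model}",
        f"      layer_range: [{start}, {end}]",
    ]
    if scaled:
        lines.append("      <<: *scale_filter_env")
    return "\n".join(lines)


def build_slices() -> str:
    """Return the whole `slices:` section as YAML text."""
    out = ["slices:", source("MODEL1", 0, FIRST, scaled=False)]
    for i in range(FIRST, LAST):
        if i in NO_MODEL1:
            out.append(source("MODEL2", i, i + 1, scaled=False))
        elif i in NO_MODEL2:
            out.append(source("MODEL1", i, i + 1, scaled=False))
        else:
            out.append(source("MODEL1", i, i + 1, scaled=True))
            out.append(source("MODEL2", i, i + 1, scaled=True))
    out.append(source("MODEL1", LAST, TOTAL, scaled=False))
    return "\n".join(out)


if __name__ == "__main__":
    print(build_slices())
```

The printed text would still need the `const_tag` and `scale-filter-env` definitions in front of it, plus `merge_method` and `dtype` after it, to be a complete config.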
So again using the numbers I found a few posts ago for correlated vectors (ie: self-merges), we get:
Scale:
o_proj = down_proj = 0.5
and with the attenuation modification, we get:
Scale:
q_proj = k_proj = sqrt(1/2) ≈ 0.7071067812
and using these numbers, miqu-test is born:
I have almost finished putting miqu-test through all my evaluation tests, and unfortunately I have to report it is a bit brain damaged:
It's definitely an improvement over my experiments with doubled layers and no attenuation, but it reinforces my findings that doubled layers are not viable. I have not finished testing the 96b self-merge with selection based on layer relevance, but so far I am seeing scores lower than what I would expect them to be.
Once I finish with those 2, I will go back to non-doubled layers, and test adding the attenuation to the 103b I posted earlier (it uses merge selection based on layer relevance).
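For reference, both pairs of factors quoted in this thread (0.7071 / 0.8409 for uncorrelated vectors, 0.5 / 0.7071 for correlated ones) drop out of one small formula, assuming the aim is to keep the norm of the sum of two equal-norm residual contributions the same as the norm of one of them. The formula and the function names are an assumption made for this sketch; they are not taken from mergekit, they just reproduce the numbers in the config comments.

```python
import math


def residual_scale(correlation: float) -> float:
    """Scale s such that s * (a + b) has the same norm as a alone.

    Assumes |a| == |b| and cosine similarity `correlation` (must be > -1):
    |a + b|^2 = 2 * |a|^2 * (1 + correlation).
    """
    return 1.0 / math.sqrt(2.0 * (1.0 + correlation))


def qk_attenuation(correlation: float) -> float:
    """Factor to apply to BOTH q_proj and k_proj.

    Scaling each matrix by f scales the q.k scores by f**2, so f is the
    square root of the scale wanted on the scores.
    """
    return math.sqrt(residual_scale(correlation))


if __name__ == "__main__":
    for rho in (0.0, 1.0):
        print(rho, round(residual_scale(rho), 10), round(qk_attenuation(rho), 10))
```

Intermediate correlations (for example two fine-tunes of the same base) would land somewhere between the two quoted pairs, which might be a less arbitrary starting point than picking either extreme.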
Yeah, my attempt didn't work very well with the goliath parents, but I think there's still a long way to go testing the parameter combinations:
- Euryale and Xwin-LM for the inner layers.
- q_proj = k_proj ∈ {sqrt(sqrt(1/2)), 1, 1/sqrt(sqrt(1/2))}.
- o_proj ∈ {sqrt(1/2), 1, 1/sqrt(1/2)}.
- down_proj ∈ {sqrt(1/2), 1, 1/sqrt(1/2)}.
(the 1/x tests are to see the effect of doing the exact opposite).
2 × 3 × 3 × 3 = 54 combinations in total.
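The grid above could be enumerated along these lines, as a sketch. The key names and the two `"A"` / `"B"` labels for the Euryale / Xwin-LM choice are hypothetical placeholders (the thread does not spell out what the two options are); only the scale values come from the list.

```python
import itertools
import math

r = math.sqrt(0.5)   # sqrt(1/2)
a = math.sqrt(r)     # sqrt(sqrt(1/2))

# HYPOTHETICAL key names; "A"/"B" stand in for the two Euryale/Xwin-LM options.
grid = {
    "inner_layer_option": ["A", "B"],
    "qk_scale": [a, 1.0, 1.0 / a],
    "o_proj_scale": [r, 1.0, 1.0 / r],
    "down_proj_scale": [r, 1.0, 1.0 / r],
}


def combinations(grid: dict) -> list:
    """Return every combination of the grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*grid.values())]


runs = combinations(grid)

if __name__ == "__main__":
    print(len(runs))
    print(runs[0])
```

Each dict could then be substituted into the `RESIDUAL_SCALE_FACTOR` / `QK_ATTENUATION_FACTOR` style constants of a config template before running the merge, though o_proj and down_proj would need separate constants to vary independently.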
I've got a second NVME set up that isn't so much hassle to replace if this kills it now.
I'll try and see if I can test all these to see the effect they have.
Finished the evaluations: both miqu-96b and miqu-120b have a poor score and are brain damaged.
Has anyone tried downscaling the K and/or Q matrices for repeated layers in franken-merges? This should act like changing the temperature of the softmax and effectively smooth the distribution:
Hopfield Networks is All You Need https://arxiv.org/abs/2008.02217 https://ml-jku.github.io/hopfield-layers/
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models https://arxiv.org/abs/2310.17086
Empirically I've found repeating large blocks does seem to make models "confidently wrong" - stacking two full copies of deepseek-coder or miqu-1 shows this phenomenon really well.
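The temperature analogy can be checked numerically with a toy example. This is only a sketch: it scales the query and key vectors directly (which is what scaling q_proj and k_proj does when those projections are linear with no bias), and the vectors are arbitrary made-up numbers, not taken from any model.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def attention_weights(q, keys, scale=1.0):
    """Attention weights for one query when q and every key are multiplied by `scale`."""
    d = len(q)
    q_s = [scale * x for x in q]
    scores = [dot(q_s, [scale * x for x in k]) / math.sqrt(d) for k in keys]
    return softmax(scores)


# Arbitrary toy vectors, for illustration only.
q = [1.0, 2.0, 0.5, -1.0]
keys = [[1.0, 1.5, 0.0, -0.5], [0.2, -0.3, 1.0, 0.4], [-1.0, 0.5, 0.5, 1.0]]
f = math.sqrt(math.sqrt(0.5))  # the QK attenuation factor for uncorrelated vectors

scaled = attention_weights(q, keys, scale=f)
base_scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
tempered = softmax(base_scores, temperature=1.0 / f**2)

if __name__ == "__main__":
    print("unscaled :", attention_weights(q, keys))
    print("scaled   :", scaled)
    print("tempered :", tempered)
```

Scaling both q and k by f gives the same weights as leaving them alone and running the softmax at temperature 1/f², so the attenuated distribution is flatter than the unscaled one.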