Open Andrei-Aksionov opened 3 months ago
This might actually fix my OLMo PR in #927 😅
You might want to merge OLMo with an interleaving conversion step because this PR is very risky and a breaking change for all existing checkpoints
@Andrei-Aksionov We need to evaluate if we want to make this change. Especially if there are any performance differences, and whether the risk is worth it. But there are two important considerations:

1. This needs to provide conversion logic since all existing checkpoints (from HF, pretrained, or fine-tuned) will be broken, extending what Adrian is doing in https://github.com/Lightning-AI/lit-gpt/pull/1019.
2. This would need to finish in 1-2 weeks to be released together with the `wip` branch, where we are also making breaking changes.

These are all good questions.
> We need to evaluate if we want to make this change. Especially if there are any performance differences
That's a good question, and the plan is to do some performance benchmarks after the PR is ready.
One of the concerns is that the current code for grouped queries uses the `.expand` method, which just creates another view and should definitely be faster and more memory-efficient than `.repeat_interleave`. But `.expand` makes the tensor non-contiguous, so the subsequent `.reshape` will create a copy of the tensor.
https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/lit_gpt/model.py#L211-L217
So, overall, the performance should be in the same ballpark, but I'll verify it with benchmarks to be on the safe side.
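Roughly the difference in question, as a minimal sketch with made-up shapes (not the exact code from `model.py`):

```python
import torch

# Hypothetical GQA shapes just for illustration: 8 query heads sharing 2 KV groups.
B, T, n_query_groups, head_size = 1, 16, 2, 64
q_per_kv = 4

k = torch.randn(B, n_query_groups, T, head_size)

# Current main: expand creates a zero-stride view (cheap), but the following
# reshape has to materialize a copy because the expanded view is not contiguous.
k_main = (
    k.unsqueeze(2)
    .expand(B, n_query_groups, q_per_kv, T, head_size)
    .reshape(B, n_query_groups * q_per_kv, T, head_size)
)

# This PR: repeat_interleave allocates the copy directly.
k_pr = k.repeat_interleave(q_per_kv, dim=1)

assert torch.equal(k_main, k_pr)  # same result, comparable cost
```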
Additionally, the current code also caches keys and values after they are expanded, which is not very efficient. The code in this PR doesn't do that, but at the same time I have an error with the KV cache, so we shall see whether I can make it work or not.
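Back-of-the-envelope numbers for why caching before the expansion matters (hypothetical TinyLlama-like config, just to show the ratio):

```python
# Hypothetical config: 32 query heads sharing 4 KV groups, head_size 64,
# context length 2048, 2 bytes per element (bf16).
n_head, n_query_groups, head_size, block_size, dtype_bytes = 32, 4, 64, 2048, 2

# Per layer, per sequence: keys + values.
cache_expanded = 2 * n_head * head_size * block_size * dtype_bytes         # cached after expansion (main)
cache_grouped = 2 * n_query_groups * head_size * block_size * dtype_bytes  # cached before expansion (this PR)
print(cache_expanded // cache_grouped)  # 8x smaller KV cache per layer
```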
> This needs to provide conversion logic since all existing checkpoints (from HF, pretrained, or fine-tuned) will be broken. Extending what Adrian is doing in https://github.com/Lightning-AI/lit-gpt/pull/1019.
Yep, agree.
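Roughly, the conversion would need to reorder the rows of each combined QKV weight. A sketch of what that could look like, assuming the legacy layout packs `q_per_kv` query heads followed by one key and one value head per query group (the actual conversion code in the PR may differ):

```python
import torch

def deinterleave_qkv(w: torch.Tensor, n_query_groups: int, q_per_kv: int, head_size: int) -> torch.Tensor:
    """Reorder a legacy interleaved weight ([Q..Q, K, V] per query group) into [Q...Q, K...K, V...V]."""
    n_embd = w.shape[-1]
    # One block per query group: q_per_kv query heads + 1 key head + 1 value head.
    w = w.view(n_query_groups, (q_per_kv + 2) * head_size, n_embd)
    q, k, v = w.split((q_per_kv * head_size, head_size, head_size), dim=1)
    return torch.cat(
        (q.reshape(-1, n_embd), k.reshape(-1, n_embd), v.reshape(-1, n_embd)), dim=0
    )
```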
> This would need to finish in 1-2 weeks to be released together with the `wip` branch, where we are also making breaking changes.
I think I can make it by the end of Friday. 🤞
One recommendation, if we want to merge it at some point in the future.
Since we need to deal with legacy checkpoints and somehow determine whether we need to reassemble them into the "non-interleaved" variant, I think we can rename the linear layer from `attn` to `qkv`.
Plus, it will make the naming much cleaner. Right now the layers for the SelfAttention class are:
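Something like this, sketching the state-dict keys from memory (assuming the combined QKV linear is called `attn` inside the attention block, which is itself called `attn`, and the output projection is `proj`):

```
transformer.h.0.attn.attn.weight
transformer.h.0.attn.proj.weight
```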
In my opinion, this variant looks better:
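Same sketch, with only the rename applied:

```
transformer.h.0.attn.qkv.weight
transformer.h.0.attn.proj.weight
```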
Looks like a win-win situation.
The name is directly inherited from https://github.com/karpathy/nanoGPT/blob/master/model.py#L35. We took the liberty of dropping the convolutional past, the `c_` prefix.
I've just called Andrej, and he doesn't mind if we rename it to `qkv`.
Hope he approves the PR then
Hi there 👋
This PR changes the placement of `query`, `key`, and `value` weights in the combined QKV matrix. Intuitively one might assume that the weights in the QKV matrix are combined in sequential order $[Q,Q,...,K,K,...,V,V,...]$, but in fact they are currently placed in an interleaved one $[Q,K,V,Q,K,V,...,Q,K,V]$. I believe this placement was introduced by models that used GPTNeoX MLP layers (Pythia and Falcon), but none of the latest models use such an interleaved placement. That means that all existing checkpoints (converted from HF, pretrained, or fine-tuned) are affected and need their QKV weights reassembled.
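To make the two layouts concrete, a toy illustration (made-up example with 3 heads, not code from the PR):

```python
# Toy model with 3 attention heads; one row per head and per projection.
interleaved = ["q0", "k0", "v0", "q1", "k1", "v1", "q2", "k2", "v2"]  # current main
sequential = interleaved[0::3] + interleaved[1::3] + interleaved[2::3]  # this PR
print(sequential)  # ['q0', 'q1', 'q2', 'k0', 'k1', 'k2', 'v0', 'v1', 'v2']
```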
Benchmarks
Benchmarks were done on a `1xA10G` with the code from the main and this PR branches.

Train was done 5 times for each model/branch and the results were averaged. Args: 1 epoch with an epoch size of 1k; other args were left at their defaults.

Generate was done 100 times, and Generate with `max_new_tokens=1950` was done 10 times.

Models: two variants with 1 billion parameters, simply to speed up the benchmarks. Pythia represents a model with multi-head attention, while TinyLlama represents grouped-query attention. For multi-query attention there should not be any significant difference.
In training mode the PR version is slightly faster, but for TinyLlama the VRAM consumption was slightly larger, oddly enough. I tried `unsqueeze(...).expand(...).reshape(...)`, but the VRAM consumption stayed the same.

In inference mode the PR version is again faster, but the biggest difference is in VRAM consumption, since we don't need to cache KV after the expansion, as is done in the current main.