theaerotoad opened 4 months ago
Any progress?
I find that grp-attn-w and grp-attn-n are not included in llama.h.
Perhaps help from the llama.cpp side would be ideal. Any other ideas?
https://github.com/ggerganov/llama.cpp/pull/4815#issuecomment-1985558535
https://github.com/abetlen/llama-cpp-python/pull/1090
This is a PR for this feature, but it cannot work because grp-attn-w and grp-attn-n are not included in llama.h.
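In the meantime, it's easy to confirm this from Python, since the low-level bindings in llama-cpp-python are generated straight from llama.h. A quick check (assuming a reasonably recent install where the low-level symbols are re-exported at package level):

```python
import llama_cpp

# The low-level bindings mirror llama.h, so if grp-attn-n / grp-attn-w were
# part of the library API, a matching symbol would show up here:
print([name for name in dir(llama_cpp) if "grp_attn" in name.lower()])
# -> [] as of the versions discussed in this thread
```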
Right: it looks like both main.cpp and server.cpp implement self-extend without going through anything exposed in llama.h. I think the simplest implementation of it appears in passkey.cpp. Something like:
```cpp
...
// fill the KV cache
for (int i = 0; i < n_ctx; i += n_batch) {
    if (i > 0 && n_grp > 1) {
        // if SelfExtend is enabled, we compress the position from the last batch by a factor of n_grp
        const int ib = i/n_batch - 1;
        const int bd = n_batch_grp*(n_grp - 1);

        llama_kv_cache_seq_add(ctx, 0, n_past - n_batch,         n_past,         ib*bd);
        llama_kv_cache_seq_div(ctx, 0, n_past - n_batch + ib*bd, n_past + ib*bd, n_grp);
        llama_kv_cache_update(ctx);
    }
    ...
}
```
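For what it's worth, the same calls should already be reachable from Python through the low-level ctypes bindings, since those track llama.h. A rough, untested sketch of the equivalent step, assuming your build exposes llama_kv_cache_seq_add / llama_kv_cache_seq_div / llama_kv_cache_update and that you can get at the raw llama_context pointer (an internal detail of the high-level Llama wrapper); ctx, i, n_past, n_batch, n_grp, and n_batch_grp mean the same as in the C++ above:

```python
import llama_cpp

def compress_last_batch(ctx, i, n_past, n_batch, n_grp, n_batch_grp):
    # Mirror of the passkey.cpp step quoted above: shift the positions of
    # the most recent batch, then divide them by the group factor n_grp.
    ib = i // n_batch - 1
    bd = n_batch_grp * (n_grp - 1)

    # shift positions in [n_past - n_batch, n_past) by ib*bd
    llama_cpp.llama_kv_cache_seq_add(ctx, 0, n_past - n_batch, n_past, ib * bd)
    # divide positions in the shifted range by n_grp
    llama_cpp.llama_kv_cache_seq_div(ctx, 0, n_past - n_batch + ib * bd, n_past + ib * bd, n_grp)
    # apply the pending position shift to the cache
    llama_cpp.llama_kv_cache_update(ctx)
```

That still leaves the harder part: hooking this into the wrapper's decode loop at the right moment, which is exactly what main.cpp and server.cpp do on the C++ side.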
I've spent some time looking through the llama.cpp-python routines, but couldn't find the equivalent place, i.e. what happens when you exceed the current cache.
It looks like ggerganov may be tackling this in the issue @sweetcard linked above. Maybe that's the faster route.
Any update here? 😄
I've been really enjoying using both llama.cpp-python and the original llama.cpp. These are amazing developments, especially for folks without massively powerful GPUs.

There's a really nice feature that was implemented in llama.cpp in January to allow self-extend (à la LongLLM's approach). It works well in llama.cpp's main.cpp as well as server.cpp, and plenty of folks have noted self-extend is especially useful with Mistral/Mixtral, Gemma, and Phi 2.

It appears someone else might have been asking about this earlier here. Right now, I'm having to move in and out of Python when I want to run summarization on a 'just-slightly-too-long' article with self-extend. Would you consider implementing self-extend as an option in llama.cpp-python?
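For anyone landing here without background on the technique: self-extend doesn't touch the model weights at all; it only remaps positions so that tokens beyond a neighbor window share grouped positions. A toy sketch of that idea (my own simplification, not llama.cpp's exact arithmetic; n_grp and w play the roles of the --grp-attn-n and --grp-attn-w flags):

```python
def self_extend_position(pos: int, n_grp: int, w: int) -> int:
    # Tokens inside the neighbor window keep their exact positions;
    # older tokens share one position per group of n_grp.
    if pos < w:
        return pos
    return w + (pos - w) // n_grp

# With n_grp=4 and w=512, raw position 2048 maps to 512 + 1536 // 4 = 896,
# i.e. back inside a window the model was trained on.
print(self_extend_position(2048, n_grp=4, w=512))  # -> 896
```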