abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Add Self-Extend support? #1242

Open theaerotoad opened 4 months ago

theaerotoad commented 4 months ago

I've been really enjoying using both llama-cpp-python and the original llama.cpp. These are amazing developments, especially for folks without massively powerful GPUs.

There's a really nice feature that was implemented in llama.cpp in January to allow self-extend (a la LongLM's approach). It works well in llama.cpp's main.cpp as well as server.cpp, and plenty of folks have noted self-extend is especially useful with Mistral/Mixtral, Gemma, and Phi-2.
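
For context, the core of the LongLM trick is just a remapping of relative positions: tokens inside a small neighbor window keep their exact relative positions, while more distant tokens get their positions compressed by integer division with a group size, so the model never sees a relative position larger than it was trained on. A toy sketch of the idea in Python (my own illustration, not the llama.cpp code, and omitting the offset the real implementations use to stitch the two ranges together):

    def effective_rel_pos(i: int, j: int, group: int = 4, window: int = 512) -> int:
        """Toy illustration: effective relative position between query i and key j (i >= j)."""
        if i - j < window:
            return i - j                    # neighbor tokens: normal attention
        return i // group - j // group      # distant tokens: grouped ("compressed") positions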

It appears someone else may have asked about this earlier here. Right now, I'm having to move in and out of Python when I want to run summarization on a 'just-slightly-too-long' article with self-extend. Would you consider implementing self-extend as an option in llama-cpp-python?

sweetcard commented 4 months ago

Any progress?

sweetcard commented 4 months ago

I find that grp-attn-w and grp-attn-n are not included in llama.h.

Maybe help from upstream llama.cpp would be the cleanest way forward. Any other ideas?

https://github.com/ggerganov/llama.cpp/pull/4815#issuecomment-1985558535

sweetcard commented 4 months ago

https://github.com/abetlen/llama-cpp-python/pull/1090

This is a PR for this feature, but it cannot work because grp-attn-w and grp-attn-n are not included in llama.h.

theaerotoad commented 4 months ago

Right--it looks like both main.cpp and server.cpp implement self-extend without going through anything exposed in llama.h. I think the simplest implementation of it appears in passkey.cpp.

Something like:

    ...
    // fill the KV cache
    for (int i = 0; i < n_ctx; i += n_batch) {
        if (i > 0 && n_grp > 1) {
            // if SelfExtend is enabled, we compress the position from the last batch by a factor of n_grp
            const int ib = i/n_batch - 1;
            const int bd = n_batch_grp*(n_grp - 1);

            // shift the last batch's positions, integer-divide them by n_grp,
            // then apply the pending K-shift to the KV cache
            llama_kv_cache_seq_add (ctx, 0, n_past - n_batch,         n_past,         ib*bd);
            llama_kv_cache_seq_div (ctx, 0, n_past - n_batch + ib*bd, n_past + ib*bd, n_grp);
            llama_kv_cache_update  (ctx);
            ...

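For what it's worth, here's a rough sketch of how that same position-compression step might look from Python through the low-level bindings. This assumes the llama_cpp module mirrors the llama.h functions passkey.cpp uses (llama_kv_cache_seq_add, llama_kv_cache_seq_div, llama_kv_cache_update, llama_kv_cache_seq_pos_max) and that ctx is a low-level llama_context_p; the grp-attn parameters themselves still aren't exposed anywhere, so this only mimics the cache-side compression:

    # rough sketch only -- the helper name and wiring are mine, not part of the library
    import llama_cpp

    def compress_last_batch(ctx, n_past: int, n_batch: int, n_grp: int, i: int) -> int:
        """Compress the positions of the last decoded batch by a factor of n_grp."""
        ib = i // n_batch - 1
        bd = (n_batch // n_grp) * (n_grp - 1)  # n_batch_grp * (n_grp - 1)

        # shift the last batch's positions, integer-divide them by n_grp,
        # then apply the pending K-shift to the KV cache
        llama_cpp.llama_kv_cache_seq_add(ctx, 0, n_past - n_batch, n_past, ib * bd)
        llama_cpp.llama_kv_cache_seq_div(ctx, 0, n_past - n_batch + ib * bd, n_past + ib * bd, n_grp)
        llama_cpp.llama_kv_cache_update(ctx)

        # passkey.cpp then refreshes n_past from the cache
        return llama_cpp.llama_kv_cache_seq_pos_max(ctx, 0) + 1
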
I've spent some time looking through the llama-cpp-python routines, but couldn't find the equivalent place where the bindings handle exceeding the current cache.

It looks like ggerganov may be tackling this in the issue @sweetcard linked above. Maybe that's the faster route.

sweetcard commented 3 months ago

Any update here? 😄