ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support Self extend for server (main is already supported) #4886

Closed duykhanhbk closed 6 months ago

duykhanhbk commented 6 months ago

Self-extend is now supported for main: https://github.com/ggerganov/llama.cpp/pull/4815 (paper: https://arxiv.org/pdf/2401.01325.pdf). It would be great if it were also supported for the server; any guidance or support is welcome!

x4080 commented 6 months ago

yes please

ggerganov commented 6 months ago

I'll look into implementing it. In the meantime, do you observe positive results when using main? The llama.cpp implementation tries to follow the one from the paper, but it is not exactly the same as it applies changes (shifts) to the KV cache instead of recomputing the RoPE. It should be similar (if not the same), though I still have some doubts. My quick tests indicate that it seems to work, but I don't want to make any conclusions

x4080 commented 6 months ago

@ggerganov I'm happy to announce that the problem with dolphin phi on long contexts is solved now, and it is even better with group attention. I also tried another model and it makes a difference there too, so I think it works.

I just don't understand the relationship between the flags

-c 4096 --grp-attn-n 4 --grp-attn-w 1024

and how to calculate the desired context. Could you give some enlightenment here? 😄

duykhanhbk commented 6 months ago

> I'll look into implementing it. In the meantime, do you observe positive results when using main? The llama.cpp implementation tries to follow the one from the paper, but it is not exactly the same as it applies changes (shifts) to the KV cache instead of recomputing the RoPE. It should be similar (if not the same), though I still have some doubts. My quick tests indicate that it seems to work, but I don't want to make any conclusions

@ggerganov @x4080 Yes, I tested with SeaLLM 7B chat (Llama 2 architecture) and extended the context to 16k and 26k; the results are quite good and look promising. I will test with Mistral, Phi, etc. to see how they do. I have the same question as @x4080: how do you calculate the desired context? Could you give some enlightenment here, @ggerganov? It's a bit different from https://github.com/datamllab/LongLM

ggerganov commented 6 months ago

First, you set -c to the context that you want to achieve - let's say -c 8192.

Next, given that the original training context of the model is T (let's assume T = 2048), you want to set G >= 8192 / T, so in this case: --grp-attn-n 4 or --grp-attn-n 8.

The --grp-attn-w corresponds to W from the paper. I think the authors generally used 512, but you should be able to go up to T/2 - so in this case --grp-attn-w 1024.

Additionally, W has to be a multiple of G.
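
As a concrete example following these rules (the model file name here is just a placeholder), reaching a context of 8192 with a model trained on T = 2048 could look like:

./main -m model.gguf -c 8192 --grp-attn-n 4 --grp-attn-w 1024

Here G = 4 satisfies G >= 8192 / T, and W = 1024 = T/2 is a multiple of G.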

x4080 commented 6 months ago

@ggerganov Thanks for detailed answer

duykhanhbk commented 6 months ago

Is there any update on implementing self-extend for the server @ggerganov?

ggerganov commented 6 months ago

If someone wants to give it a try - go ahead. When I get to this, I will assign myself to the issue - for now there are other priorities.

duykhanhbk commented 6 months ago

There is a pull request for this issue: https://github.com/ggerganov/llama.cpp/pull/4963. Please check it, @ggerganov!

ggerganov commented 6 months ago

It's just adding the cmd-line arguments - there is no actual implementation

duykhanhbk commented 6 months ago

Yes, I see that too. Waiting for the actual implementation!

Josh-XT commented 6 months ago

Sorry, I started doing it by following what was done for main in the server example, but then my kid made a big mess, I didn't get to finish, and I haven't had a chance to get back to it yet. Hopefully someone will have time to do it before I do.

Maximilian-Winter commented 6 months ago

@ggerganov Is it OK if I take this? I will start working on it now.

Maximilian-Winter commented 6 months ago

I looked at your main implementation and it looks doable for me. Is there anything I need to look out for?

ggerganov commented 6 months ago

I suppose it would be a good idea to put this code behind a llama.h function to avoid these raw computations being copy-pasted. I haven't thought deeply about the API, but maybe you can give it a try.
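
For illustration, here is a rough sketch of what such a helper could look like if the raw self-extend computations from the main example were lifted behind one function. The function name and exact signature are made up for this sketch; the KV-cache calls and the arithmetic mirror the ones used by the main example at the time.

```cpp
#include "llama.h"

// Hypothetical helper wrapping the self-extend (group attention) KV-cache shifts
// from examples/main, so they don't have to be copy-pasted into other examples.
// n_past is the number of tokens currently in the cache and ga_i is the group
// attention state; both are updated in place.
static void llama_ext_group_attn_shift(
        struct llama_context * ctx,
        llama_seq_id           seq_id,
        int &                  n_past,
        int &                  ga_i,
        int                    ga_n,    // --grp-attn-n (G)
        int                    ga_w) {  // --grp-attn-w (W), a multiple of G
    while (n_past >= ga_i + ga_w) {
        const int ib = (ga_n*ga_i)/ga_w;
        const int bd = (ga_w/ga_n)*(ga_n - 1);
        const int dd = (ga_w/ga_n) - ib*bd - ga_w;

        // shift the tail of the cache forward, compress one window of W positions
        // by a factor of G, then shift the remainder so positions stay contiguous
        llama_kv_cache_seq_shift(ctx, seq_id, ga_i,                n_past,              ib*bd);
        llama_kv_cache_seq_div  (ctx, seq_id, ga_i + ib*bd,        ga_i + ib*bd + ga_w, ga_n);
        llama_kv_cache_seq_shift(ctx, seq_id, ga_i + ib*bd + ga_w, n_past + ib*bd,      dd);

        n_past -= bd;
        ga_i   += ga_w/ga_n;
    }
}
```

The server would then only need to call such a helper once per decoded batch instead of duplicating the shift arithmetic from main.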

Maximilian-Winter commented 6 months ago

@ggerganov Well, actually I just copy-pasted it. But I will refactor it once I know I'm doing this the correct way!

Maximilian-Winter commented 6 months ago

I finished porting self-extend to the server. It is in this pull request: https://github.com/ggerganov/llama.cpp/pull/5104

duykhanhbk commented 6 months ago

Hi @Maximilian-Winter, first of all thanks for your work. But I have found a problem with the KV cache: https://github.com/ggerganov/llama.cpp/pull/5104#issuecomment-1907347414

Green-Sky commented 6 months ago

Closing as completed.

(@duykhanhbk I found the same issue, but self-extend is not the cause.)

x4080 commented 5 months ago

Hi, I found that using the server with --grp-attn-n can stop inference prematurely in some situations. I tested the server with and without it, and without --grp-attn-n it works flawlessly. I then tried the non-server path (command-line inference with group attention) and it works fine, so maybe there's a problem with the implementation in the server?

phymbert commented 5 months ago

@x4080 It would really help if you added a scenario covering group-attention self-extend to the server test framework.

x4080 commented 5 months ago

@phymbert Thanks for replying. I'm currently using my own fine-tuned model with my private data; what I do with the model is translate text into another language. I know it's difficult to fix things without reproducible evidence, so maybe I can find another example with a public model - I'll share it.