Closed: duykhanhbk closed this issue 6 months ago.
yes please
I'll look into implementing it. In the meantime, do you observe positive results when using `main`? The llama.cpp implementation tries to follow the one from the paper, but it is not exactly the same: it applies changes (shifts) to the KV cache instead of recomputing the RoPE. The result should be similar (if not identical), though I still have some doubts. My quick tests indicate that it works, but I don't want to draw any conclusions yet.
@ggerganov I'm happy to announce that the problem with dolphin phi with long context is solved now, and it's even better with group attention. I also tried the other model and it makes a difference, so I think it works.
I just don't understand the relationship between `-c 4096 --grp-attn-n 4 --grp-attn-w 1024` and the desired context. How do I calculate it? Could you give some enlightenment here? 😄
@ggerganov @x4080 yes, I tested with SeaLLM 7b chat (Llama2 architecture) and extended the context to 16k and 26k; the results are also quite good and look promising. I will test with mistral, phi, etc. to see how it works. And I have the same question as @x4080: how to calculate the desired context? Could you give some enlightenment here @ggerganov? It's a bit different from here: https://github.com/datamllab/LongLM
First, you set `-c` to the context that you want to achieve - let's say `-c 8192`.

Next, given that the original training context of the model is `T` (let's assume `T = 2048`), you want to set `G >= 8192 / T`, so in this case: `--grp-attn-n 4` or `--grp-attn-n 8`.

The `--grp-attn-w` corresponds to `W` from the paper. I think the authors generally used `512`, but I think you can go up to `T/2` - so in this case `--grp-attn-w 1024`.

Additionally, `W` has to be a multiple of `G`.
@ggerganov Thanks for the detailed answer.
Is there any update on implementing self-extend for the server @ggerganov?
If someone wants to give it a try - go ahead. When I get to it, I will assign myself to the issue - for now, other priorities.
There is a pull request for this issue: https://github.com/ggerganov/llama.cpp/pull/4963. Please check it, @ggerganov!
It's just adding the cmd-line arguments - there is no actual implementation
Yes, I see that too. Waiting for the actual implementation!
Sorry, I started doing it on the server example, following what was done for main, but then my kid made a big mess and I didn't get to finish, and I haven't had a chance to get back to it yet. Hopefully someone will have time to do it before I do.
@ggerganov Is it OK if I take this? I will start working on it now.
I looked at your main implementation and it looks doable to me. Is there anything I need to look out for?
I suppose it would be a good idea to put this code behind a `llama.h` function, to avoid these raw computations being copy-pasted. I haven't thought deeply about the API, but maybe you can give it a try.
@ggerganov Well, actually I just copy-pasted it. But I will refactor it once I know I'm doing it the right way!
I finished porting self extend to the server. It is in this pull request: https://github.com/ggerganov/llama.cpp/pull/5104
Hi @Maximilian-Winter, first of all thanks for your work. But I have found a problem with the KV cache: https://github.com/ggerganov/llama.cpp/pull/5104#issuecomment-1907347414
Closing as completed.
(@duykhanhbk I found the same issue, but self extend is not the cause.)
Hi, I found that when using the server with `--grp-attn-n`, in some situations it stops inference prematurely. I tested the server with and without it, and without `--grp-attn-n` it works flawlessly. Then I tried non-server inference (cmd-line with grp-attn) and it works fine, so maybe there's a problem with the server implementation?
@x4080 it would really help if you added a scenario covering group attention self extension using the server test framework.
@phymbert thanks for replying. I'm currently using my own fine-tuned model with my private data; what I do with the model is translate into another language. I know it's difficult to fix things without reproducible evidence, so maybe I can find another example with a public model - I'll share it.
Self extend is now supported for `main`: https://github.com/ggerganov/llama.cpp/pull/4815 (paper: https://arxiv.org/pdf/2401.01325.pdf). It would be great if it were also supported for the server - any guidance or support is welcome!