Amadeus-AI opened 4 months ago
I can confirm this. I tried asking it to summarize an article in Italian. Everything is fine until it hits the 2000-token wall; after that it outputs garbage. The model uses sliding window attention with a window of 2048 tokens, which might be related.
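For context, sliding window attention restricts each token to attending only to the most recent tokens inside a fixed window, rather than the full causal prefix. A minimal sketch of the mask (the small numbers are purely illustrative; Phi-3-mini's window is 2048):

```python
import numpy as np

def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    """Causal attention mask where each token may attend only to
    itself and the previous `window - 1` tokens."""
    i = np.arange(n_tokens)[:, None]  # query positions
    j = np.arange(n_tokens)[None, :]  # key positions
    return (j <= i) & (j > i - window)

# With a window of 4, token 5 can no longer see tokens 0 and 1:
print(sliding_window_mask(6, 4)[5].tolist())
# [False, False, True, True, True, True]
```

If an implementation applies plain causal attention instead of this mask, output degrades once the sequence grows past the window size, which would match the ~2000-token breakdown described here.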
Can you try 6369bf04336ab60e5c892dd77a3246df91015147 and 201cc11afa0a1950e1f632390b2ac6c937a0d8f0 to see if there's a difference? The first one should work fine, the second should break.
@Galunid version: 2960 (6369bf04) built with IntelLLVM 2024.1.0
Still breaks.
I can reproduce it; there seems to be some issue with the initial implementation in #6852.
It's most likely the missing sliding window, as pointed out earlier
Same issue here but with Llama3 8B on an RTX 4090 with CUDA, and it also completely breaks the server.
When one generation goes beyond the context limit, all subsequent completions fail, even from new /completion calls.
I can reproduce it this way:
What is 2+2
Result: the loop starts outputting garbage, and the conversation with the long prompt returns garbage as well.
My server is running with CUDA on commit 3d7ebf6, tag b3075 from yesterday. I already had this issue on a build from last week (I don't know the exact commit); I updated in the hope of fixing it, but it didn't work.
I'm launching the server with this command:
```
export LLAMA_CUDA=1
./build/bin/server -m ../llama3/Meta-Llama-3-8B-Instruct-Q6_K.gguf --ctx-size 2048 --n-gpu-layers 9999 --host 0.0.0.0 --port 3300 --timeout 10 --n-predict 500 --parallel 1
```
A screenshot of the result after context corruption. My other prompt, running in a loop, is about addresses; even though it is no longer running, the context has leaked, and the model keeps outputting garbage until I restart the server.
I also tested with `--ctx-size=0`, which falls back to 8192, and after a (lengthy) conversation it broke too, with exactly the same behavior: the context from one conversation now leaks into all other conversations.
I just tried with the version suggested above, and it does not work either
version: 2960 (https://github.com/ggerganov/llama.cpp/commit/6369bf04336ab60e5c892dd77a3246df91015147)
The behavior is the same, but it is a lot slower on long contexts (it looks like a caching mechanism was added since then, but that is unlikely to be related).
I have the same problem with the latest commit on master, 1442677f92e.
@jggc This topic relates to the Phi-3 model, which degrades in quality before it runs out of context, so I marked your comments as off-topic. Quality degradation after you run out of context is expected, and from what I understand that is the case here.
Indeed, my behavior is slightly different, but it is still degradation WITHIN the context length. I posted in this thread instead of opening a new issue since it had enough similarities that I thought it might be related.
I'll rephrase to make things clearer:
At this point, no matter what I do I won't get sensible responses until I restart the server.
@Galunid Let me know if I should open a new bug. It is reproducible, I could write a gist.
Interestingly, phi-3-small uses a combination of sliding window attention and block sparse attention. So even if we get a hack for sliding window attention (also used by Gemma 2), it will still be messy to properly support phi-3.
Link to paper: https://arxiv.org/pdf/2404.14219
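To illustrate why a sliding-window-only hack wouldn't be enough: here is a purely illustrative combination of a causal sliding-window mask with a simple block-sparse pattern that keeps every other block of keys. The actual phi-3-small layout is defined in the paper linked above; the block/stride pattern below is an assumption for demonstration only.

```python
import numpy as np

def blocksparse_sliding_mask(n_tokens: int, window: int,
                             block: int, stride: int) -> np.ndarray:
    """Illustrative mask: causal AND (inside sliding window OR in a
    kept key block). NOT the real phi-3-small pattern, just a sketch
    of how the two sparsity schemes compose."""
    i = np.arange(n_tokens)[:, None]  # query positions
    j = np.arange(n_tokens)[None, :]  # key positions
    causal = j <= i
    local = j > i - window                   # sliding-window part
    kept_block = (j // block) % stride == 0  # block-sparse part
    return causal & (local | kept_block)
```

The key point: tokens that have fallen out of the sliding window can still be attended to if they sit in a kept block, so the KV cache cannot simply discard entries older than the window, as a pure sliding-window hack would.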
ONNX runtime has the same bug. This might be a reference for us if they can fix it. https://github.com/microsoft/onnxruntime-genai/issues/552
Commenting to see if there has been an update/solution to this before it gets closed for inactivity. We've faced this issue for a month now, and using the 128K-context models is problematic with the available hardware.
What happened?
To reproduce: download the officially released GGUF model from huggingface/microsoft, then run `server.exe -m Phi3-mini-4k.gguf -c 4096`.
When the input prompt is < ~2048 tokens: output is fine (but starts getting weird right after the total hits ~2048). When the input prompt is > ~2048 tokens: output is weird.
The weird output looks like what we'd expect when the context exceeds what the model supports, but it happens at ~2048, which suggests a bug.
Also tested Llama3-8B: it works fine with input prompts < 8192 (with `-c 8192`) and < 4096 (with `-c 4096`), as expected.
Name and Version
version: 3015 (74b239b3) built with MSVC 19.39.33523.0 for x64
Tried both the CUDA and AVX2 versions.
Also tried the latest version, built myself with Intel SYCL: version 3075 (3d7ebf63) built with IntelLLVM 2024.1.0.
What operating system are you seeing the problem on?
Win10, Win11
Relevant log output
Before ~2000 tokens and after