ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Phi-3 4K output broken after ~2000 tokens (Reproducible) #7709

Open · Amadeus-AI opened this issue 4 months ago

Amadeus-AI commented 4 months ago

What happened?

To reproduce: download the officially released GGUF model from huggingface/microsoft and run server.exe -m Phi3-mini-4k.gguf -c 4096

When the input prompt is < ~2048 tokens: output is fine at first, but starts getting weird right after the total hits ~2048. When the input prompt is > ~2048 tokens: output is weird from the start.

The weird output looks like what we would expect when the context exceeds what the model supports, but it happens at ~2048, which suggests there is a bug.

Also tested Llama3-8B: it works fine with input prompts < 8192 as expected (with -c 8192), and also works fine with input prompts < 4096 as expected (with -c 4096).
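
For reference, a minimal way to drive the server past the failure point from the command line could look like this (a sketch: the prompt placeholder is illustrative and 8080 is the server's default port):

# send a prompt longer than ~2048 tokens to the server started above
curl http://localhost:8080/completion -H "Content-Type: application/json" \
  -d '{"prompt": "<paste any text longer than ~2048 tokens here>", "n_predict": 256}'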

Name and Version

version: 3015 (74b239b3) built with MSVC 19.39.33523.0 for x64

Tried both the CUDA and AVX2 builds.

Also tried the latest version, built myself with Intel SYCL: version 3075 (3d7ebf63) built with IntelLLVM 2024.1.0

What operating system are you seeing the problem on?

Win10, Win11

Relevant log output

Before ~2000 tokens and after: (screenshot)

matteoserva commented 4 months ago

I can confirm this. I asked it to summarize an article in Italian. Everything is fine until it hits the ~2000-token wall; after that it outputs garbage. The model uses a sliding window attention of 2048 tokens, which might be related.

Galunid commented 4 months ago

Can you try 6369bf04336ab60e5c892dd77a3246df91015147 and 201cc11afa0a1950e1f632390b2ac6c937a0d8f0 to see if there's a difference? The first one should work alright, the second should break.
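
Something along these lines should do it (a sketch assuming a plain CMake build; add your usual CUDA/SYCL flags):

git checkout 6369bf04336ab60e5c892dd77a3246df91015147   # expected to work
cmake -B build && cmake --build build --config Release
git checkout 201cc11afa0a1950e1f632390b2ac6c937a0d8f0   # expected to break
cmake -B build && cmake --build build --config Release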

Amadeus-AI commented 4 months ago

@Galunid version: 2960 (6369bf04) built with IntelLLVM 2024.1.0

Still breaks

Galunid commented 4 months ago

I can reproduce this; it seems there's some issue with the initial implementation in #6852

ggerganov commented 4 months ago

It's most likely the missing sliding window, as pointed out earlier

jggc commented 4 months ago

Same issue here but with Llama3 8B on an RTX 4090 with CUDA, and it also completely breaks the server.

When one generation goes beyond the context limit, all subsequent completions fail, even from new /completion calls.

I can reproduce this way (a scripted version of step 1 follows below):

  1. Launch a loop that calls /completion with a long-ish prompt (~1k tokens) and asks a short question at the end like "What is 2+2"
    1. These calls work forever on their own, as they don't go over the context length
  2. Open the web interface and have a single conversation that goes over the context length

Result: the loop will start outputting garbage, and the conversation with the long prompt will return garbage as well.
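
A rough scripted version of step 1 (the prompt placeholder is illustrative; the port matches the launch command below):

# hit /completion repeatedly with a ~1k-token prompt plus a short question;
# on its own this loop stays within the context limit and keeps working
while true; do
  curl -s http://localhost:3300/completion -H "Content-Type: application/json" \
    -d '{"prompt": "<~1k tokens of context> What is 2+2?", "n_predict": 32}'
  sleep 1
done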

My server is running with CUDA on commit 3d7ebf6, tag b3075, from yesterday. I already had this issue on a build from last week (I don't know the exact commit). I tried updating in the hope of fixing it, but it didn't help.

I'm launching the server with this command:

export LLAMA_CUDA=1
./build/bin/server -m ../llama3/Meta-Llama-3-8B-Instruct-Q6_K.gguf --ctx-size 2048 --n-gpu-layers 9999 --host 0.0.0.0 --port 3300 --timeout 10 --n-predict 500 --parallel 1

A screenshot of the result after the context corruption is below. My other prompt running in a loop is about addresses, and even though it is not running anymore, its context has leaked and the model will keep outputting garbage until I restart the server.

(screenshot)

I also tested with --ctx-size=0, which falls back to 8192, and after a (lengthy) conversation it broke too, with the exact same behavior: the context from one conversation leaks into all other conversations.

jggc commented 4 months ago

I just tried with the version suggested above, and it does not work either

version: 2960 (https://github.com/ggerganov/llama.cpp/commit/6369bf04336ab60e5c892dd77a3246df91015147)

The behavior is the same, but it is a lot slower on long contexts (it looks like a caching mechanism was added since then, though that is unlikely to be related).

I have the same problem with the latest commit on master, 1442677f92e.

Galunid commented 4 months ago

@jggc This issue is about the Phi-3 model, which shows quality degradation before it runs out of context, so I marked your comments as off-topic. Quality degradation after you run out of context is expected, and from what I understood, that is the case here.

jggc commented 4 months ago

Indeed, my behavior is slightly different, but it is still degradation WITHIN the context length. I posted in this thread instead of opening a new issue since it had enough similarities that I thought it might be related.

I'll rephrase to make things clearer:

  1. Start the server
  2. Call /completion with a short prompt such as "What is 2+2"
    1. Response is OK
  3. Call /completion with a long prompt exceeding the context length
    1. It fails and generates garbage, as expected in this case
  4. Call /completion with a short prompt again, "What is 2+2"
    1. Get garbage output; this is not expected. The server's model state should not be broken after a single prompt exceeded the context length in the session.

At this point, no matter what I do I won't get sensible responses until I restart the server.

@Galunid Let me know if I should open a new bug. It is reproducible; I could write a gist (roughly sketched below).
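
Roughly, such a gist would just be three calls in sequence (hypothetical prompts; port as in my launch command above):

# 1. short prompt: response is OK
curl http://localhost:3300/completion -H "Content-Type: application/json" -d '{"prompt": "What is 2+2?", "n_predict": 32}'
# 2. prompt longer than --ctx-size: garbage, as expected
curl http://localhost:3300/completion -H "Content-Type: application/json" -d '{"prompt": "<text longer than 2048 tokens>", "n_predict": 32}'
# 3. short prompt again: garbage, which should not happen
curl http://localhost:3300/completion -H "Content-Type: application/json" -d '{"prompt": "What is 2+2?", "n_predict": 32}'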

ngxson commented 3 months ago

Interestingly, phi-3-small uses a combination of sliding window + blocksparse attention. So even if we get a hack for sliding window attention (which is used by Gemma 2), it will still be messy if we want proper support for phi-3.

Link to paper: https://arxiv.org/pdf/2404.14219


njsyw1997 commented 2 months ago

ONNX Runtime has the same bug. Their issue might be a useful reference for us if they fix it: https://github.com/microsoft/onnxruntime-genai/issues/552

CASE-R commented 2 months ago

Commenting to see if there has been an update or solution to this before it gets closed for inactivity. We've faced this issue for a month now, and using the 128K-context models is problematic due to our available hardware.