ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: phi 3.5 mini produces garbage past 4096 context #9127

Open patw opened 3 weeks ago

patw commented 3 weeks ago

What happened?

Phi 3.5 mini doesn't produce <|end|> or <|endoftext|> when the context is set higher than 4096; it just emits endless garbage tokens. Possibly a RoPE scaling issue?
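
For anyone trying to reproduce, a minimal sketch (the model filename and port are placeholders; any recent Phi 3.5 mini GGUF should show it):

```sh
# Context above 4096: the model never emits <|end|>/<|endoftext|> and
# degenerates into garbage tokens.
./llama-server -m ./Phi-3.5-mini-instruct-Q4_K_M.gguf -c 8192 --port 8080

# Control: the same model at the default 4096 context behaves normally.
./llama-server -m ./Phi-3.5-mini-instruct-Q4_K_M.gguf -c 4096 --port 8080

# If it really is a RoPE scaling issue, overriding with --rope-scaling /
# --rope-freq-scale might be worth experimenting with.
```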

Name and Version

llama-server, recent compile

What operating system are you seeing the problem on?

No response

Relevant log output

No response

themanyone commented 3 weeks ago

For conversation, the server is working fine with Phi 3.5 quantized to 4 bits. But after a while it started outputting tons of blank lines and garbage when told to make a simple HTML page. Hitting the [Reset] button on the chat server's Gradio page (localhost:8080) fixed it for now. It makes great web pages.

The only thing I can guess is that unusual prompt formats from using other models somehow corrupted the chat history. But I have no way to look into the (now cleared) history to check. Will keep testing!

bartowski1182 commented 3 weeks ago

Are you using flash attention or not? I've seen that without flash attention the output is garbage, but with it, it's coherent.

patw commented 2 weeks ago

I found that with -fa turned on it ran super slow and was still outputting garbage. Right now it's off and stable at 4096.
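
For reference, the two configurations I compared (paths are placeholders):

```sh
# With flash attention: very slow here, and still garbage output.
./llama-server -m ./Phi-3.5-mini-instruct-Q4_K_M.gguf -c 8192 -fa

# Without flash attention: stable, but only with context capped at 4096.
./llama-server -m ./Phi-3.5-mini-instruct-Q4_K_M.gguf -c 4096
```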

ThiloteE commented 5 days ago

To do: test whether this is fixed by https://github.com/ggerganov/llama.cpp/pull/9396
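
For anyone who wants to try it, a sketch using the standard GitHub pull-head fetch (assumes a CMake checkout of llama.cpp; the model path is a placeholder):

```sh
# Fetch the PR head into a local branch and rebuild.
git fetch origin pull/9396/head:pr-9396
git checkout pr-9396
cmake -B build && cmake --build build --config Release

# Rerun with a context above 4096 and check whether <|end|> is emitted again.
./build/bin/llama-server -m ./Phi-3.5-mini-instruct-Q4_K_M.gguf -c 8192
```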