Open patw opened 3 weeks ago
For conversation, the server is working fine with phi-3.5 quantized to 4 bits. But after a while it started outputting tons of blank lines and garbage when told to make a simple HTML page. Hitting the [Reset] button on the chat server's Gradio page (localhost:8080) fixed it for now. Now it makes great web pages.
The only thing I can guess is that unusual prompt formats from using other models corrupted the chat history somehow. But I have no way to look into the (now cleared) chat history to see. Will keep testing!
Are you using flash attention or not? I've seen that without flash attention the output is garbage, but with it the output is coherent.
I found that with -fa turned on it was running super slow and still outputting garbage. Right now it's off and stable at 4096.
To do: test whether this is fixed by https://github.com/ggerganov/llama.cpp/pull/9396
What happened?
Phi 3.5 mini doesn't produce <|end|> or <|endoftext|> when the context is set higher than 4096, just endless garbage tokens. Possible rope scale issue?
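To narrow this down, it may help to run the same prompt against each combination of context size and flash attention. Below is a minimal Python sketch that only builds the llama-server command lines for such a test matrix. The model file name and the 8192 context size are assumptions for illustration, and it assumes -fa is a plain on/off switch as used in this thread (check `llama-server --help` for your build).

```python
from itertools import product

# Hypothetical model file name; replace with your actual GGUF path.
MODEL_PATH = "phi-3.5-mini-q4.gguf"


def build_commands(ctx_sizes=(4096, 8192), flash_attn=(False, True), model=MODEL_PATH):
    """Return one llama-server argument list per (context size, flash attention) pair."""
    commands = []
    for ctx, fa in product(ctx_sizes, flash_attn):
        # -m selects the model file, -c sets the context size.
        cmd = ["llama-server", "-m", model, "-c", str(ctx)]
        if fa:
            # Assumption: -fa is a boolean switch, as in the comments above.
            cmd.append("-fa")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands so each one can be started by hand and tested
    # with the same "make a simple HTML page" prompt.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Starting each command in turn and sending the same prompt should show whether the missing <|end|> / <|endoftext|> depends on the context size, on flash attention, or on both.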
Name and Version
llama-server, recent compile
What operating system are you seeing the problem on?
No response
Relevant log output
No response