I can reproduce on 201cc11afa0a1950e1f632390b2ac6c937a0d8f0:
<|begin_of_text|>
>
<|start_header_id|>user<|end_header_id|>
hello llama
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hello, I am i am looking for a little help me<|eot_id|>
>
<|start_header_id|>user<|end_header_id|>
I know llama. I know.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hello, what can you want to help me<|eot_id|>
>
<|start_header_id|>user<|end_header_id|>
<|begin_of_text|>
>
<|start_header_id|>user<|end_header_id|>
hello llama
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I'm not found this is a friendly assistant
I can help me
What are you'request
Hello there.<|eot_id|>
>
<|start_header_id|>user<|end_header_id|>
what can you help with
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I'do<|eot_id|>
>
<|start_header_id|>user<|end_header_id|>
Same for me on this thread: https://github.com/ggerganov/llama.cpp/issues/7450
Yes, I also get gibberish output; older versions like b2953 work normally with the same GGUF file. The latest CUDA build somehow generates gibberish, while the Vulkan build works fine.
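For anyone who wants to narrow this down further, the bNNNN releases are plain git tags, so the good/bad range above can be bisected directly. A sketch, assuming the tags are present in the local clone and that LLAMA_CUDA=1 was the CUDA make flag at this point in history:

git bisect start b2961 b2953    # known-bad build first, then known-good
make clean && make LLAMA_CUDA=1
# run the repro prompt against the resulting binary, then mark the result:
git bisect good    # or: git bisect bad
# repeat until git prints the first bad commit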
I can confirm: after #7225, generation is completely broken. I checked on CPU, a 4090, and a P40, with different models. I tried b2961, tried FORCE_MMQ, with and without FA. Nothing works. It's sad that we don't have proper automated tests.
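Until proper automated tests exist, a quick manual check along these lines can catch this class of regression. This is only a sketch, not a real test: it assumes the binary is named ./llama as in the command further down, and that --temp 0 forces greedy sampling. Note that legitimate floating-point differences between backends can occasionally flip a token, so a diff is a hint rather than proof:

./llama -m model.gguf -p "hello llama" -n 64 -s 1 --temp 0 -ngl 0 > cpu.txt    # CPU only
./llama -m model.gguf -p "hello llama" -n 64 -s 1 --temp 0 -ngl 99 > gpu.txt   # full GPU offload
diff cpu.txt gpu.txt    # broken builds diverge into gibberish almost immediately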
Check if https://github.com/ggerganov/llama.cpp/pull/7452 fixes the issue
Looks good on my end after cherry-picking #7452 into master.
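For anyone who wants to try the same thing before the fix lands, GitHub exposes PR heads as fetchable refs. A sketch, assuming #7452 is a single commit (otherwise cherry-pick the whole range):

git fetch origin pull/7452/head
git cherry-pick FETCH_HEAD
make clean && make LLAMA_CUDA=1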
./llama -m /mnt/ssd/ai/txtgen/models/gguf/neopolita_meta-llama-3-8b-instruct_q6_k.gguf -ngl 100 -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64
----
<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather. Seersucker was originally an Indian cotton fabric, brought to the United States by [[colonial American]]s and [[Native Americans]]. It is typically white or light-colored with a contrasting stripe or check pattern.
Seersucker fabric is known for its unique texture, which is created by interlocking loops of yarn, which
llama_print_timings: load time = 1162.67 ms
llama_print_timings: sample time = 6.00 ms / 64 runs ( 0.09 ms per token, 10663.11 tokens per second)
llama_print_timings: prompt eval time = 125.71 ms / 48 tokens ( 2.62 ms per token, 381.82 tokens per second)
llama_print_timings: eval time = 1552.75 ms / 63 runs ( 24.65 ms per token, 40.57 tokens per second)
llama_print_timings: total time = 1728.66 ms / 111 tokens
I believe this issue is still present in the latest releases. It is more prevalent when multiple people connect to the same instance.
I've now gone back to ecab1c75de68de7c41c254e2ae170d3b07bee6d4 and it works as before, since I really just need the new /health endpoint.
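For reference, the /health endpoint can be probed directly once the server is up; a minimal sketch, assuming the server binary is ./server and listens on the default port 8080:

./server -m model.gguf --port 8080 &
curl -s http://localhost:8080/health    # should report an "ok"-style status once the model is loaded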
On 201cc11a, I get gibberish output trying to sample from Llama-3-8B quantized with Q5_K_M (same behavior with Q8_0, F16, F32, and Q4_K_M). This happens when llama.cpp is built with CUDA support, but not without. I'm building these with Nix (a sketch of the invocation appears at the end of this comment). Here's an example output:
It starts talking about Annapolis, Maryland for some reason, instead of fabric. Other seeds are also nonsense: either gibberish or a nonsensical change of topic. In contrast, the CPU-only build is fine:
It's repeating itself, but at least it makes sense. 6369bf04 (the previous commit) is fine for CUDA (and CPU):
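As mentioned above, the builds come from Nix. A minimal sketch of how the two variants might be produced from the repository's flake; the .#cuda output name is an assumption about the flake's packages, not something confirmed in this thread:

nix build .#default    # CPU-only build
nix build .#cuda       # CUDA build (attribute name assumed)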