Open CJJ-amateur-programmer opened 3 weeks ago
Try the latest llamafile (0.8.6)? I had no problem running Qwen2-1.5B, I assume 7B should be similar
> Try the latest llamafile (0.8.6)? I had no problem running Qwen2-1.5B, I assume 7B should be similar
My fault… I was actually using v0.8.6 but I mistakenly said v0.8.4. I've updated my comment.
Well, I tried again with qwen2-7b-instruct and the bug was still there, replying with endless `@`. Plus, I've downloaded llamafile 0.8.6 again, but unfortunately the results are still no different.
> Try the latest llamafile (0.8.6)? I had no problem running Qwen2-1.5B, I assume 7B should be similar
By the way, could you please show me how you loaded the model? Especially the command-line options, like `--gpu` and `--ctx-size`. Thanks.
I tried again with llama.cpp. Instead of endless `GG`, the response became understandable once I passed the option `-fa`, suggesting that this Qwen2 issue on llamafile might be properly addressed with flash attention. Unfortunately, according to the release page, support for `-fa` is not yet included in llamafile v0.8.6. Hope for an update.
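For reference, the llama.cpp workaround described above looks roughly like this (the binary name and model path are assumptions — recent llama.cpp builds ship `llama-cli`, while older checkouts call the same program `./main`):

```shell
# Sketch of the flash-attention workaround with llama.cpp (not llamafile).
# -fa enables flash attention; -ngl 999 offloads all layers to the GPU.
./llama-cli -m qwen2-7b-instruct-q8_0.gguf -ngl 999 -fa -p "Hello!"
```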
This might be a Windows issue then? But weird if it's only for this model. Maybe try another export to GGUF?
I'm on macOS, but it works without flash attention. Example of what I'm running, using this Q6_K quant:

```
llamafile -m Qwen2-1.5B-Instruct.Q6_K.gguf -ngl 999 -t 0.0 -c 0 --no-display-prompt --repeat-penalty 1.1 -p "<some chatml prompt>" --gpu APPLE
```
I get the expected output, at 440/60 tokens/s for the prompt/eval respectively, on a MacBook Air M3, Sonoma 14.5
> This might be a Windows issue then? But weird if it's only for this model. Maybe try another export to GGUF?
> I'm on macOS, but it works without flash attention. Example of what I'm running, using this Q6_K quant:
> `llamafile -m Qwen2-1.5B-Instruct.Q6_K.gguf -ngl 999 -t 0.0 -c 0 --no-display-prompt --repeat-penalty 1.1 -p "<some chatml prompt>" --gpu APPLE`
> I get the expected output, at 440/60 tokens/s for the prompt/eval respectively, on a MacBook Air M3, Sonoma 14.5
Maybe it's really an issue only related to Windows or CUDA. I tried your command line with only a modification to the `--gpu` option, but the cmd window immediately crashed with an unexpected error and disappeared without a trace.
Managed to reproduce on an AMD machine running Linux with llamafile 0.8.6:

```
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./qwen2-7b-instruct-q8_0.llamafile -ngl 999 --nocompile
```

The output is filled with `@@@@@@`.
The same goes for the 1.5B version:

```
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./qwen2-1_5b-instruct-q5_k_m.llamafile -ngl 999 --nocompile
```
Interestingly, on the same machine there is no such issue with the 0.5B model:

```
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./qwen2-0_5b-instruct-q5_k_m.llamafile -ngl 999 --nocompile
```

Also, there is no issue with any of these models on CPU-based inference.
Still happening with the 0.8.7 release on GPU-based inference. I guess an update is required.
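A CPU-only run (which the comments above report as unaffected) can be forced by keeping all layers off the GPU; `-ngl 0` is the standard llama.cpp/llamafile way to do that. A minimal sketch, assuming the same llamafile from the reproduction above:

```shell
# Hypothetical CPU-only invocation: -ngl 0 keeps every layer on the CPU,
# sidestepping the GPU path where the endless-@ output appears.
./qwen2-7b-instruct-q8_0.llamafile -ngl 0 --nocompile
```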
### Contact Details

No response

### What happened?
I downloaded the GGUF version of Qwen2-7B-Instruct from https://modelscope.cn/api/v1/models/qwen/Qwen2-7B-Instruct-GGUF/repo?Revision=master&FilePath=qwen2-7b-instruct-q8_0.gguf (also available at https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF/resolve/main/qwen2-7b-instruct-q8_0.gguf ), and loaded the model by running:

```
llamafile.exe -m qwen2-7b-instruct-q8_0.gguf --gpu nvidia --port 1202 --host 0.0.0.0 --nobrowser --ctx-size 2048
```
The contents in the terminal implied that nothing was wrong. But when I ran NextChat v2.12.3 and began a conversation with a simple `Hello!`, the response was `@@Hello@@`. With a further query `Introduce Qwen2.`, the model responded with the following content:

```
@Qwen2@@ is@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
```

(manually terminated because it seemed to continue with endless `@`)

I also tried the default webui by deleting the `--nobrowser` option and keeping the default options in the webui. The results were almost the same.

Referring to the changelog of ollama v0.1.42 ( https://github.com/ollama/ollama/releases/tag/v0.1.42 ), I guessed that llamafile should be updated to add the corresponding support.

Configurations:
- System: Windows 11 Professional 22H2 22621.2361, Windows Feature Experience Pack 1000.22674.1000.0
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU 8G
- Memory: 32G
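The behavior can also be checked without NextChat by querying the server's OpenAI-compatible endpoint directly (port 1202 as in the command above; llamafile's server largely ignores the `model` field, so the name here is a placeholder):

```shell
# Minimal request against llamafile's OpenAI-compatible chat API;
# assumes the llamafile.exe server command above is already running.
curl http://127.0.0.1:1202/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2-7b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```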
### Version

llamafile v0.8.6

### What operating system are you seeing the problem on?

No response

### Relevant log output