Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Bug: Incompatible with Newest Qwen 2 #467

Open CJJ-amateur-programmer opened 3 weeks ago

CJJ-amateur-programmer commented 3 weeks ago

Contact Details

No response

What happened?

I downloaded the GGUF version of Qwen2-7B-Instruct from https://modelscope.cn/api/v1/models/qwen/Qwen2-7B-Instruct-GGUF/repo?Revision=master&FilePath=qwen2-7b-instruct-q8_0.gguf (also available at https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF/resolve/main/qwen2-7b-instruct-q8_0.gguf ) and loaded the model by running:

llamafile.exe -m qwen2-7b-instruct-q8_0.gguf --gpu nvidia --port 1202 --host 0.0.0.0 --nobrowser --ctx-size 2048

The terminal output did not indicate anything wrong.

But when I ran NextChat v2.12.3 and began a conversation with a simple Hello!, the response was @@Hello@@. With a further query, Introduce Qwen2., the model responded with the following content: @Qwen2@@ is@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ (manually terminated because it seemed to continue with endless @).

I also tried the default web UI by removing the --nobrowser option and keeping the default settings in the web UI. The results were almost the same.

Referring to the changelog of ollama v0.1.42 ( https://github.com/ollama/ollama/releases/tag/v0.1.42 ), I guess llamafile needs to be updated to add the corresponding support.

Configuration:
System: Windows 11 Professional 22H2 22621.2361, Windows Feature Experience Pack 1000.22674.1000.0
GPU: NVIDIA GeForce RTX 4060 Laptop GPU, 8 GB
RAM: 32 GB

Version

llamafile v0.8.6

What operating system are you seeing the problem on?

No response

Relevant log output

warming up the model with an empty run

llama server listening at http://127.0.0.1:1202
llama server listening at http://10.171.87.232:1202
llama server listening at http://192.168.198.1:1202
llama server listening at http://192.168.186.1:1202
llama server listening at http://26.68.246.204:1202
m3at commented 2 weeks ago

Have you tried the latest llamafile (0.8.6)? I had no problem running Qwen2-1.5B; I assume 7B should be similar.

CJJ-amateur-programmer commented 2 weeks ago

Have you tried the latest llamafile (0.8.6)? I had no problem running Qwen2-1.5B; I assume 7B should be similar.

My fault… I was actually using v0.8.6 but mistakenly said v0.8.4; I've updated my comment. I tried again with qwen2-7b-instruct and the bug was still there, with the model replying with endless @. I've also re-downloaded llamafile 0.8.6, but unfortunately the results are no different.

CJJ-amateur-programmer commented 2 weeks ago

Have you tried the latest llamafile (0.8.6)? I had no problem running Qwen2-1.5B; I assume 7B should be similar.

By the way, could you please show me how you loaded the model, especially the command-line options like --gpu and --ctx-size? Thanks.

CJJ-amateur-programmer commented 2 weeks ago

I tried again with llama.cpp. Instead of endless G's, the response became intelligible once I passed the -fa option, suggesting that this Qwen2 issue in llamafile might be properly addressed by flash attention. Unfortunately, according to the release page, -fa support is not yet included in llamafile v0.8.6. Hoping for an update.
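
For reference, a llama.cpp server invocation with flash attention enabled looks roughly like the following; the binary name and every option other than -fa are assumptions on my part, not the exact command from the report:

REM illustrative llama.cpp run with flash attention enabled via -fa (binary name varies by build, e.g. server.exe in older releases)
llama-server.exe -m qwen2-7b-instruct-q8_0.gguf -ngl 999 -c 2048 -fa --port 1202 --host 0.0.0.0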

m3at commented 2 weeks ago

This might be a Windows issue then? But it's weird if it only affects this model. Maybe try another GGUF export?

I'm on macOS, but it works without flash attention. Here's an example of what I'm running, using this Q6_K quant:

llamafile -m Qwen2-1.5B-Instruct.Q6_K.gguf -ngl 999 -t 0.0 -c 0 --no-display-prompt --repeat-penalty 1.1 -p "<some chatml prompt>" --gpu APPLE

I get the expected output, at 440/60 tokens/s for the prompt/eval respectively, on a MacBook Air M3, Sonoma 14.5

CJJ-amateur-programmer commented 2 weeks ago

This might be a Windows issue then? But it's weird if it only affects this model. Maybe try another GGUF export?

I'm on macOS, but it works without flash attention. Here's an example of what I'm running, using this Q6_K quant:

llamafile -m Qwen2-1.5B-Instruct.Q6_K.gguf -ngl 999 -t 0.0 -c 0 --no-display-prompt --repeat-penalty 1.1 -p "<some chatml prompt>" --gpu APPLE

I get the expected output, at 440/60 tokens/s for the prompt/eval respectively, on a MacBook Air M3, Sonoma 14.5

Maybe it really is an issue related only to Windows or CUDA. I tried your command line, changing only the --gpu option, but the cmd window immediately crashed with an unexpected error and disappeared without a trace.
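
For clarity, the adapted command was presumably something like the line below, swapping only the --gpu target; the exact invocation wasn't recorded, so treat it as illustrative:

REM presumed Windows/NVIDIA adaptation of the command above; only the --gpu target is changed
llamafile.exe -m Qwen2-1.5B-Instruct.Q6_K.gguf -ngl 999 -t 0.0 -c 0 --no-display-prompt --repeat-penalty 1.1 -p "<some chatml prompt>" --gpu NVIDIA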

lovenemesis commented 2 weeks ago

Managed to reproduce on an AMD machine running Linux with llamafile 0.8.6:

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./qwen2-7b-instruct-q8_0.llamafile -ngl 999 --nocompile

The output is filled with @@@@@@.

The same goes for the 1.5B version:

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./qwen2-1_5b-instruct-q5_k_m.llamafile -ngl 999 --nocompile

Interestingly, there is no such issue on the same machine with the 0.5B model:

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./qwen2-0_5b-instruct-q5_k_m.llamafile -ngl 999 --nocompile

Also, there is no issue with any of these models on CPU-based inference.
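
For comparison, a CPU-based run would look roughly like the following; using -ngl 0 to keep all layers on the CPU is an assumption about how these tests were done:

# illustrative CPU-only run: no layers offloaded, so inference stays on the CPU
./qwen2-7b-instruct-q8_0.llamafile -ngl 0 --nocompile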

lovenemesis commented 5 days ago

Still happening with the 0.8.7 release on GPU-based inference. I guess an update is required.