I think there was some problem with GitHub runners virtualizing the M1 GPU, which I don't fully understand the impact of. Could you try the failing tests with `-ngl 0` added to the `llama-server` arguments, so that the computation runs on the CPU?
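(A minimal Python sketch of the suggested invocation, assuming `llama-server` is on `PATH`; the model path and port are placeholders:)

```python
import subprocess

# Start llama-server with zero layers offloaded to the GPU, forcing
# CPU-only computation. The model path and port are placeholders.
server = subprocess.Popen([
    "llama-server",
    "-m", "model.gguf",  # placeholder model path
    "-ngl", "0",         # -ngl is short for --n-gpu-layers
    "--port", "8080",
])
```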
Indeed:

- `-ngl 0`: https://github.com/maruel/sillybot/actions/runs/10040037929/job/27745233509; the whole test suite, including overheads, downloading qwen2-1_5b-instruct-q2_k (10s), and running a trivial test, takes 1m2s.
- `-ngl 9999`: https://github.com/maruel/sillybot/actions/runs/10040125015/job/27745511747; hangs during the test suite (the timeout is 10 min, so it's still ongoing).

What makes this bug interesting is that Qwen2 0.5B is not affected, and IIRC Mistral 7B Q2_K also succeeded. Anything 7B is on the large side, and the download overhead becomes significant, so I prefer smaller models for pure unit testing. Also, I feel bad for Hugging Face.
I agree with you that it's likely a GPU virtualization bug. I have contacts at GitHub; I'll inquire.
I created https://github.com/maruel/github_macos_gpu_bug with a minimized repro. I confirm that both `./run_test.py 0` and `./run_test.py 999` succeed on an M3 Max, with `999` being faster.
Reproduction: https://github.com/maruel/github_macos_gpu_bug/actions/runs/10042288722
I'm pinging my contact to see if they can help.
Looking at the logs, here is something relevant:
```
ggml_metal_init: allocating
ggml_metal_init: found device: Apple Paravirtual device
ggml_metal_init: picking default device: Apple Paravirtual device
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple Paravirtual device
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support = false
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5010.80 MB
```
Because both `simdgroup reduction support` and `simdgroup matrix mul. support` are `false`, most of the Metal kernels are marked as not supported.
When an operation is not supported by the GPU, `llama.cpp` will automatically copy the necessary data to the CPU and compute it there.
Since Qwen2 0.5B Q3_K and Mistral 7B Q2_K pass the test, all unsupported ops must be successfully moved to the CPU, and everything works. My guess is that Qwen2 1.5B Q2_K uses an op that is not correctly marked as unsupported when the Metal simd instructions are missing.
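To illustrate the hypothesis, here is a conceptual Python sketch; it is not llama.cpp's actual scheduler, and the op names and support predicate are made up:

```python
# Conceptual sketch only: NOT llama.cpp's actual scheduler, and the op
# names are made up. It illustrates the fallback described above; an op
# that wrongly reports "supported" on the paravirtual device would stay
# on the GPU path and could produce garbage.
UNSUPPORTED_WITHOUT_SIMDGROUP = {"MUL_MAT_Q2_K", "FLASH_ATTN"}  # hypothetical

def gpu_supports(op: str, has_simdgroup: bool) -> bool:
    return has_simdgroup or op not in UNSUPPORTED_WITHOUT_SIMDGROUP

def run_graph(ops: list[str], has_simdgroup: bool) -> None:
    for op in ops:
        where = "GPU" if gpu_supports(op, has_simdgroup) else "CPU (fallback)"
        print(f"{op:14s} -> {where}")

# The paravirtual device reports no simdgroup support:
run_graph(["ADD", "MUL_MAT_Q2_K", "FLASH_ATTN", "NORM"], has_simdgroup=False)
```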
The other explanation is that the generation falls into an endless loop for some reason. Try adding `"n_predict": 64` to the HTTP request data so that it generates a maximum of 64 tokens.
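For example, a minimal sketch of such a request against llama-server's `/completion` endpoint, assuming the default `localhost:8080`; the prompt is illustrative:

```python
import json
import urllib.request

# Cap generation at 64 tokens so a runaway loop terminates quickly.
payload = {
    "prompt": 'reply with "ok chief"',  # illustrative prompt
    "n_predict": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```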
Thanks for the finding! I pushed a new commit with `n_predict=64` and it finished quickly. The output is:

```
"content": " http var W W + W,, W,,, a AR, http,,. a,,, (,,: import import import import import import import import import. import import import import import:,, import import,, import: import import,,: import import import import import import import",
```

so it seems like it gets into an invalid state, causing an infinite loop.
Ref: https://github.com/maruel/github_macos_gpu_bug/actions/runs/10043110963/job/27755098826
This issue was closed because it has been inactive for 14 days since being marked as stale.
### What happened?
For my project https://github.com/maruel/sillybot, I run a quick test with Qwen2 0.5B on Ubuntu, macOS and Windows, using the free GitHub VMs. The Qwen2 0.5B model is usable in Q3_K_M for simple unit-testing queries, is only 356MiB, and is Apache 2.0 licensed. It takes seconds to download on the GitHub runners.

It works great! What is weird is that slightly larger models like Qwen2 1.5B GGUF and Phi-3 mini (3.8B) GGUF fail on GitHub-provided macOS runners, either with a complete hang or by returning nothing. I cannot reproduce this locally on my M3 Max, which makes the whole thing weird.
I see that https://github.com/ggerganov/llama.cpp/blob/master/.github/workflows/server.yml runs on Ubuntu and Windows but not macOS.

What would folks think about extending `workflows/server.yml` and `examples/server/tests` to download Qwen2 0.5B Q3_K_M from Hugging Face and run a quick query on it (a rough sketch follows below)? I say Qwen2 0.5B, but whatever small model is fine.
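A rough Python sketch of what such a smoke test could look like, assuming `llama-server` is already built and `huggingface_hub` is installed; the repo/file names, port, and prompt are illustrative assumptions, not a drop-in test:

```python
import json
import subprocess
import time
import urllib.request

from huggingface_hub import hf_hub_download

# 1. Fetch a small GGUF model (~350 MiB) from Hugging Face.
model = hf_hub_download(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",     # illustrative repo
    filename="qwen2-0_5b-instruct-q3_k_m.gguf",  # illustrative file
)

# 2. Start llama-server on the CPU; -ngl 0 sidesteps the paravirtual GPU.
server = subprocess.Popen(
    ["llama-server", "-m", model, "-ngl", "0", "--port", "8080"]
)
try:
    # 3. Wait for the server to come up.
    for _ in range(60):
        try:
            urllib.request.urlopen("http://localhost:8080/health", timeout=1)
            break
        except OSError:
            time.sleep(1)

    # 4. Run one short, bounded query and check that something came back.
    req = urllib.request.Request(
        "http://localhost:8080/completion",
        data=json.dumps({"prompt": 'reply with "ok chief"',
                         "n_predict": 16}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        assert json.loads(resp.read())["content"].strip()
finally:
    server.terminate()
```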
Somewhat related to issue #3469.
### Name and Version
Using b3428, current release as of now.
### What operating system are you seeing the problem on?
Mac
### Relevant log output
Success with Qwen2 0.5B: https://github.com/maruel/sillybot/actions/runs/10030262026. See how fast it is; the slowest step is setup-go on Windows. :(

Failure with Qwen2 1.5B on macOS only: https://github.com/maruel/sillybot/actions/runs/10029977170/job/27718801617. Times out after 5 minutes, which doesn't make sense for the trivial prompt:

```
"You are an AI assistant. You strictly follow orders. Reply exactly with what is asked of you."
+ "reply with \"ok chief\""
```