ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: macOS GitHub Actions hosted runners hang when running small models #8617

Closed: maruel closed this issue 1 week ago

maruel commented 1 month ago

What happened?

For my project https://github.com/maruel/sillybot, I run a quick test with Qwen2 0.5B on Ubuntu, macOS, and Windows, using the free GitHub-hosted VMs. The Qwen2 0.5B model is usable in Q3_K_M for simple unit-testing queries, is only 356MiB, and is Apache 2.0 licensed. It takes seconds to download on the GitHub runners.
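Roughly, the CI step boils down to something like the sketch below. The Hugging Face repo, filename, and flags are illustrative assumptions, not the actual workflow:

```
# Illustrative only: fetch a small GGUF and start llama-server for a smoke test.
# The repo/filename here are assumptions; any small quantized model would do.
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-GGUF \
  qwen2-0_5b-instruct-q3_k_m.gguf --local-dir models
./llama-server -m models/qwen2-0_5b-instruct-q3_k_m.gguf --port 8080 &
```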

It works great! What is weird is that small models like Qwen2 1.5B GGUF and Phi-3 mini (3.8B) GGUF fail on the GitHub-provided macOS runners, either with a complete hang or by returning nothing. I cannot reproduce this locally on my M3 Max, which makes the whole thing weird.

I see that https://github.com/ggerganov/llama.cpp/blob/master/.github/workflows/server.yml runs on Ubuntu and Windows but not macOS.

What would folks think about adding a macOS job to that server workflow, using a small model?

I say Qwen2 0.5B, but whatever small model is fine.

Somewhat related to issue #3469.

Name and Version

Using b3428, the current release as of writing.

What operating system are you seeing the problem on?

Mac

Relevant log output

Success with Qwen2 0.5B: https://github.com/maruel/sillybot/actions/runs/10030262026 See how fast it is; the slowest step is setup-go on Windows. :(

Failure with Qwen2 1.5B on macOS only: https://github.com/maruel/sillybot/actions/runs/10029977170/job/27718801617. Times out after 5 minutes, which doesn't make sense for the trivial prompt "You are an AI assistant. You strictly follow orders. Reply exactly with what is asked of you." + "reply with \"ok chief\""

ggerganov commented 1 month ago

I think there was some problem with the GitHub runners virtualizing the M1 GPU, though I don't fully understand its impact. Could you try the failing tests with -ngl 0 added to the llama-server arguments, so that the computation runs on the CPU?
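For example, something along these lines (the model path and port are just placeholders):

```
# -ngl 0 (--n-gpu-layers 0) keeps all layers on the CPU, avoiding the paravirtualized GPU
./llama-server -m models/qwen2-1_5b-instruct-q3_k_m.gguf -ngl 0 --port 8080
```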

maruel commented 1 month ago

Indeed,

What makes this bug interesting is that Qwen2 0.5B is not affected, and IIRC Mistral 7B Q2_K had also succeeded. Anything 7B is on the large side, though; the download overhead becomes significant, so I prefer smaller models for pure unit testing. Also, I feel bad for Hugging Face.

I agree with you it's likely a GPU virtualization bug. I have contacts at GitHub, I'll inquire.

maruel commented 1 month ago

I created https://github.com/maruel/github_macos_gpu_bug with a minimized repro. I confirm that both ./run_test.py 0 and ./run_test.py 999 succeed on an M3 Max, with 999 being faster.

Reproduction: https://github.com/maruel/github_macos_gpu_bug/actions/runs/10042288722

I'm pinging my contact to see if they can help.

ggerganov commented 1 month ago

Looking at the logs, here is something relevant:

ggml_metal_init: allocating
ggml_metal_init: found device: Apple Paravirtual device
ggml_metal_init: picking default device: Apple Paravirtual device
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple Paravirtual device
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = false
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  5010.80 MB

Because both simdgroup reduction support and simdgroup matrix mul. support are false, most of the Metal kernels are marked as not supported:

```
ggml_metal_init: skipping kernel_soft_max_f16 (not supported)
ggml_metal_init: skipping kernel_soft_max_f16_4 (not supported)
ggml_metal_init: skipping kernel_soft_max_f32 (not supported)
ggml_metal_init: skipping kernel_soft_max_f32_4 (not supported)
ggml_metal_init: skipping kernel_rms_norm (not supported)
ggml_metal_init: skipping kernel_group_norm (not supported)
ggml_metal_init: skipping kernel_mul_mv_f32_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q8_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q2_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q3_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_q6_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq3_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq3_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq1_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq1_m_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq4_nl_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq4_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_f32_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_f16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q8_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q2_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q3_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q6_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq3_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq3_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq1_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq1_m_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq4_nl_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq4_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_f32_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_f16_h128 (not supported)
```

When an operation is not supported by the GPU, llama.cpp will automatically copy the necessary data to the CPU and compute it there.

Since Qwen 0.5B Q3_K and Mistral 7B Q2_K pass the test, this means that all unsupported ops are successfully moved to the CPU and everything works. My guess is that Qwen 1.5B Q2_K uses an op that is not correctly marked as unsupported when the Metal simd instructions are missing:

https://github.com/ggerganov/llama.cpp/blob/6f11a83e4e7700fdf353ed4a29599cb662c792f6/ggml/src/ggml-metal.m#L488-L664

The other explanation is that the generation falls into an endless loop for some reason. Try adding "n_predict": 64 to the HTTP request data so that it generates at most 64 tokens.
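For example, with llama-server listening locally, a capped request looks roughly like this (the prompt and port are just placeholders):

```
# Cap generation at 64 tokens so a runaway generation loop cannot hang the job
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "reply with \"ok chief\"", "n_predict": 64}'
```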

maruel commented 1 month ago

Thanks for the finding! I pushed a new commit with n_predict=64 and it finished quickly. The output is

    "content": " http var W W + W,, W,,, a AR, http,,. a,,, (,,: import import  import import import import import import import. import import import import import:,, import import,, import: import import,,: import import import import import import import",

so it seems like the generation gets into an invalid state, causing an infinite loop.

Ref: https://github.com/maruel/github_macos_gpu_bug/actions/runs/10043110963/job/27755098826

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.