ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: llama.cpp with Vulkan not running on Snapdragon X + Windows (Copilot+PCs) #8455

Closed AndreasKunar closed 3 months ago

AndreasKunar commented 4 months ago

What happened?

The new Copilot+PCs with Qualcomm Snapdragon X processors (in my case a Surface Pro 11 with Snapdragon X Plus and 16GB RAM) are fast and run llama.cpp on the CPU w/o issues. They also include a Vulkan driver and run the Vulkan samples w/o problems. But llama.cpp built with Vulkan (which now finally builds) does not run.

llama-cli terminates on model load with:

llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown
llama_load_model_from_file: failed to load model
main: error: unable to load model

Name and Version

llama-cli version: 3378 (71c1121d) with a quick-fix to compile (see #8446), built with MSVC 19.40.33812.0 for ARM64

Build setup: installed the VulkanSDK for Windows x64, then built a Windows arm64 version of the KhronosGroup/Vulkan-Loader vulkan-1.lib (and tested its functionality with the tests and samples) and copied it to the VulkanSDK lib directory for building llama.cpp.

REM including Vulkan diagnostics
> cmake -B build -DGGML_VULKAN=1 -DGGML_VULKAN_DEBUG=1 -DGGML_VULKAN_MEMORY_DEBUG=1
> cmake --build build --config Release --target llama-cli

What operating system are you seeing the problem on?

Windows

Relevant log output

console output.txt main.log vulkaninfo.txt

0cc4m commented 4 months ago

I think that's the same bug that happens on Snapdragon phones: #5186

Some shader compiler bug in the Adreno driver. Might be good to report it to Qualcomm.

AndreasKunar commented 4 months ago

@0cc4m thanks a lot. I will look into it.

I also tried running it on WSL2 with its Microsoft CPU-emulated Vulkan driver. There it does not crash, but the generated results are garbage (and generation is very slow).

I have tried to reach out to Qualcomm and will see if they answer.

sykuang commented 3 months ago

Have you experimented with the package at https://apps.microsoft.com/detail/9nqpsl29bfff?hl=en-US&gl=US that facilitates the conversion of Vulkan shaders to D3D12 shaders?

AndreasKunar commented 3 months ago

@sykuang - thanks, its latest version installs automatically on the Surface with Snapdragon X. It implements a Microsoft Vulkan driver, which appears in vulkaninfo alongside the Qualcomm Vulkan driver. llama.cpp's Vulkan backend picks the native Qualcomm driver (probably derived from their Android work), which seems to implement more/better Vulkan features than Microsoft's driver, which translates to the (Qualcomm-provided) native DX12 driver.

I will see whether I can tweak the llama.cpp Vulkan backend to use the Microsoft translation driver, and will report here if I manage it and it works.

0cc4m commented 3 months ago

@AndreasKunar To try it, you just have to pick the device manually by setting the environment variable GGML_VK_VISIBLE_DEVICES to the index of the device you want, as listed by vulkaninfo --summary.
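A minimal sketch of scripting that device selection, assuming llama-cli is launched from Python. The helper names and the paths passed in are hypothetical and not from this thread; only the GGML_VK_VISIBLE_DEVICES variable comes from the comment above.

```python
import os
import subprocess


def vulkan_env(device_index, base_env=None):
    """Return a copy of the environment with GGML_VK_VISIBLE_DEVICES set.

    device_index is the index reported by `vulkaninfo --summary`.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GGML_VK_VISIBLE_DEVICES"] = str(device_index)
    return env


def run_llama_cli(device_index, cli_path, model_path):
    """Launch llama-cli on the chosen Vulkan device and return its exit code.

    cli_path and model_path are placeholders supplied by the caller.
    """
    cmd = [cli_path, "-m", model_path, "-ngl", "99", "-p", "Hello"]
    return subprocess.run(cmd, env=vulkan_env(device_index)).returncode
```

Calling run_llama_cli(1, ...) would then try the second device in the list, leaving the parent shell's environment untouched.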

AndreasKunar commented 3 months ago

@0cc4m and @sykuang - thanks a lot!

Now with Microsoft's Vulkan to DX12 driver selected, the error changes from "llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown" to "llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorOutOfHostMemory". Trace with tinyllama-1.1b/ggml-model-f16.gguf below:

ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(1)
Vulkan0: Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU) (Dozen) | uma: 1 | fp16: 1 | warp size: 64
ggml_vk_get_device(0)
Initializing new vk_device
ggml_vk_find_queue_family_index()
ggml_vk_find_queue_family_index()
ggml_vk_create_queue()
ggml_vk_load_shaders(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU))
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_q4_0_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
MESA: error: == VALIDATION ERROR =============================================
error: Total Thread Group Shared Memory storage is 33792, exceeded 32768.
Validation failed.

0cc4m commented 3 months ago

That means that the D3D12 translation layer doesn't provide enough shared memory for the big (_l) matrix multiplication shaders. You might be able to work around that by disabling them for Qualcomm GPUs (I don't think it's likely they are helpful here, they are meant for big dedicated GPUs) and not loading the pipelines in that case.
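To illustrate the arithmetic behind that error, here is a rough sketch that filters pipelines by their shared memory requirement. The two byte counts come from the validation error in the trace; the function names and any other per-pipeline sizes are hypothetical, and this is not how ggml_vk_load_shaders is actually structured.

```python
# Both values are taken from the validation error in the trace above.
SHARED_MEMORY_LIMIT = 32768      # bytes allowed by the D3D12 translation layer
FAILING_SHADER_REQUEST = 33792   # bytes requested when validation failed

# How far the failing request exceeds the limit, in bytes.
OVERSHOOT = FAILING_SHADER_REQUEST - SHARED_MEMORY_LIMIT


def fits_shared_memory(required_bytes, limit_bytes=SHARED_MEMORY_LIMIT):
    """True if a shader's shared memory request fits within the limit."""
    return required_bytes <= limit_bytes


def loadable_pipelines(requirements, limit_bytes=SHARED_MEMORY_LIMIT):
    """Keep only the pipeline names whose request fits the limit.

    requirements maps a pipeline name to its shared memory need in bytes.
    """
    return [name for name, needed in requirements.items()
            if fits_shared_memory(needed, limit_bytes)]
```

The failing request is 1024 bytes (1 KiB) over the 32 KiB limit, so skipping the large variants on such devices would avoid this particular validation error.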

sykuang commented 3 months ago

@0cc4m and @AndreasKunar, I wanted to let you know that thanks to @0cc4m's input, I can run phi-3 on the Snapdragon X platform.

AndreasKunar commented 3 months ago

@sykuang - cool! What did you do exactly? Just reduce/edit the kernels (which ones?) and set GGML_VK_VISIBLE_DEVICES=1 (to the D3D12 driver)?

sykuang commented 3 months ago

@AndreasKunar, I've adjusted the code in ggml_vk_load_shaders by commenting out certain sections, which has resolved the issue where DirectX was reporting out-of-memory errors. You can refer to https://github.com/sykuang/llama.cpp/commit/b18e64826af69fe3765cd3b03c8dcd831a6bceca

AndreasKunar commented 3 months ago

Thanks, I tried to get it to run, but once I offloaded Phi3-4k layers onto the GPU, either the results got strange or llama-cli crashed.

Vulkan performance on Snapdragon X Plus was also much worse than e.g. Q4_0_4_8. Q4 vs. Q4_0_4_8 vs. Vulkan:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | CPU | 10 | pp512 | 68.99 ± 11.74 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | CPU | 10 | tg128 | 26.42 ± 8.15 |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | CPU | 10 | pp512 | 208.45 ± 73.38 |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | CPU | 10 | tg128 | 34.83 ± 5.59 |

Vulkan0: Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU) (Dozen) | uma: 1 | fp16: 1 | warp size: 64

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | Vulkan | 99 | 10 | pp512 | 40.67 ± 0.41 |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | Vulkan | 99 | 10 | tg128 | 5.71 ± 0.20 |

build: 081fe431 (3441)

Can you maybe show your performance (llama-bench -p 512 -n 128) results? Currently for me it looks like Q4_0_4_8 quantization using the CPU's MATMUL is much faster than Vulkan.
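For a quick comparison, the mean t/s values from the tables above can be turned into ratios. This is only a sketch: the dictionary layout and function name are made up for illustration, and it ignores the ± spread, which is large for some of the CPU runs.

```python
# Mean tokens/s copied from the llama-bench tables above:
# (quantization, backend) -> {test: t/s}
RESULTS = {
    ("Q4_K - Medium", "CPU"): {"pp512": 68.99, "tg128": 26.42},
    ("Q4_0_4_8", "CPU"): {"pp512": 208.45, "tg128": 34.83},
    ("Q4_0_4_8", "Vulkan"): {"pp512": 40.67, "tg128": 5.71},
}


def ratio(numerator, denominator, test):
    """Ratio of mean t/s between two (quantization, backend) runs."""
    return RESULTS[numerator][test] / RESULTS[denominator][test]


if __name__ == "__main__":
    for test in ("pp512", "tg128"):
        cpu_gain = ratio(("Q4_0_4_8", "CPU"), ("Q4_K - Medium", "CPU"), test)
        vs_vulkan = ratio(("Q4_0_4_8", "CPU"), ("Q4_0_4_8", "Vulkan"), test)
        print(f"{test}: Q4_0_4_8 CPU vs Q4_K CPU {cpu_gain:.2f}x, "
              f"Q4_0_4_8 CPU vs Vulkan run {vs_vulkan:.2f}x")
```

On these numbers the Q4_0_4_8 CPU run is roughly 3x the Q4_K CPU run for prompt processing and roughly 5x to 6x the Vulkan run.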

slaren commented 3 months ago

Q4_0_4_8 is not supported by the Vulkan backend, it will run on the CPU.

AndreasKunar commented 3 months ago

Thanks a lot, mea culpa, I did not know this. However, even with the "reduced kernels", the Vulkan backend with the Vulkan-to-DirectX12 driver runs out of memory on load for me with the Phi3 Q4 models. The Q4_0_4_8 model did at least run in llama-bench, and sorry, I never verified the llama-cli results (which I now see are garbage).

What I'm trying to find out is whether Vulkan/GPU on Snapdragon X for Q4 is (or can be) faster than the Q4_0_4_8 optimized kernels on the CPU.

hmartinez82 commented 3 months ago

@AndreasKunar You made it further than I did when I last attempted this. I didn't even get as far as building the native ARM64 Vulkan loader. I had to rely on the Vulkan wrapper provided by Microsoft, and yes, the output was full of garbage and really slow :(

Hit me up with anything you want help testing.

I'm going to try building Vulkan-Loader vulkan-1.lib locally. Was it straightforward?

AndreasKunar commented 3 months ago

Thanks!!! My problem is that the native Qualcomm Vulkan driver does not load the model (according to 0cc4m, a bug similar to the one on Android). And the Microsoft Vulkan-to-DirectX12 translation driver runs out of memory and also seems very slow.

So currently I have given up, because the Q4_0_4_8 acceleration for the Snapdragon X CPU is now nearly as fast as my M2 Mac's 10-core GPU (which should in theory be faster than the Snapdragon's GPU). I am waiting to see what the work on QNN (Qualcomm NPU) in PR#6869 achieves - probably not speed, but lower power consumption. My Surface Pro 11 tends to overheat and throttle even with its Snapdragon X Plus when running all cores at full load with llama.cpp.

On building Vulkan-Loader vulkan-1.lib locally: totally straightforward. I suggest using the llama.cpp build instructions for WoA (my PR with the description just got merged) to set up VS2022+tools. Git-clone Vulkan-Loader and build it with cmake ... -D UPDATE_DEPS=ON. Copy the resulting vulkan-1.lib into the VulkanSDK lib directory (rename the one already there). BE CAREFUL with copying the vulkan-1.dll which also gets built - I broke my Windows once because of mismatches between arm64 and x64 in the paths (DLL hell).
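One way to guard against that arm64/x64 mix-up is to read the machine field from a DLL's PE header before copying it anywhere. A minimal sketch, with hypothetical function names; it applies to PE images such as vulkan-1.dll, not to .lib files, and assumes the PE header sits within the first 4096 bytes.

```python
import struct

# Machine values from the PE/COFF file header.
MACHINE_NAMES = {0x014C: "x86", 0x8664: "x64", 0xAA64: "arm64"}


def pe_machine(data):
    """Return the target architecture named in a PE image's COFF header."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE image (missing MZ header)")
    # The 4-byte little-endian offset of the PE signature is stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 2-byte Machine field directly follows the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, hex(machine))


def dll_machine(path):
    """Read the start of a DLL or EXE file and report its architecture."""
    with open(path, "rb") as handle:
        return pe_machine(handle.read(4096))
```

Checking that dll_machine reports "arm64" for the freshly built vulkan-1.dll before it goes near a system path would catch the mismatch early.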

AndreasKunar commented 3 months ago

Current status - I can't get llama.cpp/Vulkan to run under Windows on ARM with Snapdragon X (Surface Pro 11 base model) and am giving up for now.

When debugging it, there is always an internal exception thrown after this call, whose cause I cannot debug further:

ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_mat[GGML_TYPE_Q4_K]->l, "matmul_q4_k_f32_l", matmul_q4_k_f32_len, matmul_q4_k_f32_data, "main", 3, sizeof(vk_mat_mat_push_constants), l_wg_denoms, warptile_mmq_l, l_align);

Stepping into the call just yields "Debug Error! abort() has been called" from somewhere in Vulkan or the C++ runtime. And the call parameters don't look any different from those of the similar calls in the preceding code lines.

With the Snapdragon X's Adreno GPU probably not faster than the Q4_0_4_8 CPU acceleration anyway, I'm giving up on llama.cpp/Vulkan on the Snapdragon X and closing this issue. Q4_0_4_8 on the Snapdragon X CPU has approx. the same performance as Q4_0 on my 10-GPU-core M2.

I'm shifting to try and work with the ollama team to get ollama to run on WoA with the Snapdragon X and support Q4_0_4_8.