Closed by AndreasKunar 3 months ago
I think that's the same bug that happens on Snapdragon phones: #5186
Some shader compiler bug in the Adreno driver. Might be good to report it to Qualcomm.
@0cc4m thanks a lot. I will look into it.
I also tried to run it on WSL2 with its Microsoft CPU-emulated Vulkan driver. There it does not crash, but the generated results are garbage (and it is also very slow).
I have tried to reach out to Qualcomm and will see if they answer.
Have you experimented with https://apps.microsoft.com/detail/9nqpsl29bfff?hl=en-US&gl=US, which facilitates the conversion of Vulkan shaders to D3D12 shaders?
@sykuang - thanks, its latest version installs automatically on the Surface with Snapdragon X. It implements a Microsoft Vulkan driver, which shows up alongside the Qualcomm Vulkan driver in vulkaninfo. llama.cpp's Vulkan backend picks the native Qualcomm driver (probably derived from their Android work), which seems to implement more/better Vulkan features than Microsoft's translation driver on top of the (Qualcomm-provided) native DX12 driver.
I will have a look at whether I can tweak the llama.cpp Vulkan backend to use the Microsoft translation driver, and will report here if I manage it and it works.
@AndreasKunar To try it, you just have to pick the device manually by setting the environment variable GGML_VK_VISIBLE_DEVICES to the index of the device you want from vulkaninfo --summary.
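As a quick illustration, device selection might look like the following shell session. The device index and model path are purely illustrative (on cmd.exe, use `set` instead of `export`):

```shell
# Illustrative: after checking device indices with `vulkaninfo --summary`,
# force llama.cpp's Vulkan backend onto device 1 (e.g. the Microsoft
# Dozen/D3D12 driver). On cmd.exe: set GGML_VK_VISIBLE_DEVICES=1
export GGML_VK_VISIBLE_DEVICES=1
echo "GGML_VK_VISIBLE_DEVICES=$GGML_VK_VISIBLE_DEVICES"
# then run, for example (model path illustrative):
#   llama-cli -m tinyllama-1.1b/ggml-model-f16.gguf -ngl 99
```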
@0cc4m and @sykuang - thanks a lot!
Now with Microsoft's Vulkan to DX12 driver selected, the error changes from "llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown" to "llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorOutOfHostMemory". Trace with tinyllama-1.1b/ggml-model-f16.gguf below:
ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(1)
Vulkan0: Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU) (Dozen) | uma: 1 | fp16: 1 | warp size: 64
ggml_vk_get_device(0)
Initializing new vk_device
ggml_vk_find_queue_family_index()
ggml_vk_find_queue_family_index()
ggml_vk_create_queue()
ggml_vk_load_shaders(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU))
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f32_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_f16_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU), matmul_q4_0_f32_l, main, 3, 56, (128,128,1), specialization_constants, 128)
MESA: error: == VALIDATION ERROR =============================================
error: Total Thread Group Shared Memory storage is 33792, exceeded 32768.
Validation failed.
That means that the D3D12 translation layer doesn't provide enough shared memory for the big (_l) matrix multiplication shaders. You might be able to work around that by disabling them for Qualcomm GPUs (I don't think it's likely they are helpful here, they are meant for big dedicated GPUs) and not loading the pipelines in that case.
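The suggested workaround can be sketched as a simple filter: skip any pipeline whose declared shared-memory footprint exceeds the device limit. Only the 33792-byte figure for the `_l` shader and the 32768-byte limit come from the validation error above; the pipeline names and the `_m`/`_s` byte counts are hypothetical stand-ins, not llama.cpp's real tables.

```python
# Hedged sketch of the workaround: skip pipelines whose shared-memory
# requirement exceeds the device limit (32768 bytes per the D3D12
# validation error). The _m/_s byte counts below are hypothetical.
def pipelines_to_load(max_shared_bytes, requirements):
    """Keep only the pipelines that fit within max_shared_bytes."""
    return [name for name, need in requirements.items()
            if need <= max_shared_bytes]

reqs = {
    "matmul_f32_l": 33792,  # from the validation error: too big
    "matmul_f32_m": 16896,  # hypothetical
    "matmul_f32_s": 8448,   # hypothetical
}
print(pipelines_to_load(32768, reqs))  # → ['matmul_f32_m', 'matmul_f32_s']
```

In llama.cpp itself this would mean checking the device limit (e.g. via `VkPhysicalDeviceLimits::maxComputeSharedMemorySize`) before calling `ggml_vk_create_pipeline` for the large variants.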
@0cc4m and @AndreasKunar, I wanted to let you know that thanks to @0cc4m's input, I can run phi-3 on the Snapdragon X platform.
@sykuang - cool! What did you do exactly? Just reduce/edit the kernels (which ones?) and set GGML_VK_VISIBLE_DEVICES=1 (to the DX12 driver)?
@AndreasKunar, I've adjusted the code in ggml_vk_load_shaders by commenting out certain sections, which has resolved the issue where DirectX was reporting out-of-memory errors. You can refer to https://github.com/sykuang/llama.cpp/commit/b18e64826af69fe3765cd3b03c8dcd831a6bceca
Thanks, I tried to get it to run, but once I offloaded Phi-3-4k layers onto the GPU, either the results got strange or llama-cli crashed.
Vulkan performance on the Snapdragon X Plus was also much worse than e.g. Q4_0_4_8 - Q4 vs. Q4_0_4_8 vs. Vulkan:
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | CPU | 10 | pp512 | 68.99 ± 11.74 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | CPU | 10 | tg128 | 26.42 ± 8.15 |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | CPU | 10 | pp512 | 208.45 ± 73.38 |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | CPU | 10 | tg128 | 34.83 ± 5.59 |
Vulkan0: Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU) (Dozen) | uma: 1 | fp16: 1 | warp size: 64

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | Vulkan | 99 | 10 | pp512 | 40.67 ± 0.41 |
| phi3 3B Q4_0_4_8 | 2.03 GiB | 3.82 B | Vulkan | 99 | 10 | tg128 | 5.71 ± 0.20 |
build: 081fe431 (3441)
Can you maybe show your performance results (llama-bench -p 512 -n 128)? Currently it looks to me like Q4_0_4_8 quantization using the CPU's matmul acceleration is much faster than Vulkan.
Q4_0_4_8 is not supported by the Vulkan backend, it will run on the CPU.
Thanks a lot, mea culpa, I did not know this. However, even the "reduced-kernels" build with the Vulkan backend and the Vulkan-to-DirectX12 driver, as well as the Phi-3 Q4 models, run out of memory on load for me. The Q4_0_4_8 did at least run llama-bench, and sorry, I never verified the llama-cli results (which, I now see, are garbage).
What I'm trying to find out is whether Vulkan/GPU on Snapdragon X for Q4 is (or can be) faster than the Q4_0_4_8-optimized kernels on the CPU.
@AndreasKunar You made it further than I did when I last attempted this. I didn't even get as far as building the native ARM64 Vulkan loader. I had to rely on the Vulkan wrapper provided by Microsoft, and yes, it was full of garbage and really slow :(
Hit me up with anything you want help testing.
I'm going to try building Vulkan-Loader vulkan-1.lib locally. Was it straightforward?
Thanks!!! My problem is that the native Qualcomm Vulkan driver does not load (according to 0cc4m, a bug similar to the one on Android). And the Microsoft Vulkan-to-DirectX12 translation driver runs out of memory, and also seems very slow.
So currently I have given up, because the Q4_0_4_8 acceleration for the Snapdragon X CPU is now nearly as fast as my M2 Mac's 10-core GPU (which in theory should be faster than the Snapdragon's GPU). I am waiting to see what the work on QNN (Qualcomm NPU) in PR #6869 achieves - probably not speed, but less power consumption. My Surface Pro 11 tends to overheat and throttle, even with its Snapdragon X Plus, when running all cores at full load with llama.cpp.
Totally straightforward. I suggest using the llama.cpp build instructions for WoA (my PR with the description just got merged) to set up VS2022 + tools. git-clone Vulkan-Loader, then build with cmake ... -D UPDATE_DEPS=ON. Copy the resulting vulkan-1.lib into the VulkanSDK directory (rename the one there first). BE CAREFUL with copying the vulkan-1.dll which also gets built - I broke my Windows once because of mismatches between arm64 and x64 in the paths (DLL hell).
Current status - I can't get llama.cpp/Vulkan to run under Windows on ARM with Snapdragon X (Surface Pro 11 base model) and am giving up for now.
When debugging it, an internal exception is always thrown after the call:
ggml_vk_create_pipeline(device, device->pipeline_dequant_mul_mat_mat[GGML_TYPE_Q4_K]->l, "matmul_q4_k_f32_l", matmul_q4_k_f32_len, matmul_q4_k_f32_data, "main", 3, sizeof(vk_mat_mat_push_constants), l_wg_denoms, warptile_mmq_l, l_align);
whose cause I cannot debug further. Stepping into the call just produces a Debug Error! abort() has been called somewhere in Vulkan or the C++ runtime. And the call parameters don't look different from the similar ones in the preceding code lines.
With the Snapdragon X's Adreno GPU probably not faster than the Q4_0_4_8 CPU acceleration anyway, I'm giving up on llama.cpp/Vulkan on the Snapdragon X and closing this issue. Q4_0_4_8 on the Snapdragon X CPU has approximately the same performance as Q4_0 on my M2 with its 10-core GPU.
I'm shifting to trying to work with the ollama team to get ollama to run on WoA with the Snapdragon X and support Q4_0_4_8.
What happened?
The new Copilot+ PCs with Qualcomm Snapdragon X processors (in my case a Surface Pro 11 with Snapdragon X Plus and 16GB RAM) are fast and run llama.cpp on the CPU without issues. They also include a Vulkan driver and run the Vulkan samples without problems. But llama.cpp built with Vulkan (which now finally builds) does not run.
llama-cli is terminating on model load with:
llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown
llama_load_model_from_file: failed to load model
main: error: unable to load model
Name and Version
llama-cli version: 3378 (71c1121d) with a quick-fix to compile (see #8446), built with MSVC 19.40.33812.0 for ARM64
built with: Installed the VulkanSDK for Windows x64, then built a Windows arm64 version of KhronosGroup/Vulkan-Loader vulkan-1.lib (and tested its functionality with tests + samples), and copied it to the VulkanSDK lib directory for building llama.cpp.
What operating system are you seeing the problem on?
Windows
Relevant log output
console output.txt main.log vulkaninfo.txt