ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Subtle Vulkan shader compilation bug when running on Adreno GPUs (Samsung Galaxy S23 Ultra) #5186

Closed l3utterfly closed 1 week ago

l3utterfly commented 7 months ago

GPU info:

QUALCOMM build          : 7b26bdd942, Iab69c31769
Build Date              : 08/28/23
Shader Compiler Version : E031.41.03.44
Local Branch            :
Remote Branch           : refs/tags/AU_LINUX_ANDROID_LA.VENDOR.13.2.0.11.00.00.855.659
Remote Branch           : NONE
Reconstruct Branch      : NOTHING
Build Config            : S P 14.1.4 AArch64
Driver Path             : /vendor/lib64/hw/vulkan.adreno.so
Driver Version          : 0676.42
PFP                     : 0x01740158
ME                      : 0x00000000
Application Name        : ggml-vulkan
Application Version     : 0x00000001
Engine Name             : (null)
Engine Version          : 0x00000000
Api Version             : 0x00402000

In ggml_vk_generate_shaders.py, line 640 (dequant_q4_K_body):

The following DOES NOT WORK:

        const int y_idx = i * QUANT_K + 64 * il + n * ir;
        const int qs_idx = 32*il + n * ir;

        uint8_t sc;
        uint8_t m;
        if (is < 4) {
            sc = uint8_t(data_a[i].scales[is] & 63);
            m  = uint8_t(data_a[i].scales[is + 4] & 63);
        } else {
            sc = uint8_t((data_a[i].scales[is + 4] & 0xF) | ((data_a[i].scales[is - 4] >> 6) << 4));
            m  = uint8_t((data_a[i].scales[is + 4] >>  4) | ((data_a[i].scales[is    ] >> 6) << 4));
        }
        const FLOAT_TYPE d1 = dall * sc;
        const FLOAT_TYPE m1 = dmin * m;

        if (is < 4) {
            sc = uint8_t(data_a[i].scales[is + 1] & 63);
            m  = uint8_t(data_a[i].scales[is + 5] & 63);
        } else {
            sc = uint8_t((data_a[i].scales[is + 5] & 0xF) | ((data_a[i].scales[is - 3] >> 6) << 4));
            m  = uint8_t((data_a[i].scales[is + 5] >>  4) | ((data_a[i].scales[is + 1] >> 6) << 4));
        }
        const FLOAT_TYPE d2 = dall * sc;
        const FLOAT_TYPE m2 = dmin * m;

        [[unroll]] for (int l = 0; l < n; ++l) {
            data_b[y_idx + l     ] = D_TYPE(d1 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] & 0xF) - m1);
            data_b[y_idx + l + 32] = D_TYPE(d2 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] >>  4) - m2);
        }

This crashes with the error: Shader compilation failed for shaderType: 5.

The workaround appears to be "tail-ing" the if-branches (i.e. duplicating the code into each branch so the control flow does not re-converge before the loop).

        const int y_idx = i * QUANT_K + 64 * il + n * ir;
        const int qs_idx = 32*il + n * ir;

        uint8_t sc;
        uint8_t m;
        if (is < 4) {
            sc = uint8_t(data_a[i].scales[is] & 63);
            m  = uint8_t(data_a[i].scales[is + 4] & 63);

            const FLOAT_TYPE d1 = dall * sc;
            const FLOAT_TYPE m1 = dmin * m;

            if (is < 4) {
                sc = uint8_t(data_a[i].scales[is + 1] & 63);
                m  = uint8_t(data_a[i].scales[is + 5] & 63);

                const FLOAT_TYPE d2 = dall * sc;
                const FLOAT_TYPE m2 = dmin * m;

                [[unroll]] for (int l = 0; l < n; ++l) {
                    data_b[y_idx + l     ] = D_TYPE(d1 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] & 0xF) - m1);
                    data_b[y_idx + l + 32] = D_TYPE(d2 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] >>  4) - m2);
                }
            } else {
                sc = uint8_t((data_a[i].scales[is + 5] & 0xF) | ((data_a[i].scales[is - 3] >> 6) << 4));
                m  = uint8_t((data_a[i].scales[is + 5] >>  4) | ((data_a[i].scales[is + 1] >> 6) << 4));

                const FLOAT_TYPE d2 = dall * sc;
                const FLOAT_TYPE m2 = dmin * m;

                [[unroll]] for (int l = 0; l < n; ++l) {
                    data_b[y_idx + l     ] = D_TYPE(d1 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] & 0xF) - m1);
                    data_b[y_idx + l + 32] = D_TYPE(d2 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] >>  4) - m2);
                }
            }
        } else {
            sc = uint8_t((data_a[i].scales[is + 4] & 0xF) | ((data_a[i].scales[is - 4] >> 6) << 4));
            m  = uint8_t((data_a[i].scales[is + 4] >>  4) | ((data_a[i].scales[is    ] >> 6) << 4));

            const FLOAT_TYPE d1 = dall * sc;
            const FLOAT_TYPE m1 = dmin * m;

            if (is < 4) {
                sc = uint8_t(data_a[i].scales[is + 1] & 63);
                m  = uint8_t(data_a[i].scales[is + 5] & 63);

                const FLOAT_TYPE d2 = dall * sc;
                const FLOAT_TYPE m2 = dmin * m;

                [[unroll]] for (int l = 0; l < n; ++l) {
                    data_b[y_idx + l     ] = D_TYPE(d1 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] & 0xF) - m1);
                    data_b[y_idx + l + 32] = D_TYPE(d2 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] >>  4) - m2);
                }
            } else {
                sc = uint8_t((data_a[i].scales[is + 5] & 0xF) | ((data_a[i].scales[is - 3] >> 6) << 4));
                m  = uint8_t((data_a[i].scales[is + 5] >>  4) | ((data_a[i].scales[is + 1] >> 6) << 4));

                const FLOAT_TYPE d2 = dall * sc;
                const FLOAT_TYPE m2 = dmin * m;

                [[unroll]] for (int l = 0; l < n; ++l) {
                    data_b[y_idx + l     ] = D_TYPE(d1 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] & 0xF) - m1);
                    data_b[y_idx + l + 32] = D_TYPE(d2 * FLOAT_TYPE(data_a[i].qs[qs_idx + l] >>  4) - m2);
                }
            }
        }

This workaround is confirmed to compile successfully for the Adreno GPU in the Samsung Galaxy S23 Ultra.

This seems to point to a subtle bug in the Adreno shader compiler. Does anyone know what's going on?

0cc4m commented 7 months ago

Is there even a need to compile the shaders on Android? The Adreno GPU driver should be able to just read the SPIR-V code in ggml-vulkan-shaders.hpp, right?

l3utterfly commented 7 months ago

ggml_vk_generate_shaders.py generates SPIR-V bytecode into ggml-vulkan-shaders.hpp, which is then used by ggml-vulkan, right?

ggml-vulkan then creates a pipeline on the GPU using the code in ggml-vulkan-shaders. This is the step that crashed with the error above.

I had to edit the shader code in the Python file, regenerate the .hpp file, then re-run the program to fix it.
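
For context, here is roughly what that pipeline-creation step boils down to at the raw Vulkan level (just a sketch with illustrative names; the real arrays and helpers live in ggml-vulkan-shaders.hpp and ggml-vulkan.cpp):

// Sketch only (not the actual ggml-vulkan code): create a compute pipeline from
// SPIR-V words that were embedded at build time by ggml_vk_generate_shaders.py.
#include <vulkan/vulkan.h>
#include <cstddef>
#include <cstdint>

// Hypothetical embedded blob; the real symbol names differ.
extern const uint32_t dequant_q4_K_spv[];
extern const size_t   dequant_q4_K_spv_len; // length in bytes

VkPipeline create_compute_pipeline(VkDevice device, VkPipelineLayout layout) {
    // Hand the precompiled SPIR-V to the driver. On Adreno this (or the
    // vkCreateComputePipelines call below) is where the failure surfaces,
    // because the driver still compiles the SPIR-V down to its own ISA here.
    VkShaderModuleCreateInfo smci = {};
    smci.sType    = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    smci.codeSize = dequant_q4_K_spv_len;
    smci.pCode    = dequant_q4_K_spv;

    VkShaderModule module = VK_NULL_HANDLE;
    if (vkCreateShaderModule(device, &smci, nullptr, &module) != VK_SUCCESS) {
        return VK_NULL_HANDLE;
    }

    VkComputePipelineCreateInfo cpci = {};
    cpci.sType        = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    cpci.stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    cpci.stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    cpci.stage.module = module;
    cpci.stage.pName  = "main";
    cpci.layout       = layout;

    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &cpci, nullptr, &pipeline);
    vkDestroyShaderModule(device, module, nullptr); // module not needed once the pipeline exists
    return pipeline;
}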

0cc4m commented 7 months ago

Interesting. Does it run afterwards or crash at some other point? If these changes fix the shader for you and cause no problems with other devices, you could open a PR. But yeah, it looks like a driver bug.

l3utterfly commented 7 months ago

After this change, all the pipelines compile without issue (there's a few similar cases which I edited in the q5 dequants).

Unfortunately, I'm running into an issue at inference time with the error vk::queue Error device lost.

I know it's vague; any idea what might be causing it?

I have not been able to narrow down the follow-up issue. Is it caused simply by taking too much time during inference, or by a memory access violation or some other error in the shader?

Additionally, after this change, the tinyllama model becomes incoherent when running with 4 layers offloaded to the GPU. Running with all layers offloaded gives the device lost error.

0cc4m commented 7 months ago

Sorry, I forgot to reply to you. DeviceLost is just a generic error meaning the driver failed/crashed. It wouldn't be caused by taking too long, but it's hard to tell what's going on without debugging the device directly.

If you check out #5301, you can set the build parameter LLAMA_VULKAN_VALIDATE to enable validation layers, which might show a problem. You can also set LLAMA_VULKAN_DEBUG to get a (very verbose) output of what the Vulkan backend is doing. This could tell you where it crashes. If it doesn't crash, but it's incoherent, then LLAMA_VULKAN_CHECK_RESULTS will try to run a check after each op and compare with the CPU result to see where it goes wrong.

l3utterfly commented 6 months ago

@0cc4m I did some more debugging and have made some progress. It seems that on Android devices there is no dedicated transfer queue.

I believe this is why the output becomes incoherent: the code currently assumes there is always a dedicated transfer queue, and crashes if there is not, so I adjusted it to allow the compute queue to be used as the transfer queue. I suspect the rest of the code does not account for this and lacks a suitable synchronisation mechanism, which would explain the incoherent output.

Could this be a possibility? I'm not too familiar with Vulkan shader coding.
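
To illustrate the fallback I mean (a rough sketch under my own assumptions, not the exact ggml-vulkan queue-selection code): the Vulkan spec says any queue family with the COMPUTE or GRAPHICS bit implicitly supports transfer operations, so when no dedicated transfer family is advertised we can simply reuse a compute-capable family.

// Sketch only: choose a queue family for transfers, falling back to a
// compute-capable family when no dedicated transfer family exists (the Adreno case).
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

uint32_t pick_transfer_family(VkPhysicalDevice phys) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, nullptr);
    std::vector<VkQueueFamilyProperties> props(count);
    vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, props.data());

    // Preferred: a dedicated transfer-only family.
    for (uint32_t i = 0; i < count; ++i) {
        if ((props[i].queueFlags & VK_QUEUE_TRANSFER_BIT) &&
            !(props[i].queueFlags & (VK_QUEUE_COMPUTE_BIT | VK_QUEUE_GRAPHICS_BIT))) {
            return i;
        }
    }
    // Fallback: a compute family; transfer support is implied by the spec
    // even if the TRANSFER bit is not advertised.
    for (uint32_t i = 0; i < count; ++i) {
        if (props[i].queueFlags & VK_QUEUE_COMPUTE_BIT) {
            return i;
        }
    }
    return 0; // last resort
}

Of course, if compute and transfer end up on the same queue, the work that used to run on separate queues now needs barriers/semaphores between submissions, which is exactly the synchronisation question I am asking about.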

l3utterfly commented 6 months ago

To add to the above: when I run with LLAMA_VULKAN_CHECK_RESULTS, there is no Error device lost and the output is coherent. Running without it gives the device lost error. I think it's because the CPU checking code is forcing a sync.

l3utterfly commented 6 months ago

@0cc4m another finding after I did some further debugging.

If I set last_node to true in ggml_vk_build_graph like this:

// Force context reset on each node so that each tensor ends up in its own context
// and can be run and compared to its CPU equivalent separately
last_node = true;

The shaders do not crash, even with LLAMA_VULKAN_CHECK_RESULTS off. I am still trying to understand the code. Does this information shed any light on this error? Appreciate any insight you can give me here.

0cc4m commented 6 months ago

Interesting. That means that putting the whole graph into one command buffer (which is definitely good for speed on desktop GPUs) is too much for your driver, for whatever reason. Maybe we should add an environment variable or compile parameter to limit the number of ops per command buffer?
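
Something like this is what I have in mind (purely a sketch; the GGML_VK_MAX_OPS_PER_CB variable name is hypothetical and does not exist anywhere yet):

// Sketch: cap the number of graph nodes recorded into one command buffer,
// controlled by an environment variable. A real patch would live inside
// ggml_vk_build_graph(), where last_node already forces a context split.
#include <climits>
#include <cstdlib>

static int vk_max_ops_per_cmd_buffer() {
    static int cached = 0;
    if (cached == 0) {
        const char * env = std::getenv("GGML_VK_MAX_OPS_PER_CB"); // hypothetical name
        cached = (env != nullptr) ? std::atoi(env) : INT_MAX;     // default: no limit
        if (cached <= 0) {
            cached = INT_MAX;
        }
    }
    return cached;
}

// Inside the graph-building loop (pseudocode):
//     ops_in_current_cb++;
//     if (ops_in_current_cb >= vk_max_ops_per_cmd_buffer()) {
//         last_node = true;      // close the current context/command buffer here
//         ops_in_current_cb = 0;
//     }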

akingoverlook commented 6 months ago

There are many things going wrong with the Vulkan backend for Adreno (all of them).

  1. They do not advertise the TRANSFER bit on any queue, which breaks the device detection logic. Easy enough to fix, as by all accounts that bit can be implied.
  2. The dequant_q4_K and dequant_q5_K will choke the driver, it will barf with UNKNOWN error. Not a huge loss either, they can be skipped. Why just those 2 shaders, is hard to say. Maybe they were compiled differently (Adreno used to only like one of the compilers), or maybe they use some Uniform class objects (Qualcomm's Vulkan does not support "uniformAndStorageBuffer{8|16}BitAccess" - which of course also breaks the Kompute backend).
  3. The maximum allocation limit is just 1GB, even on the Adreno 750. This is in contrast to probably every other platform with UMA - even ARM Mali allows full RAM size. The result is, you can only raise your ngl so much ... it will work until you give it 12-14 layers, depending on the model size and quantization. Above that it will die with DEVICE_LOST. Of course, it will be slow with a split.
  4. The suggestion above with last_node = true does fix the DEVICE_LOST. But it runs at a snail's pace of 0.5 tokens/s.
  5. On the bright side, the output is (usually) coherent though the backend does fail a bunch of correctness tests, even on other platforms (e.g., older NVIDIA).

It is quite clear that Qualcomm drivers were never meant to be used in this fashion and need some improvements. Maybe something can be done in the backend to sidestep those issues.

l3utterfly commented 6 months ago

2. The dequant_q4_K and dequant_q5_K will choke the driver, it will barf with UNKNOWN error. Not a huge loss either, they can be skipped. Why just those 2 shaders, is hard to say. Maybe they were compiled differently (Adreno used to only like one of the compilers), or maybe they use some Uniform class objects (Qualcomm's Vulkan does not support "uniformAndStorageBuffer{8|16}BitAccess" - which of course also breaks the Kompute backend).

The workaround above (https://github.com/ggerganov/llama.cpp/issues/5186#issue-2104800300) apparently fixes it, so at least it does not crash at compile time. Any idea what's going on?

akingoverlook commented 6 months ago

Perhaps I was not clear enough. I don't think anything has really "fixed" it, at least not in master. It isn't a compile-time problem: ggml_vk_create_pipeline() fails with the infamous -13 error coming from the underlying Vulkan API, just on those two shaders, and just on Adreno.

I also got someone to run a test on MTK D9300 for me (that would be top of the line ARM Immortalis GPU). It did run, as far as handling a basic prompt goes, but still slower than CPU, and still crashed with DEVICE_LOST on llama-bench.

This leads me to think that the (relatively) low memory bandwidth of the mobile chipsets is probably killing any performance advantage of GPUs. Attempts to offload any ops out of CPU just make things worse due to overhead involved in the setup and synchronization.

Would be interesting to try the kompute backend (which is supposedly offloading the entire graph), but it wants things that aren't supported on any of the mobile chipsets.

akingoverlook commented 6 months ago

Interesting. That means that putting the whole graph into one command buffer (which is definitely good for speed on desktop GPUs) is too much for your driver, for whatever reason. Maybe we should add an environment variable or compile parameter to limit the number of ops per command buffer?

I have played around with limiting the number of ops for submission, and it certainly makes progress. Many small (2B/3B/4B) models can run (with number of ops per buffer capped around 100), and though the inference speed is still below CPU, at least it is in the same ballpark. Larger models still usually die with DEVICE_LOST from Vulkan, and it isn't exactly clear how to select the correct limit - some models can run with a higher one, others will crash with a lower one.

I think the "whatever reason" is clearly related to the Vulkan max allocation size. Reducing the number of offloaded layers has the same effect as limiting the number of ops per buffer, so it makes sense that the backend also works when ngl is set low enough.

Perhaps Adreno does not like command buffers that contain tensors that span more memory than the max allocation size. At least intuitively it feels that way from all the experimentation. This is probably a unique scenario, since with other GPUs you either have a dedicated VRAM that needs to be large enough to fit the selected number of layers, or in the case of UMA, they have a max allocation size that is useful to fully offload at least small models (that seems to be the case for Intel and ARM GPUs). Adreno is the only one that is UMA with a small max allocation (1GB) and probably needs specific treatment, like METAL.
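
For reference, the 1GB figure can be confirmed at runtime through VkPhysicalDeviceMaintenance3Properties; a minimal standalone query (not backend code) looks like this:

// Query the per-allocation limit that Adreno reports as roughly 1GB.
#include <vulkan/vulkan.h>
#include <cstdio>

void print_max_allocation(VkPhysicalDevice phys) {
    VkPhysicalDeviceMaintenance3Properties maint3 = {};
    maint3.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &maint3;

    vkGetPhysicalDeviceProperties2(phys, &props2); // core in Vulkan 1.1
    std::printf("maxMemoryAllocationSize: %llu bytes\n",
                (unsigned long long) maint3.maxMemoryAllocationSize);
}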

l3utterfly commented 6 months ago

Can you share the change where you can set the number of Ops per buffer? I would love to test this on my device as well

akingoverlook commented 6 months ago

Can you share the change where you can set the number of Ops per buffer? I would love to test this on my device as well

This is very crude and probably wrong on many levels. The trick to getting some semi-decent TG rate was to not always restrict the buffer, but only when it matches some criteria. To get some idea of the criteria, I first started dumping the ops and counting their numbers and associated buffer sizes. I suspect that I am counting it wrong, but by accident or luck there was some semblance of a pattern - shit seems to hit the fan when the sum of buffer sizes gets too close to the total heap size (your total available RAM size, or check what vulkaninfo reports). Trying to do something more sophisticated killed my evening but did not produce anything more useful. This at least gets you some small models like phi, stablelm, gemma, to experiment with.

This whole thing goes in place of just "last_node = true" ;)

#define VK_VENDOR_ID_QUALCOMM 0x5143
uint32_t g_node_count = 0;
uint64_t g_buffer_size = 0;
uint64_t g_guard_size = 0;
uint64_t g_total_heap = 0;
uint64_t g_free_heap = 0;

    // Adreno has a limited maxMemoryAllocation (1GB) and will die when too many layers are offloaded
    if (ctx->device.lock()->vendor_id == VK_VENDOR_ID_QUALCOMM) {
        g_node_count++;
        //std::cerr << "tensor: " << node->name << ", op: " << ggml_op_name(node->op) << "[" << ggml_nbytes(node) << "]" << std::endl;
        g_buffer_size += ggml_backend_buffer_get_size(node->buffer);
        if (last_node) {
            //std::cerr << "*** VK tensors [" << g_node_count << ", " << g_buffer_size << "] ***" << std::endl;
            g_node_count = 0;
            g_buffer_size = 0;
            g_guard_size = 0;
        }
        if (g_guard_size == 0) {
            g_guard_size = ggml_backend_buffer_get_size(node->buffer) * 2; // just some margin
        }
        if (g_total_heap == 0) {
            ggml_backend_vk_get_device_memory(0, &g_free_heap, &g_total_heap);
        }
        if (g_buffer_size + g_guard_size >= g_total_heap) {
            //std::cerr << "*** VK tensors [" << g_node_count << ", " << g_buffer_size << "] ***" << std::endl;
            last_node = true;
            g_node_count = 0;
            g_buffer_size = 0;
            g_guard_size = 0;
        }
    }
woachk commented 5 months ago

Something quite odd is that I definitely got much higher perf than CPU on Adreno 690 (Snapdragon 8cx Gen 3) using the Kompute backend on Windows. That said, I still had to do a number of hacks there, especially as the QCOM driver only pretends to be Vulkan 1.1

akingoverlook commented 5 months ago

Something quite odd is that I definitely got much higher perf than CPU on Adreno 690 (Snapdragon 8cx Gen 3) using the Kompute backend on Windows. That said, I still had to do a number of hacks there, especially as the QCOM driver only pretends to be Vulkan 1.1

The "cx" flavor is not the same thing at all as the "regular" SD 8 Gen 3, and uses a different SW baseline. The Vulkan driver must be different, because the "regular" one does not even pass the compatibility checks with the Kompute backend (it lacks UniformAndStorageBuffer support). Unless something has changed lately.

Can you post output of vulkaninfo from the 8cx using this script:

vulkaninfo | grep -e deviceName -e deviceID -e vendorID -e apiVersion

vulkaninfo | grep -e shaderFloat16 -e shaderInt8 -e shaderInt16 \
        -e storageBuffer -e uniformAndStorageBuffer \
        -e maxMemoryAllocationSize -e maxComputeSharedMemorySize \
        -e minSubgroupSize -e maxSubgroupSize \
        -e queueFlags

vulkaninfo | grep -A1 -e memoryHeaps
woachk commented 5 months ago

@akingoverlook I use a hacked up version of the Kompute backend w/ most of the feature checks skipped.

8cx Gen 3 uses a GPU that is a variant of what came in the Snapdragon 888 (but made significantly bigger) - it's the swan song of the Adreno 6xx architecture line before Snapdragon 8 Gen 1 switched to Adreno 7xx. And X Elite uses a variant of the Snapdragon 8 Gen 2 GPU architecture.

https://gist.github.com/woachk/36d2c7ffd3e7b08c38800cbfcea044ea - note that the Vulkan driver binaries on Windows are unified between a6xx and a7xx

woachk commented 5 months ago

And it turns out that the CPU is much faster - when running in WSL - but not on Windows directly. Will have to see what's going on there in llama.cpp to show that behavior.

akingoverlook commented 5 months ago

@akingoverlook I use a hacked up version of the Kompute backend w/ most of the feature checks skipped.

8cx Gen 3 uses a GPU that is a variant of what came in the Snapdragon 888 (but made significantly bigger) - it's the swan song of the Adreno 6xx architecture line before Snapdragon 8 Gen 1 switched to Adreno 7xx. And X Elite uses a variant of the Snapdragon 8 Gen 2 GPU architecture.

https://gist.github.com/woachk/36d2c7ffd3e7b08c38800cbfcea044ea - note that the Vulkan driver binaries on Windows are unified between a6xx and a7xx

I don't know if it is safe to just skip those checks. Someone did put them in for a reason, so if you don't have UnifiedStorageAndBuffer but the backend actually relies on that, you are bound to run into something weird later.

Unless you also hacked up the kompute shaders to not need that?

woachk commented 5 months ago

@akingoverlook FP16 inference still works w/ the checks skipped. q4_0 ones definitely break, but that's something I'm planning to look at later hopefully. (shaderInt8 is exposed but int8 storage buffers are not). With the Kompute backend (the other one doesn't work):

Perf numbers using: main.exe -ngl 999 -m D:\Downloads\ggml-model-f16.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

llama_print_timings:        load time =    4777.78 ms
llama_print_timings:      sample time =     338.93 ms /   400 runs   (    0.85 ms per token,  1180.20 tokens per second)
llama_print_timings: prompt eval time =    1207.33 ms /    19 tokens (   63.54 ms per token,    15.74 tokens per second)
llama_print_timings:        eval time =   52685.29 ms /   399 runs   (  132.04 ms per token,     7.57 tokens per second)
llama_print_timings:       total time =   57399.03 ms /   418 tokens
Log end

only f16 works there, not q4_0 (tested with https://huggingface.co/ggml-org/models/blob/main/tinyllama-1.1b/ggml-model-f16.gguf)

~7.5 tok/sec @ 1.1b

akingoverlook commented 5 months ago

@akingoverlook FP16 inference still works w/ the checks skipped. q4_0 ones definitely break, but that's something I'm planning to look at later hopefully. (shaderInt8 is exposed but int8 storage buffers are not). With the Kompute backend (the other one doesn't work):

Perf numbers using: main.exe -ngl 999 -m D:\Downloads\ggml-model-f16.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

llama_print_timings:        load time =    4777.78 ms
llama_print_timings:      sample time =     338.93 ms /   400 runs   (    0.85 ms per token,  1180.20 tokens per second)
llama_print_timings: prompt eval time =    1207.33 ms /    19 tokens (   63.54 ms per token,    15.74 tokens per second)
llama_print_timings:        eval time =   52685.29 ms /   399 runs   (  132.04 ms per token,     7.57 tokens per second)
llama_print_timings:       total time =   57399.03 ms /   418 tokens
Log end

only f16 works there, not q4_0 (tested with https://huggingface.co/ggml-org/models/blob/main/tinyllama-1.1b/ggml-model-f16.gguf)

~7.5 tok/sec @ 1.1b

Wow, those numbers look good, but then I realize it is only tinyllama, and I can run gemma-2b on CPU faster than that ;) Now I am intrigued. Your vulkaninfo output is a surprise too: it is v1.1, so it does not even report some of those fields. I am not sure where to look for the max allocation size there; do you know what it is?

The "regular" SD8 is v1.3 but stuck with 1GB max allocation, and it looks like the regular vulkan backend is hopelessly broken with that, so I was not holding much hope for the kompute either.

woachk commented 5 months ago

@akingoverlook FP16 inference still works w/ the checks skipped. q4_0 ones definitely break, but that's something I'm planning to look at later hopefully. (shaderInt8 is exposed but int8 storage buffers are not). With the Kompute backend (the other one doesn't work): Perf numbers using: main.exe -ngl 999 -m D:\Downloads\ggml-model-f16.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

llama_print_timings:        load time =    4777.78 ms
llama_print_timings:      sample time =     338.93 ms /   400 runs   (    0.85 ms per token,  1180.20 tokens per second)
llama_print_timings: prompt eval time =    1207.33 ms /    19 tokens (   63.54 ms per token,    15.74 tokens per second)
llama_print_timings:        eval time =   52685.29 ms /   399 runs   (  132.04 ms per token,     7.57 tokens per second)
llama_print_timings:       total time =   57399.03 ms /   418 tokens
Log end

only f16 works there, not q4_0 (tested with https://huggingface.co/ggml-org/models/blob/main/tinyllama-1.1b/ggml-model-f16.gguf) ~7.5 tok/sec @ 1.1b

Wow, those numbers look unbelievably good, but then I realize it is only tinyllama ;) Now I am intrigued. Your vulkaninfo output is a surprise too, it is v1.1,

Yeah, the Qualcomm driver reports 1.1 when running on Adreno 6xx GPUs, and 1.3 on 7xx. For the very same driver binary.

so it does not even report some of those fields. I am not sure where to look for the max allocation size there, do you know what it is?

Didn't test with Vulkan yet, but D3D12 has a quite high alloc limit - was able to allocate a contiguous 4GB just fine.

The "regular" SD8 is v1.3 but stuck with 1GB max allocation, and it looks like the regular vulkan backend is hopelessly broken with that, so I was not holding much hope for the kompute either.

I think that Kompute should be what's tested in this case - with limited attention on the other Vulkan backend - as that doesn't seem to interact properly with QCOM hw.

The changes that I applied to the Kompute repo:

diff --git a/src/Manager.cpp b/src/Manager.cpp
index 0c588e1..71bcd00 100644
--- a/src/Manager.cpp
+++ b/src/Manager.cpp
@@ -420,13 +420,10 @@ Manager::createDevice(const std::vector<uint32_t>& familyQueueIndices,
     features.shaderInt16 = true;

     vk::PhysicalDeviceVulkan11Features features11;
-    features11.uniformAndStorageBuffer16BitAccess = true;
     features11.storageBuffer16BitAccess = true;
     features11.pNext = nullptr;

     vk::PhysicalDeviceVulkan12Features features12;
-    features12.storageBuffer8BitAccess = true;
-    features12.uniformAndStorageBuffer8BitAccess = true;
     features12.shaderFloat16 = true;
     features12.shaderInt8 = true;
     features12.pNext = &features11;

and the same thing (with also lowering the Vulkan version check to 1.1) on the llama.cpp side.

If running on older QCOM drivers you might also want to switch ${glslc_executable} --target-env=vulkan1.2 to 1.1, but newer drivers accept 1.2/1.3 shaders even when the driver pretends to be an 1.1 one.

woachk commented 5 months ago

As a side note, the 1GB per alloc should be able to be worked around through doing multiple allocations right? If yes, it shouldn't be a problem?

akingoverlook commented 5 months ago

As a side note, the 1GB per alloc should be able to be worked around through doing multiple allocations right? If yes, it shouldn't be a problem?

In theory, yes, and the Vulkan backend author believes that is already handled. But it dies (with a Vulkan DeviceLost error) any time you try to offload too many layers. I have even tried the newly introduced sharding feature, and it still dies the same way, so there is something deeper going on.
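
The chunking idea itself is simple enough; as a bare sketch (ignoring the alignment, memory-type selection and buffer binding a real implementation needs), it amounts to this:

// Sketch: split one logical allocation across several VkDeviceMemory objects,
// each no larger than the device's maxMemoryAllocationSize.
#include <vulkan/vulkan.h>
#include <algorithm>
#include <vector>

std::vector<VkDeviceMemory> alloc_chunked(VkDevice dev, uint32_t mem_type_index,
                                          VkDeviceSize total, VkDeviceSize max_alloc) {
    std::vector<VkDeviceMemory> chunks;
    for (VkDeviceSize off = 0; off < total; off += max_alloc) {
        VkMemoryAllocateInfo info = {};
        info.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
        info.allocationSize  = std::min(max_alloc, total - off);
        info.memoryTypeIndex = mem_type_index;

        VkDeviceMemory mem = VK_NULL_HANDLE;
        if (vkAllocateMemory(dev, &info, nullptr, &mem) != VK_SUCCESS) {
            break; // a real implementation would clean up and report the failure
        }
        chunks.push_back(mem);
    }
    return chunks;
}

The hard part is everything around it: tensors that straddle chunk boundaries, descriptor updates, and keeping the command buffers within whatever the driver can actually digest, which is where it seems to fall over in practice.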

I know the vulkan backend does work on Mali chips, and the only notable differences in the vulkaninfo are the max allocation size and subgroup size, so my suspicion is on the max allocation size. Oddly, the kompute backend would not work with Mali due to their small subgroup size ;)

woachk commented 5 months ago

Hm, I believe that it shouldn't be that hard to fix up the Kompute backend for Mali. But even then from what I see so far, looks like Vulkan is quite suboptimal for ML inference. Or at the very least, that different shaders depending on feature support or vendor extensions will be needed.

akingoverlook commented 5 months ago

Hm, I believe that it shouldn't be that hard to fix up the Kompute backend for Mali. But even then from what I see so far, looks like Vulkan is quite suboptimal for ML inference. Or at the very least, that different shaders depending on feature support or vendor extensions will be needed.

Vulkan is just a low-level GPU framework. It isn't optimal or sub-optimal on its own - that is a matter of writing the shaders in a way that works best for the given HW. Vulkan does promise functional portability, but not performance portability. Most shaders aren't written for mobile GPUs, which tend to have numerous and very specific performance tricks to follow. It gets complex enough that you really need specific kernels/shaders written for them. Kompute is a layer sitting above Vulkan, so it does not magically solve that either.

Qualcomm at one point (couple years ago) decided that instead of hoping that everyone would read their documentation and write the shaders in their recommended way, they would just create a vendor OpenCL extension (CLML) that directly implements common ML functions in a way that is optimal for Adreno.

That approach also has a problem of keeping up with the ML field evolving too quickly. New quantization schemes get invented and the old libraries aren't compatible. Apparently TVM was using CLML at some point, but looks like QC itself just rewrote their kernels for TVM again recently.

Why they are focusing on OpenCL and not Vulkan, I don't know. Probably because Vulkan is just pain, and Kompute is too new and not that well established yet. But then again, they aren't really focusing on the GPU at all; they prefer to use their Hexagon DSP (which they also call HTP or NPU). Which has its own bucket of problems, of course, but at least it does not make your UI stutter ;)

abasgames commented 4 months ago

Is there any difference when using the Turnip driver inside Termux?

woachk commented 4 months ago

@abasgames Turnip doesn't seem to expose shaderInt8 and friends, at least here...

woachk commented 3 months ago

Getting further with removing the dependency on 8-bit storage buffers...

0cc4m commented 3 months ago

@woachk On the Vulkan or the Kompute backend?

warren-lei commented 1 month ago

I have the same issue while running llama.cpp with Vulkan on a Mali G78. I posted my issue here: Running Llama using VULKAN on an Arm Mali-G78AE GPU, the program hangs while waiting for a fence until it terminates after throwing an instance of 'vk::DeviceLostError'

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.