ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Add support for accelerating with QNN on Windows on ARM #7541

Closed hmartinez82 closed 2 months ago

hmartinez82 commented 4 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Feature Description

Please add support for accelerating with Qualcomm QNN on Windows.

Motivation

Every Windows on ARM laptop since the Surface Pro X has had an NPU. It's not the shiny new 40+ TOPS NPU that the Copilot+ PCs have, but it's fast enough for llama.cpp with certain models. For instance, the Snapdragon 8cx Gen 3 has a 15 TOPS NPU that supports operators useful for accelerating local inference (like MATMUL) which the CPU lacks. QNN would be blazing fast on Copilot+ PCs too.

Possible Implementation

The QNN SDK is freely available at the Qualcomm Developer website (https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) as Qualcomm AI Engine Direct SDK.

wcwong commented 4 months ago

#5079 is related, and I'd agree it would be great to have NPU support on the systems that have one.

Microsoft is pushing DirectML.

hmartinez82 commented 4 months ago

@ggerganov Sorry for the trivial question, but the QNN backend doesn't support tensors with different dimensions (this is stated explicitly in their docs). Is this a mandatory requirement of llama.cpp, or does it vary by model?

ggerganov commented 4 months ago

QNN backend doesn't support tensors with different dimensions

How come? Pretty much all tensors have different dimensions.

hmartinez82 commented 4 months ago

I'm wondering if I interpreted this wrong @ggerganov . Look for the QNN_PROPERTY_TENSOR_SUPPORT_DYNAMIC_DIMENSIONS capability at https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/supported_capabilities.html
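For reference, a minimal sketch of how that capability could be queried through the SDK's property API. The function and constant names are taken from the QNN SDK headers/docs linked above; exact signatures, return constants, and linking details (direct link vs. resolving the interface via QnnInterface_getProviders) may differ between SDK versions and backends, so treat this as an assumption-laden illustration rather than working code:

```c
// Hedged sketch: ask the loaded QNN backend (e.g. QnnHtp.dll on Windows on ARM)
// whether it reports support for dynamic tensor dimensions.
// Assumes the QNN SDK include dir is on the path and the backend library is linked.
#include <stdio.h>
#include "QnnProperty.h"   // declares QnnProperty_hasCapability and the capability keys

int main(void) {
    Qnn_ErrorHandle_t rc =
        QnnProperty_hasCapability(QNN_PROPERTY_TENSOR_SUPPORT_DYNAMIC_DIMENSIONS);

    if (rc == QNN_PROPERTY_SUPPORTED) {
        printf("backend supports dynamic tensor dimensions\n");
    } else if (rc == QNN_PROPERTY_NOT_SUPPORTED) {
        printf("backend does NOT support dynamic tensor dimensions\n");
    } else {
        // key unknown to this backend / SDK version
        printf("capability not reported (rc = %llu)\n", (unsigned long long) rc);
    }
    return 0;
}
```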

ggerganov commented 4 months ago

I see. I'm not sure about the full implications, but I know that certain hardware has no or poor support for computations in which the tensor shapes change after each call. This is the case for Transformer-based LLMs, because some of the tensor shapes grow with the number of tokens in the context. In contrast, CNNs used in computer vision, for example, usually have static shapes for any kind of input.
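To make that concrete, here is a rough sketch using ggml's public API of where the shapes change. The dimension names and sizes are illustrative, not the exact graph llama.cpp builds:

```c
// Hedged illustration of the dynamic-shape issue with ggml (ggml.h).
// The attention-score matmul involves K spanning the whole KV cache, so the
// graph's tensor shapes grow on every decode step as the context fills up.
#include "ggml.h"

struct ggml_tensor * build_attn_scores(struct ggml_context * ctx,
                                       int head_dim, int n_tokens, int n_past) {
    // Q: queries for the new tokens only            -> [head_dim, n_tokens]
    struct ggml_tensor * q = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, head_dim, n_tokens);
    // K: keys for all tokens seen so far (KV cache) -> [head_dim, n_past + n_tokens]
    struct ggml_tensor * k = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, head_dim, n_past + n_tokens);
    // scores = K^T * Q -> [n_past + n_tokens, n_tokens]: the shape grows every call
    return ggml_mul_mat(ctx, k, q);
}
```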

There are tricks we can do to overcome this limitation, but they would make general support for this hardware more difficult, more customized, and in the realm of "proof-of-concept". Again, I'm not really familiar with the details - it's best if the people working on this analyze the limitations and propose what to do.
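As an illustration of the kind of trick meant above (an assumption-laden sketch, not a concrete proposal): the graph could be built with shapes padded to the maximum context length and unused positions masked out, so a static-shape backend always sees the same tensor sizes. This trades extra memory and compute for shape stability, and in a fully static graph the batch dimension (n_tokens) would also need to be padded to a fixed size:

```c
// Hedged sketch of a static-shape variant of the previous example.
// K always spans n_ctx_max and a mask (expected shape [n_ctx_max, n_tokens],
// -INF at unused slots) hides the positions that are not yet populated.
#include "ggml.h"

struct ggml_tensor * build_attn_scores_static(struct ggml_context * ctx,
                                              int head_dim, int n_tokens, int n_ctx_max,
                                              struct ggml_tensor * kq_mask) {
    struct ggml_tensor * q = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, head_dim, n_tokens);
    // K covers the full context window regardless of how many tokens are cached
    struct ggml_tensor * k = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, head_dim, n_ctx_max);
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);  // [n_ctx_max, n_tokens] - static shape
    return ggml_add(ctx, kq, kq_mask);                  // masked scores, ready for softmax
}
```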

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.