ggerganov / llama.cpp

LLM inference in C/C++
MIT License

ggml : add DirectML backend #7772

Open ggerganov opened 1 month ago

ggerganov commented 1 month ago

It seems like DirectML supports the upcoming NPU-enabled chips for Windows machines: https://devblogs.microsoft.com/directx/introducing-neural-processor-unit-npu-support-in-directml-developer-preview/

I don't think there is any other way to tap into this hardware, so we should explore whether it is possible to add this library as a backend in ggml in order to run stuff on the NPUs. There has been some semi-related work in the past that combined ggml and Direct3D: https://github.com/Const-me/Whisper. Not sure if it is relevant at all, maybe just as an inspiration.

arch-btw commented 1 month ago

Great idea, it looks like a lot of the upcoming AI hardware is going to have NPUs.

slaren commented 1 month ago

I am not convinced that a DirectML backend is possible: the operators are too high-level, and new ones cannot be added. This means that we cannot implement a matrix multiplication operator that supports our quant formats. It might be possible to do it with DirectX 12 shaders, but at that point it would be a DirectX 12 backend more than a DirectML backend. It would not allow using ONNX models regardless.

ggerganov commented 1 month ago

Would using DirectX 12 shaders allow us to run stuff on the NPU? I suppose not, but just making sure. The main point of a potential DirectML backend would be to utilize the NPU. If it is too high-level (i.e. something like what CoreML is on Apple Silicon), then I agree it is neither worthwhile nor possible to add support for it.

slaren commented 1 month ago

I am not sure it is possible to create custom NPU kernels at all. https://github.com/openvinotoolkit/npu_plugin seems to contain a compiler for the Intel NPU, but it is not clear whether it is complete, and they have removed the source of the kernels that should be located in https://github.com/openvinotoolkit/npu_plugin/tree/develop/sw_runtime_kernels, leaving only the binary blobs.

sinni800 commented 1 day ago

Interesting, but they have a PyTorch implementation. I thought PyTorch was fairly broad in what it can support, though I don't have much insight into the underlying details here. Or does PyTorch automatically fall back to the CPU for whatever DirectML doesn't support?