ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Add GPU support to ggml #914

Closed: ggerganov closed this issue 1 year ago

ggerganov commented 1 year ago

Intro

This issue is more suitable for the https://github.com/ggerganov/ggml repo, but adding it here for more visibility.

First, I don't see us adding a tightly integrated GPU framework to ggml anytime soon, because it usually comes with significant maintenance drawbacks, architecture changes, and other issues. However, there is an alternative approach that might be relatively easy to implement, and I think it would be a very cool way for new developers to join in and help.

Description

ggml produces computation graphs, which are basically directed acyclic graphs (DAGs) that can be easily exported, iterated over, etc. A graph contains the information about all the tensor operations and buffers needed to evaluate the model. The idea is to first add basic ggml functionality for exporting the graphs in some trivial text format, which a separate ggml tool can then parse as a second step. With the exported graphs in hand, one can process them and construct hardware-specific code for evaluating them.
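
As a very rough illustration (this exporter does not exist in ggml today), such a tool could simply walk the finished graph and dump one line per node. The sketch below assumes the graph fields and helpers found in today's ggml.h (n_nodes, nodes, ggml_op_name, ggml_type_name, ggml_nbytes); the text format itself is made up:

```c
// Hypothetical sketch of a plain-text graph exporter (not part of ggml).
// It relies only on the public ggml_cgraph fields (n_nodes, nodes) and the
// ggml_op_name() / ggml_type_name() / ggml_nbytes() helpers from ggml.h;
// the output format itself is invented for illustration.
#include <stdio.h>
#include <inttypes.h>

#include "ggml.h"

static void export_graph_text(const struct ggml_cgraph * gf, FILE * out) {
    fprintf(out, "n_nodes %d\n", gf->n_nodes);

    for (int i = 0; i < gf->n_nodes; ++i) {
        const struct ggml_tensor * node = gf->nodes[i];

        // one line per node: index, op, type, shape, size in bytes
        fprintf(out, "%4d %-16s %-8s [%" PRId64 " %" PRId64 " %" PRId64 " %" PRId64 "] %zu\n",
            i,
            ggml_op_name(node->op),
            ggml_type_name(node->type),
            node->ne[0], node->ne[1], node->ne[2], node->ne[3],
            ggml_nbytes(node));
    }
}
```

A backend-specific tool would then read lines like these back in and emit code for each op, as described next.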

For example, a ggml-cuda tool can parse the exported graph and construct the necessary CUDA kernels and GPU buffers to evaluate it on an NVIDIA GPU. Another tool, for example ggml-mps, can do similar stuff but with Metal Performance Shaders. Etc.

This approach preserves the cross-platform nature of ggml and allows custom hardware support via compiler-like translation of the exported computation graphs.

Still, the hardest part, implementing the respective kernels for each backend, remains the biggest obstacle.

I think this decoupled approach to the implementation would make the development process much easier and could potentially allow for some interesting optimizations. My biggest fear with adding a tightly integrated GPU backend to ggml is that I don't know the important details of supporting the respective backend, which could lead to bad software design decisions that in turn could negatively affect even the core CPU implementation. With the approach proposed in this issue, we eliminate this risk and allow multiple independent implementations to be provided without any negative side effects on the core ggml implementation.

Another cool thing about this idea is that there could be separate lead developers for each backend. So if you have good knowledge and understanding of a certain hardware architecture, you are one step away from initiating the kernel "translation" process and making a very significant contribution to the project.

Guiding principles

I don't know all the specifics of GPU code, but I believe one could try to adopt the fundamental principles of ggml. For example, there could be a single memory buffer allocated up front, with all the tensors placed within that buffer at certain offsets. Each graph operation would then correspond to a kernel that takes source tensors as input and writes to a destination tensor as output, all of them living inside that single memory buffer allocated at the start of execution.
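
A minimal sketch of what this could look like, using made-up names rather than any ggml or backend API: every tensor is described only by an offset (and size) inside one shared buffer, and each op becomes a kernel that resolves its source and destination pointers from those offsets.

```c
// Illustrative sketch of the single-buffer idea, with made-up names (this is
// not ggml API). All tensors live at fixed offsets inside one allocation made
// before execution; each graph op is a kernel that reads its sources and
// writes its destination inside that same buffer.
#include <stddef.h>
#include <stdint.h>

struct buf_tensor {
    size_t  offset; // byte offset into the shared buffer
    int64_t n;      // number of elements (flattened here for simplicity)
};

struct exec_plan {
    uint8_t          * buffer;      // single allocation, host or device memory
    size_t             buffer_size;
    struct buf_tensor  tensors[64]; // one entry per tensor in the graph
};

// every op kernel has the same shape: source tensors in, destination tensor out
static void kernel_add_f32(struct exec_plan * plan, int dst, int src0, int src1) {
    float       * d = (float       *)(plan->buffer + plan->tensors[dst].offset);
    const float * a = (const float *)(plan->buffer + plan->tensors[src0].offset);
    const float * b = (const float *)(plan->buffer + plan->tensors[src1].offset);

    for (int64_t i = 0; i < plan->tensors[dst].n; ++i) {
        d[i] = a[i] + b[i];
    }
}
```

A GPU backend would allocate the buffer on the device once and replace the loop with the corresponding kernel launch; the offsets and the per-op dispatch stay exactly the same.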

Additionally, I don't think we need to explicitly add 3rd-party dependencies (e.g. the CUDA SDK, OpenCL, etc.) to ggml to achieve this. The new ggml tools will simply generate code, which it will be up to the user to compile and run.

I've heard of the concept of "super-shaders" / "super-kernels"; this is probably something we should try to achieve.

Taking shortcuts and making custom hacks in favor of better performance is very welcome.

Why?

Currently, ggml is one of the few ML frameworks that provide efficient 4-bit quantization and demonstrate its effective application to transformer evaluation. The code is compact and easily comprehensible, with very little bloat. I think ggml has a slight edge in this regard compared to other general-purpose frameworks, and if we capitalize on it now, ggml has the potential to become a very respectable machine learning framework with a focus on on-device inference.

Links

clxyder commented 1 year ago

Would it be possible to use https://github.com/openai/triton/ to generate the backend-specific GPU code? From what I can tell, it generates the CUDA code for you.