
ggml : add WebGPU backend #7773

ggerganov commented 4 weeks ago

I hope this will be relatively easy to do since, AFAIK, WebGPU allows us to write kernels in a shader language, and we already have experience creating such backends.
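For reference, a WebGPU compute kernel written in WGSL could look roughly like this (a hypothetical element-wise add, embedded as a C++ raw string; not code from this repo or from the PR below):

// Rough sketch: a hypothetical element-wise add kernel in WGSL,
// embedded as a C++ raw string literal.
static const char * k_shader_add = R"(
@group(0) @binding(0) var<storage, read>       src0 : array<f32>;
@group(0) @binding(1) var<storage, read>       src1 : array<f32>;
@group(0) @binding(2) var<storage, read_write> dst  : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    dst[gid.x] = src0[gid.x] + src1[gid.x];
}
)";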

There has been some initial work in https://github.com/ggerganov/ggml/pull/585; it could be a useful starting point.

WenheLI commented 2 weeks ago

Hi! I'm interested in bringing this backend to ggml. Are there any materials that would help a newcomer ramp up quickly and start working on it?

ngxson commented 6 days ago

So I've been playing with a WebGPU implementation for a few days. I have a very minimal version with working buffer management and support for some simple ops.

My version is based on https://github.com/ggerganov/ggml/pull/585, but with some notable changes:

  1. Up-to-date ggml backend API
  2. Uses webgpu_cpp instead of plain C (requires C++17; see the sketch after this list)
  3. Emscripten-only for now
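Regarding point 2, allocating a storage buffer through webgpu_cpp looks roughly like this (a sketch assuming an already-initialized wgpu::Device named device; illustrative names, not the exact code of this backend):

#include <webgpu/webgpu_cpp.h>

// Sketch: allocate a GPU storage buffer that shaders can read/write
// and that host code can copy tensor data into and out of.
static wgpu::Buffer wgpu_alloc_storage_buffer(wgpu::Device & device, size_t size) {
    wgpu::BufferDescriptor desc = {};
    desc.size  = size;
    desc.usage = wgpu::BufferUsage::Storage |
                 wgpu::BufferUsage::CopySrc |
                 wgpu::BufferUsage::CopyDst;
    return device.CreateBuffer(&desc);
}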

However, I'm not very familiar with the ggml backend interface, so I have a question:

I made a test cgraph to test my implementation: https://github.com/ngxson/ggml_webgpu_dev/blob/a5fcc25c359b997869b8683ab485d1d3f96b37f9/main.cpp#L70

When calling ggml_gallocr_alloc_graph, I expected it to call buffer_type_alloc_buffer with enough memory for all nodes, but it turns out it only allocates memory for one node and then calls init_tensor for all nodes:

ggml_backend_wgpu_buffer_type_alloc_buffer: 256  ==> only enough memory for one node
storage_buffer_1: create with size=256
ggml_backend_wgpu_buffer_reset
ggml_backend_wgpu_buffer_init_tensor: node_0
storage_buffer_1: node_0, init to offset 0
ggml_backend_wgpu_buffer_init_tensor: node_1
storage_buffer_1: realloc to size=512           ==> not enough memory, we need to realloc
storage_buffer_1: node_1, init to offset 256
ggml_backend_wgpu_buffer_init_tensor: node_2

Here is my tensor_init function: https://github.com/ngxson/ggml_webgpu_dev/blob/a5fcc25c359b997869b8683ab485d1d3f96b37f9/ggml-wgpu.cpp#L195

@ggerganov Could you help me understand this part? Thank you.

slaren commented 6 days ago

If every tensor used in the graph needed to be allocated separately, the compute buffer would be several gigabytes even for the simplest models. The point of ggml-alloc is to minimize the size of the compute buffer by allocating tensors in the same memory locations when possible based on the order of evaluation of the graph. So this behavior is completely expected.
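As a concrete illustration (hypothetical sizes, not this exact graph):

// Hypothetical chain t1 = op(t0), t2 = op(t1), each tensor 256 bytes.
// ggml-alloc walks the graph in evaluation order and reuses dead slots:
//   t0 -> offset 0    (live while t1 is computed)
//   t1 -> offset 256  (t0 is still a src, so it cannot overlap)
//   t2 -> offset 0    (t0 is dead by now, so its slot is reused)
// Compute buffer: 512 bytes instead of 768 with separate allocations.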

I don't understand what you are trying to do with offset_table. You can calculate the offset of the tensor within the buffer by subtracting the base address returned by ggml_backend_wgpu_buffer_get_base from ggml_tensor::data.
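In code, that amounts to something like the following sketch (illustrative names, untested):

// Sketch: recover a tensor's offset inside the backend buffer from its
// data pointer and the buffer's base address; no offset_table needed.
static void ggml_backend_wgpu_buffer_init_tensor(
        ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
    const size_t offset = (const char *) tensor->data -
                          (const char *) ggml_backend_buffer_get_base(buffer);
    // ... record (buffer, offset) as the tensor's binding location ...
}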

ngxson commented 5 days ago

@slaren Thanks for the explanation.

So offset_table existed only because I didn't know the offset can be calculated as tensor->data - base. With that in mind, I removed offset_table and also removed the std::set<ggml_wgpu_buffer_context *> buffers that I used for tracking all allocated buffers.

I'm now running into another issue: both src and dst of result = ggml_div(ctx0, result, model.b) point to the same tensor:

Writable storage buffer binding aliasing found between [BindGroup "bind_group"] set at bind group index 0, binding index 0, and [BindGroup "bind_group"] set at bind group index 0, binding index 2, with overlapping ranges (offset: 0, size: 256) and (offset: 0, size: 256) in [Buffer "storage_buffer_1"].
 - While encoding [ComputePassEncoder (unlabeled)].DispatchWorkgroups(8, 1, 1).

I'm not sure how other backends handle this (and the _inplace variants in general). Do you have any clue?

slaren commented 5 days ago

ggml-alloc can make some operations automatically inplace, to save memory, if it determines that it is safe to do so. Other backends do not need to do anything special in this case; they just pass the same pointer for both the destination and the source. I am not sure why this is a problem for WebGPU; in the worst case it might require a separate version of the kernels for inplace operations, but there is probably some workaround possible.
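For example, one direction (a sketch with hypothetical helpers, untested) is to detect the aliasing when dispatching and route to a dedicated inplace kernel variant, so that WebGPU never sees two writable bindings covering the same buffer range:

// Sketch (hypothetical helpers, untested): route aliased operands to an
// inplace kernel variant instead of binding the same range twice.
static void wgpu_dispatch_binary_op(struct ggml_tensor * node) {
    if (node->src[0]->data == node->data) {
        // dst aliases src0: bind dst once, compute dst[i] = dst[i] OP src1[i]
        wgpu_dispatch_inplace(node);  // hypothetical
    } else {
        wgpu_dispatch_default(node);  // hypothetical
    }
}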