Closed · bfrasure · closed 1 year ago

The CUDA acceleration is very impressive. Does anyone know of any efforts to run this on the GPU cores of the M processors? I'd be willing to assist, but I'd rather not start from scratch if something exists.
You can see the author's previous notes on attempting to use the M1 GPU on GPT-J:
https://github.com/ggerganov/ggml/tree/master/examples/gpt-j#attempt-to-use-the-m1-gpu
Basically he found that with Apple's unified memory architecture, the bottleneck is memory bandwidth rather than pure compute.
@evanmiller This info is outdated (and likely wrong) - I am working on offloading the full computation on the M1 GPU with custom kernels and hope to do it properly this time around.
Can't wait for this, @ggerganov
I've been trying to understand more about MPS and I found a few resources that helped.
Philip Turner has been doing interesting work with Metal:
Also, I learned a lot about the limitations of MPS from this pytorch thread, titled "MPS device appears much slower than CPU on M1 Mac Pro." It's an old thread but it still has current activity:
https://github.com/pytorch/pytorch/issues/77799
I thought I would leave these links in case it helps someone else.
Context size 2048, 512 tokens, LLaMA 6.7B (3.9 GB)

Layers on Accelerate | Layers on CLBlast | Sequential Latency/Token | Bandwidth |
---|---|---|---|
32 | 0 | 40590 µs | 96 GB/s |
30 | 2 | 58150 µs | 67 GB/s |
28 | 4 | 76310 µs | 51 GB/s |
24 | 8 | 113460 µs | 34 GB/s |
16 | 16 | 198330 µs | 20 GB/s |
0 | 32 | 253050 µs | 15 GB/s |
Theoretically it should be able to utilize more bandwidth. I think I can make this an order of magnitude faster, but it will require a ton of tuning to align memory transactions and utilize ~376 GB/s of bandwidth. For long contexts, use triangular FlashAttention with dynamic work redistribution to keep the GPU cores fully utilized.
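As far as I can tell, the Bandwidth column above is just the model size divided by the measured per-token latency. A back-of-the-envelope sketch; the "weights are streamed once per generated token" assumption is mine, though it is the usual model for GEMV-style decoding:

```cpp
#include <cstdio>

int main() {
    // Effective bandwidth = bytes streamed per token / latency per token.
    // Assumes each generated token reads the full 3.9 GB of weights once
    // (GEMV-style decoding touches every weight per token).
    const double model_bytes = 3.9e9;
    const double latencies_us[] = {40590, 58150, 76310, 113460, 198330, 253050};
    for (double us : latencies_us) {
        double gbps = model_bytes / (us * 1e-6) / 1e9;
        std::printf("%8.0f µs/token -> %5.1f GB/s\n", us, gbps);
    }
    return 0;
}
```

Running this reproduces the column: 96, 67, 51, 34, 20, 15 GB/s.
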
Variant | Layers on Accelerate | Sequential Latency/Token | Bandwidth |
---|---|---|---|
6.7B | 32 | 40590 µs | 96 GB/s |
13.0B | 40 | 75370 µs | 103 GB/s |
32.5B | 60 | 173540 µs | 112 GB/s |
65.2B | 80 | 34842780 µs | 1 GB/s |
Also, has anyone considered lane compression, so I can run 65B-q4 on 32 GB RAM with reasonable speed?
I'm trying to fill in this table. Next is the PyTorch MPS fork of llama.cpp that's slower than CPU.
Latency per 512 tokens:
LLaMA | 6.7B | 13.0B | 32.5B | 65.2B |
---|---|---|---|---|
PyTorch | | | | |
llama.cpp | 20.8 s | 38.6 s | 88.9 s | 17839.5 s |
MLC LLM | | | | |
MPSGraph | | | | |
Metal FlashAttention | | | | |
Theoretical Lower Bound | 4.4 s | 8.6 s | 21.6 s | 43.3 s |
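A guess at where the lower-bound row comes from: the time to stream the weights once per token at peak bandwidth. The ~450 GB/s figure and the q4 file sizes below are my assumptions, back-solved to match the row, not numbers stated in the thread:

```cpp
#include <cstdio>

int main() {
    // Memory-bound lower bound for autoregressive decoding: each token must
    // stream the full weight buffer at least once, so
    //   latency_per_token >= weight_bytes / peak_bandwidth.
    const double peak_bw = 450e9; // bytes/s -- assumed, not measured
    const double weights_gb[] = {3.9, 7.6, 19.0, 38.1}; // approximate q4 sizes
    const char*  names[]      = {"6.7B", "13.0B", "32.5B", "65.2B"};
    for (int i = 0; i < 4; i++) {
        double seconds = weights_gb[i] * 1e9 / peak_bw * 512; // per 512 tokens
        std::printf("%-6s lower bound ~ %4.1f s / 512 tokens\n", names[i], seconds);
    }
    return 0;
}
```

With those assumptions the output lands on 4.4, 8.6, 21.6, and 43.3 seconds.
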
Very happy to see @philipturner in this thread. @ggerganov, philipturner is an expert on MPS, the M1 architecture, and GPU computation in general. I believe there is a lot of hard-won esoteric knowledge when it comes to optimizing for this architecture. So excited to see where this leads.
I want an honest comparison. Please make llama.cpp as fast as possible; I think I can make something faster: an open-source successor to MPS.
> optimizing for this architecture
Why not AMD and Intel too?
@philipturner have you taken a look at https://github.com/mlc-ai/web-llm and especially https://github.com/mlc-ai/mlc-llm? It can run on the Apple Silicon GPU pretty fast using WebGPU. I tested it and it seems as fast as llama.cpp, but with no heat.
> have you taken a look at https://github.com/mlc-ai/web-llm
>
> I tested it and it seems as fast as llama.cpp, but with no heat
Can you quantify how much faster?
@philipturner yes, MLC LLM is as fast as llama.cpp, like I said above
I suspect it might be slightly slower, just like PyTorch MPS. It's one thing to use the GPU. It's another to use it right.
If you use Metal the way it's designed, it should not be 10% slower than CPU, but 300% faster.
This is why I asked specifically for a number. Is it 0% faster? 50% faster? 50% slower?
It's about the same 😄
I was surprised as well, because I had already read about your work. Btw, I'm using an M2 Pro with 16 GB and the latest Ventura; I don't know if that makes a difference. Have you tried it, and was the result slower than llama.cpp?
Using M1?
I have not benchmarked it yet. What speed are you getting on CPU with llama.cpp? Try reporting the ms/token with `-c 2048 -n 512`. Run about 10 times, and report the fastest trial. Also exclude trials where it finishes early (before 450 tokens).
After we can quantify LLaMa.cpp on your device, I'll find a way to get a numerical measurement from your MLC experience.
To explain the benefit of using Metal correctly: you're talking about jumps from the red curve to the cyan curve. LLMs are smaller problems (1000 atoms -> 1000 tokens) and memory/latency bounded. For big stuff like Stable Diffusion, you can grossly misuse the `MTLCommandQueue` (ahem, PyTorch), but the problem's so big, the compute bottleneck is even larger.
For llama.cpp, I'm now using 13B models, so it's kinda slower than the 7B model that I use for MLC. Yes, MLC doesn't have any speed displayed, but it really goes brrr 😄
Before using the web version, I think it's about >20 tokens/s
> Before using the web version, I think it's about >20 tokens/s
llama.cpp command-line or MLC AI command-line? You gave me a number of 50 ms/token.
> yes, MLC doesn't have any speed displayed, but it really goes brrr
Can you get a screen recording of it? On your specific computer.
Using WebGPU, this is what I got:
The strange thing is that Stable Diffusion is not "that fast".
It gives 47 ms/token decoding. Compare to llama.cpp at ctx=512: 38 ms.
Latency per 512 tokens:
LLaMA | 6.7B | 13.0B | 32.5B | 65.2B |
---|---|---|---|---|
PyTorch | 116.1 s | 558.1 s | OOM | OOM |
Web LLM | 30.1 s | n/a | n/a | n/a |
llama.cpp | 20.8 s | 38.6 s | 88.9 s | 17839.5 s |
MPSGraph | | | | |
Metal FlashAttention | | | | |
Theoretical Lower Bound | 4.4 s | 8.6 s | 21.6 s | 43.3 s |
Why does the tokens per second get monotonically slower as I get farther into the conversation (Web LLM)?
I don’t really know what I’m doing, but I translated the core of LLaMA_MPS to MPSGraph. The hard part is figuring out how to load the weights.
https://gist.github.com/philipturner/23e30121a6a898f501d03f117bfe6f92
I got the neural network to run 3x faster. It's going to be several weeks before I publish the Metal code - can you wait until then?
@philipturner I'll wait. Is it for your repo or for llama.cpp?
I'm making a repo that does a lot more than just optimize quantized GEMV. It's also multi-vendor. Should be easy to integrate into llama.cpp.
@philipturner 👍
Basically, I'm doing everything I can so that Apple platforms can get properly supported by Modular AI. It's a long way away, but eventually we won't need to make custom AI frameworks (e.g. GGML) just to run a language model fast.
cool
Edit: what do you think about MLC? Is it as fast as using Metal directly?
MLC seems to dispatch to TVM, which uses neural networks to guess the fastest way to run a neural network. There's a much simpler and faster solution to the matrix multiplication problem, which Modular implemented with flying colors. Also, TVM only supports AI inference, not AI training.
@philipturner thanks
> @evanmiller This info is outdated (and likely wrong) - I am working on offloading the full computation on the M1 GPU with custom kernels and hope to do it properly this time around.
@ggerganov I guess it wouldn't hurt to drop the Q4 shader variant (not the full FlashAttention though). I recommend using metal-cpp instead of the ObjC or Swift bindings. Have fun making the Apple GPU go brrr 😄
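For readers wondering what a Q4 shader has to compute, here's a scalar CPU sketch of dequantization in ggml's Q4_0 format (blocks of 32 weights sharing one float scale, nibbles stored with an offset of 8). Illustrative only; this is not the shader being discussed:

```cpp
#include <cstdint>
#include <cstddef>

// ggml-style Q4_0 block: 32 weights share one fp32 scale; each weight is a
// 4-bit unsigned value q in [0, 15], decoded as (q - 8) * d.
struct block_q4_0 {
    float   d;       // scale
    uint8_t qs[16];  // 32 x 4-bit quants, two per byte
};

// Dequantize n_blocks blocks (32 * n_blocks weights) into y. A Metal shader
// would do the same arithmetic per SIMD lane instead of in a scalar loop.
void dequantize_q4_0(const block_q4_0* x, float* y, size_t n_blocks) {
    for (size_t i = 0; i < n_blocks; i++) {
        const float d = x[i].d;
        for (int j = 0; j < 16; j++) {
            const uint8_t b = x[i].qs[j];
            y[i*32 + 2*j + 0] = ((b & 0x0F) - 8) * d;  // low nibble
            y[i*32 + 2*j + 1] = ((b >>   4) - 8) * d;  // high nibble
        }
    }
}
```
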
@philipturner
Thanks for the info.

> I recommend using metal-cpp instead of the ObjC or Swift bindings

What are the benefits of `metal-cpp`?
My M1 GPU implementation is here: https://github.com/ggerganov/ggml/pull/108
I currently have prepared the Metal example to be able to load the `ggml` compute graph together with all necessary data. Next step is mapping it to a command buffer and implementing the custom Metal kernels as needed.
The path is overall clear; the only open question is how exactly to support dynamic shapes (i.e. tensors whose size depends on the number of input / processed tokens). The straightforward way seems to be to recreate the command buffer for each generation - not sure about the overhead of this. If the overhead is too much, I would need to think about some alternative approach.
My code example makes a single command buffer for each token generation. Even a single command buffer per layer would be reasonable, just not a single cmdbuf per elementary operation (what PyTorch does). Also don’t break the cmdbuf into multiple encoders (which removes the benefit of one cmdbuf). If you need to copy buffers via blit encoder, I wrote a very fast compute shader with the same functionality.
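To make that concrete, here is a minimal metal-cpp sketch of the "one command buffer, one encoder per token" pattern. The function name, pipeline list, and kernel arguments are placeholders, not code from the PR:

```cpp
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Metal/Metal.hpp>
#include <vector>

// One command buffer per generated token, and a single compute encoder for
// the whole forward pass. Many dispatches inside one encoder keep the GPU
// fed; splitting into many encoders (or many cmdbufs) adds per-boundary
// overhead, which is the PyTorch-MPS failure mode described above.
void encode_token(MTL::CommandQueue* queue,
                  const std::vector<MTL::ComputePipelineState*>& kernels,
                  MTL::Buffer* activations) {
    MTL::CommandBuffer* cmdbuf = queue->commandBuffer();
    MTL::ComputeCommandEncoder* enc = cmdbuf->computeCommandEncoder();
    for (MTL::ComputePipelineState* pso : kernels) {
        enc->setComputePipelineState(pso);
        enc->setBuffer(activations, 0, 0);
        // Grid/threadgroup sizes are placeholders; real kernels choose them
        // to align memory transactions.
        enc->dispatchThreadgroups(MTL::Size::Make(256, 1, 1),
                                  MTL::Size::Make(128, 1, 1));
    }
    enc->endEncoding();
    cmdbuf->commit(); // no waitUntilCompleted(); overlap with the next token
}
```
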
Regarding dynamic sizes, I highly recommend you look through my MPSGraph Swift code example a few comments above.
I prefer metal-cpp because ObjC is pretty much a deprecated language. It's been replaced by Swift, and I refuse to learn ObjC just to write Metal code. I have a long history with that. So for C stuff, I will use C++ bindings over ObjC any day.
My choice is mostly personal preference; however, metal-cpp has the same functionality as the Metal ObjC bindings. As long as you understand the `NS::SharedPtr` memory model, it's quite straightforward to use. Also more comprehensible to non-Apple devs (what is an `@interface` and `@implementation`? vs. what is `class A: class B { public:`?). For example, VkFFT has used the C++ bindings.
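For what it's worth, the `NS::SharedPtr` model is the main thing to learn. A small sketch of the two ownership cases (the buffer size and names are arbitrary):

```cpp
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Metal/Metal.hpp>

int main() {
    // new*/Create* calls return +1 references; TransferPtr adopts them so
    // the wrapper releases the object when it goes out of scope (no
    // autoreleasepool juggling, no manual release()).
    NS::SharedPtr<MTL::Device> device =
        NS::TransferPtr(MTL::CreateSystemDefaultDevice());
    NS::SharedPtr<MTL::CommandQueue> queue =
        NS::TransferPtr(device->newCommandQueue());
    NS::SharedPtr<MTL::Buffer> weights = NS::TransferPtr(
        device->newBuffer(1024, MTL::ResourceStorageModeShared));

    // Objects returned without "new"/"create" in the name (e.g. a command
    // buffer) are autoreleased; retain them via NS::RetainPtr only if they
    // must outlive the current scope.
    MTL::CommandBuffer* cmdbuf = queue->commandBuffer();
    cmdbuf->commit();
    cmdbuf->waitUntilCompleted();
    return 0;
}
```
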
To remove the dependency on MPS/MPSMatrixMultiplication, use the early-stage SIMD-group matmul here (faster than MPS for FP16). It requires aligned and non-transposed matrices - the latter restriction is trivial to lift. For the former (a padding sketch follows this list):

- First matmul in attention: 32/40/64 are multiples of 8; 52 is not (zero-pad to 56)
- LLaMA 6.7B: 32-wide block size for the second matmul in attention
- LLaMA 13.0B: 40-wide block size
- LLaMA 32.5B: two shader invocations, one with block 32, another with block 24, and modify the code to stride the memory accesses to 56
- LLaMA 65.2B: 32-wide block size
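The padding sketch promised above: a hypothetical CPU-side helper (FP32 for brevity; the actual kernels would use FP16) that zero-pads a row-major matrix so both dimensions are multiples of 8:

```cpp
#include <vector>
#include <cstddef>

// Round n up to the next multiple of `align` (e.g. 52 heads -> 56).
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

// Copy a rows x cols row-major matrix into a zero-padded buffer whose
// dimensions are multiples of 8, as the SIMD-group matmul requires.
// The extra rows/columns are zeros, so they contribute nothing to the
// matmul result and can be sliced off afterward.
std::vector<float> pad_to_multiple_of_8(const float* src,
                                        size_t rows, size_t cols) {
    const size_t rp = round_up(rows, 8);
    const size_t cp = round_up(cols, 8);
    std::vector<float> dst(rp * cp, 0.0f);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[r * cp + c] = src[r * cols + c];
    return dst;
}
```
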
I have an M2 Max with 96 GB at my disposal, should you like me to perform tests. I have some experience in ML using Python and would very much like to help.
Closing this, as the Metal implementation has now officially landed on `master`.