Open ggerganov opened 1 year ago
supports all quantization types
The best place for dequantization would be after you've loaded data from device -> threadgroup memory, and are loading into registers. I hypothesize it will faster to have each of the four simds in a threadgroup unpack on their own, even if that duplicates the work two times.
For example, you might fork my MFA repository, then modify the line here to unpack after loading.
While I do understand the desire for the project to be independend of other libraries, I personally do not think removing the excellent cuBLAS implementation entirely is a good idea. One of the main benefits of this project is that you can run large models with good speeds on lower end hardware with less VRAM. For example, thanks to the cuBLAS' fast prompt processing of 15 ms/t I can enjoy a 13B parameter model with full context at around 1.6 token/sec, which is much faster than running GPTQ with cpu offloading (0,4 token/s) on my RTX 2060 laptop.
As you've said yourself, a native GEMM implementation would likely be slower than what cuBLAS is offering. Even if the performance difference is not drastic on your hardware, it can make all the difference on hardware like mine, even a slight performance difference (for example 15 ms/t to 23 ms/t) would lead to a worse experience for people who run huge LLMs too big for their systems. Not just for low to mid spec hardware, but for high end hardware trying to run 65B models as well.
Please keep this perspective in mind as you continue to develop the project. I don't think it would be a good outcome for me and many others to have to downgrade to older versions with cuBLAS and not enjoy new enhancements just because older versions run faster due to cuBLAS support.
Why would custom code be slower than cuBLAS? There is Nvidia CUTLASS. The only things it can be are equal or faster.
Why would custom code be slower than cuBLAS? There is Nvidia CUTLASS. The only things it can be are equal or faster.
They aren't going to use CUTLASS though, because it'd be another third party lib.
CuBLAS is highly optimized for the hardware and Georgi said himself that he is aware a custom code is not going to perform as well in this comment:
https://github.com/ggerganov/llama.cpp/issues/1867#issuecomment-1595702365
And I fully understand that it will be close to impossible to achieve the maximum performance available from dedicated libraries (such as cuBLAS, for example).
300 lines of code, (soon to) outperform all of Apple's proprietary Metal Performance Shaders.
https://github.com/philipturner/metal-flash-attention/blob/main/Sources/GEMM.metal
If we take some code from CUTLASS, maybe just optimize for the matrix shapes that exist in LLaMA.
A custom GEMM implementation will be faster with quantized models - that's one of the goals. There may be a small performance regression with f16 and f32 models, though.
Could we utilize the int4 hardware support in Ampere tensor cores?
From what I have seen of the way int4 works with tensor cores, I don't think so. We cannot do a matrix multiplication directly in int4, we need to dequantize to f16 or f32 first. But we can still use the tensor cores after dequantizing to float. I may be wrong, though.
A custom GEMM implementation will be faster with quantized models - that's one of the goals. There may be a small performance regression with f16 and f32 models, though.
Glad to hear that. I hope that will be the case.
Could we utilize the int4 hardware support in Ampere tensor cores?
Just a heads up: Tensor cores in Turing, Ampere and Ada Lovelace support INT4, INT8 and FP16 instructions. Ampere and Turing support INT1 as well.
While only Ampere and Ada support FP8 and FP32 in addition to that.
So ideally, the code would use FP16 or the integer instructions I mentioned to cover a wide range of hardware with tensor core support.
Didn't they remove INT4 on Ada and Hopper?
Didn't they remove INT4 on Ada and Hopper?
Only on Hopper. INT4 is still present in consumer Ada Lovelace.
Compared to Ampere, Ada delivers more than double the FP16, BF16, TF32, INT8, and INT4 Tensor TFLOPS, and also includes the Hopper FP8 Transformer Engine, delivering over 1.3 PetaFLOPS of tensor processing in the RTX 4090.
It looks like Ada removed support for INT1 though.
I'd tend to agree with @Dampfinchen : being interested in the Intel platform I don't believe we'll be able to outperform the MKL and oneAPI engineers.
I tried implementing a (dequantization +) matrix matrix multiplication CUDA kernel but I'm struggling to get past 50% of cuBLAS performance for prompt processing. In particular, I found against my expectation that fusing dequantization + matrix matrix multiplication does not have a large impact on performance, possibly because you're limited by compute rather than memory bandwidth for large matrices (I am currently not using tensor cores).
What level of performance/sophistication is the goal for something that could possibly be merged? As of right now my implementation could already be useful for token generation since the VRAM usage will be lower compared to cuBLAS. But for prompt processing it is clearly worse. In general I think leaving cuBLAS as a compilation option would be desirable because given the small impact of tensor fusion I don't think I can realistically beat it for prompt processing performance.
50% of cuBLAS performance for prompt processing.
If you get 50% of cuBLAS performance, then what is cuBLAS performance in ALU utilization? Perhaps both underutilize the processor, leaving much room to improve.
If y'all progressively get rid of blas libraries, cublas is probably lowest on the totem pole? AFAIK users still need the huge cuda toolkit to run cuda inference anyway, so its hardly even getting rid of a dependency.
CLBLAST and CPU BLAS, on the other hand, can be tricky, but their implementations are open source. I just tried to get the OpenBLAS build working on an Ampere instance for a few hours... and ultimately failed.
Why not just write the entire thing in one Mojo file.
What level of performance/sophistication is the goal for something that could possibly be merged?
Personally, I wouldn't want this enabled by default until the performance with quantized models is at least comparable to cuBLAS. It's ok if performance with f16/f32 models is worse. But I think it could still be useful to have it merged as an option, just disabled by default. It would be a starting point, and we could keep improving it over time until we reach the performance goal.
Why not just write the entire thing in one Mojo file.
Because Mojo is vaporware and doesn't actually exist yet? You can't yet download or run Mojo, it is useless for now.
There's a reason they're close-sourcing it for now, the same reason I close-sourced MFA for ~2 months. It's too buggy at the moment and will develop faster the way it is now.
Now I open-source when it is ready, and the decision pays off.
Because Mojo is vaporware and doesn't actually exist yet?
When I learned what Modular was doing, I quit AI and shifted careers. No cap. There's nothing left for me to do because Modular is going to solve it.
There's a reason they're close-sourcing it for now, the same reason I close-sourced MFA for ~2 months. It's too buggy at the moment and will develop faster the way it is now.
Now I open-source when it is ready, and the decision pays off.
While I absolutely don't claim to know that it will never release, there is plenty of reason NOT to bet on it just yet considering it isn't public.
It's not just closed source, you can't even download binaries yet.
@ggerganov Can you clarify the current/planned threading model for CPU computation? This seems like it should be central to the discussion... BLAS is multi-threaded and works extremely well when the calling program is single-threaded. GGML appears to use threads to support concurrent execution of unrelated tasks. I wonder if the majority of workflows would be better off with a single-threaded top-level scheduler, with all cores assigned to work on individual large-ish tasks, a la the BLAS computation model.
I must say that this is simply not possible. I recommend reading the paper titled Anatomy of High-Performance Matrix Multiplication, written in 2008. Achieving high performance requires significant sacrifices. Have you looked at the code for Goto BLAS or OpenBLAS? They are all written in assembly! Yes, assembly language! You need to understand intimately how the hardware works and gauge exactly how far you can push it in order to achieve maximum performance. @ggerganov
You would be correct if we were just doing FP32 + FP32 -> FP32 matrix multiplication. But we are not. The matrix is quantized to some custom data format that consists mostly of low-precision integers + some floating point scales. This data format can not be directly used by any regular BLAS library. So currently the quantized data needs to be converted to FP32 first which costs you both compute time and extra memory.
If you were to instead convert the hidden state from FP32 to q8_1 you would also be able to drastically reduce the amount of floating point instructions and replace them with SIMD integer instructions which are much faster. Consider the current state of CUDA mul_mat_q kernels: they use 700/970/1430 MiB less memory than cuBLAS for 7b/13b/33b and they are up to 2x faster (depending on hardware and quantization format). This is not because I can write the absolute best GEMM kernels but simply because I wrote GEMM kernels that take advantage of the specific ggml data format, both in terms of data types and the memory layout.
Great idea! I hadn't previously considered the overhead caused by the custom data format. However, I still believe that while minimizing overhead, we should use these BLAS libraries as much as possible to ensure optimal performance across different hardware. Because unlike CPUs, there are significant architectural differences between GPUs. Even products from the same company can have vast differences between generations. Every time NVIDIA introduces a new GPU architecture, CUDA has to undergo major updates to achieve the best performance on the new hardware. So, a custom-written kernel needs continuous maintenance, and the effort required is substantial. If you really don't want to use BLAS, I suggest you take a look at these: Deep Learning Compilers How Rammer squeezes more out of accelerator performance
I don't think a single file can use all the hardware features of every processor, until Mojo comes around. We don't have a unified language, as CUDA only runs on gaming rigs and high-end laptops (regarding consumer hardware). Metal runs on 1 billion smartphones but is very different. Plus to use simdgroup_async_copy
you have to pre-compile offline using a command-line tool from the archived Xcode 14.2 binary (dependency nightmare).
For mat-vec multiplication, it makes sense to dequantize in place. For mat-mat multiplication, dequantizing in-place increases the total number of operations while the ALU is already saturated. Plus, the proposed single-file idea will probably skip important hardware features (e.g. simdgroup_async_copy
) that get full ALU saturation in the first place. I've been discussing quantized mat-mat multiplication in another AI application, and we decided on dequantizing to a small scratch MTLBuffer
before calling into a pre-compiled FP16 x FP16 GEMM kernel from MFA (not MPS).
NOTE: By single-file I do not mean it has to literally be a single file. But that is the general sense of what this idea seems to be close to.
For mat-mat multiplication, dequantizing in-place increases the total number of operations while the ALU is already saturated.
The goal is not to dequantize in place but to quantize the hidden state to q8_1 once per matrix matrix multiplication and to then do the calculations entirely using the quantized formats. This lets you replace floating point arithmetic with integer arithmetic or SIMD instructions so it should end up being faster.
The GPU is already a SIMD architecture. Do you mean an optimization only applicable to CPU? If you're using an entire SIMD vector instruction for one scalar, that's underutilizing the SIMD ALU by a factor proportional to vector width.
I mean to use this instead of floating point arithmetic.
this might be relevant: https://github.com/ashvardanian/SimSIMD
On Mozilla's llamafile project, we managed to get ggml-cuda.cu
to not need to depend on cuBLAS any more, by whipping up an implementation of the four or so GEMM functions it needed. We called it tinyBLAS and it's worked reasonably well so far. The advantage for us has been enabling our prebuilt llama.cpp binaries to run on stock Windows installs, thus aiding distributability. See https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/tinyblas.cu Would you want us to upstream this? cc: @stlhood
Well what's the performance like?
For many LLMs I've tried, tinyBLAS goes 4% slower. Some measurements:
I also tested on Jetson and NVIDIA L4. The numbers were basically the same.
However, LLaVA image processing currently only goes 50% as fast as cuBLAS when using tinyBLAS. We have some changes in flight like Mozilla-Ocho/llamafile#156 for improving that. One of the benefits I see to upstreaming is that I'd love to have an opportunity to collaborate with folks here on improving that.
What batch size are we talking about here? For a batch size of 1 you could already run e.g. LLaMA models completely without cuBLAS with moderate performance penalties. The problem is efficient matrix multiplication with large batch sizes where you are compute bound rather than I/O bound. Currently for LLaMA 2 q8_0 prompt processing using my RTX 3090 I get 3400 t/s with cuBLAS and 2300 t/s using MMQ.
We learned that the hard way. Our first pass coding tinyblasGemmEx() processed batched data 1600% slower. Adding 2d blocking and __shared__
memory was what enabled us to make it only go 50% slower.
oh my god, tired of mutiBLAS madness and you save my life @jart It take me 3 days to dive in ocean of runner and this one only take me 30 minutes to make sure it work.
58.44 tokens per second
with RX 6600 XT , no CPU overload
just download llamafile from release page, add .exe and voila!
./llamafile-0.6.exe -ngl 35 --gpu amd -m ../models/vinallama-2.7b-chat_q5_0.gguf
I forgot to say: I recently looked at the cuBLAS kernels in detail and it seems that they are not optimized for consumer GPUs at all. This is because while professional GPUs typically have powers of 2 as the SM count the SM count on consumer GPUs is more awkward to optimize for (which seems intentional to me). As a consequence the last wave in cuBLAS FP16 GEMM has poor GPU utilization. The overall GPU utilization is ~15% lower at a batch size of 512 for 7b (larger models and batch sizes are less affected). So depending on the amount of invested effort it may be possible to write CUDA kernels that outperform cuBLAS FP16 GEMM on consumer GPUs.
That's great to hear @hiepxanh! I see you have an AMD GPU. We also learned earlier in https://github.com/Mozilla-Ocho/llamafile/issues/188#issuecomment-1892411089 that tinyBLAS is helping people with AMD laptops, because rocBLAS is designed for the HPC market and was never tested on architectures like gfx1103, where it currently fails. tinyBLAS works, and pumping up AMD consumer support is a priority.
Also @ahgamut and I rented an NVIDIA RTX 3080 for nine hours yesterday and did some work on tinyBLAS. Here's our latest progress. Earlier, I reported tinyBLAS took 2x longer (total time) to do LLaVA image processing (which requires chewing on a meaty 512 token batch). Now tinyBLAS only takes 1.43x longer. That's thanks to our most recent iteration in https://github.com/Mozilla-Ocho/llamafile/pull/205 which is helping us close the gap. Here's the executive summary:
Please help us @JohannesGaessler because I hope what you're saying is true. I would love nothing more than for us to be able to claim basic linear algebra subprogram supremacy on consumer hardware. Some of our recent changes, like C++ templatizing the tinyBLAS block kernel, I believe will help us ship a pareto optimized set of code paths tuned for popular consumer hardware, while having safe fallbacks for everything else. For example, on previous iterations, I noticed development work we did on NVIDIA A100 needed to be recalibrated to also run well on a $300 AMD GPU, possibly due to a smaller register file. We had to trade away a few percent on the A100, but it made a 20x difference for the Radeon. With templates that shouldn't be an issue anymore.
Sorry, I don't have the time to work on TinyBLAS. I'm currently working on matrix multiplication using int8 tensor cores https://github.com/ggerganov/llama.cpp/pull/4801 and I have a lot of other llama.cpp related things queued up after that which I think are more important. And even if I had more time writing better FP16 GEMM than cuBLAS for consumer GPUs would still be very difficult and time consuming.
This is a big one
The only reason we use BLAS is that we don't have efficient implementation of
matrix x matrix
multiplication. Naively doing parallel dot products is not optimal. We need to implement some of the fundamental GEMM optimizations such as block tiling and we need to implement this in a compact way that reuses the existing dot product code and supports all quantization typesMore comments on this: