Closed: hy-chen closed this issue 7 months ago
Internally they are converted (dequantized) to compatible data types such as f32, f16, or i8.
@FSSRepo Thanks for the reply. Yes, before matmul they need to be dequantized, but when the weights are stored on the CPU(?), how are these data types supported? Also, is the data dequantized on the CPU and then moved to the GPU, or moved to the GPU and then dequantized?
Yes, before matmul they need to be dequantized, but when the weights are stored on the CPU(?), how are these data types supported?
They are just stored in blocks such as this: https://github.com/ggerganov/llama.cpp/blob/381ee195721d8e747ee31a60c0751822b3072f02/ggml-quants.h#L152-L157
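For reference, a Q4_0 block looks roughly like this (a sketch of the layout in that header; see the linked lines for the authoritative definition):

```c
#include <stdint.h>

#define QK4_0 32                 // weights per block

typedef struct {
    uint16_t d;                  // per-block scale, stored as f16 (ggml_fp16_t)
    uint8_t  qs[QK4_0 / 2];      // 32 x 4-bit quants, packed two per byte
} block_q4_0;                    // 18 bytes per 32 weights ~= 4.5 bits/weight
```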
Also, is the data dequantized on the CPU and then moved to the GPU, or moved to the GPU and then dequantized?
Since there is partial offload, they can be dequantized on both the CPU and the GPU. But I guess the answer you are looking for is: they are dequantized as they are used, to keep memory usage low. There is no separate "dequantization pass" before compute.
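To make "dequantized as they are used" concrete, here is a minimal sketch (my own illustration, not ggml's code) of expanding a single Q4_0-style block to f32 right before it is consumed, rather than materializing a full f32 copy of the weights:

```c
#include <stdint.h>

#define QK4_0 32   // weights per block

// d  = block scale (already converted from f16 to f32)
// qs = 32 x 4-bit quants packed two per byte
// Low nibbles fill the first half of `out`, high nibbles the second half,
// mirroring the Q4_0 layout in ggml-quants.h.
static void dequantize_block_q4_0(float d, const uint8_t qs[QK4_0 / 2], float out[QK4_0]) {
    for (int i = 0; i < QK4_0 / 2; i++) {
        const int x0 = (qs[i] & 0x0F) - 8;   // low nibble, re-centered around 0
        const int x1 = (qs[i] >> 4)   - 8;   // high nibble
        out[i]             = x0 * d;
        out[i + QK4_0 / 2] = x1 * d;
    }
}
```

A matmul kernel can do this per block into a tiny scratch buffer (or inline the same arithmetic into its dot product), so there is never a full-size f32 copy of the weight tensor.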
This looks like it should be in Discussions anyway.
Yes, before matmul they need to be dequantized, but when the weights are stored on the CPU(?), how are these data types supported?
They are just stored in blocks such as this:
Thank you! This is a really helpful pointer.
Also, is the data dequantized on the CPU and then moved to the GPU, or moved to the GPU and then dequantized?
Since there is partial offload, they can be dequantized on both the CPU and the GPU. But I guess the answer you are looking for is: they are dequantized as they are used, to keep memory usage low. There is no separate "dequantization pass" before compute.
This looks like it should be in Discussions anyway.
Yes, I understand they are only dequantized when needed. The difference between doing this on the CPU vs. the GPU, from my understanding, is that the former gives the benefit of faster data movement from CPU to GPU, since the weights are still compressed while being moved. Since LLM decoding is usually memory-bandwidth bound, this benefit should improve performance. I wonder whether in this codebase it is dequantized on the CPU, because from the code you pointed to above it is stored in a self-defined way, and I'm not sure how dequantization of that struct can work on the GPU. Could you elaborate on "There is no separate dequantization pass before compute"? Maybe that will help clarify things as well.
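(Back-of-the-envelope, assuming the Q4_0 layout sketched above: 18 bytes per 32 weights is about 4.5 bits per weight, so a 7B-parameter model is roughly 7e9 × 18 / 32 ≈ 3.9 GB of quantized weights versus ~14 GB at f16, i.e. about 3.5× less data to move.)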
Using quantization, I am able to run 7B models. I can get a 2x speedup by offloading to the GPU using NVIDIA cuBLAS, but only if I use -ngl 33 to offload all 33 layers to the GPU. If I do not offload all the layers, the speed is about the same as running on the CPU, or slightly slower, in fact. And then I may as well compile without cuBLAS.
The EliteBook 8760w has a quad-core CPU, and the video card is pretty old, a Quadro M3000M 4 GB card. So that almost makes sense: not worth using the GPU unless it has enough VRAM to offload all those layers. Unless I'm missing something...
I'm not sure how dequantization of that struct can work on the GPU.
You can check the source.
There are multiple backends (CUDA, Metal...), but for CUDA, this is one of the kernels: https://github.com/ggerganov/llama.cpp/blob/381ee195721d8e747ee31a60c0751822b3072f02/ggml-cuda.cu#L1353
As for how it knows which one to call, it's just a switch/case: https://github.com/ggerganov/llama.cpp/blob/381ee195721d8e747ee31a60c0751822b3072f02/ggml-cuda.cu#L7792
The tensors carry type info.
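As an illustration of what that dispatch amounts to (hypothetical names, CPU-side for simplicity; the real switch in ggml-cuda.cu picks a CUDA kernel instead):

```c
#include <stddef.h>

// Hypothetical type tags and conversion routines, for illustration only.
typedef enum { TYPE_F32, TYPE_F16, TYPE_Q4_0, TYPE_Q8_0 } tensor_type;

typedef void (*to_f32_fn)(const void *src, float *dst, int n);

extern void dequant_q4_0(const void *src, float *dst, int n);
extern void dequant_q8_0(const void *src, float *dst, int n);
extern void conv_f16    (const void *src, float *dst, int n);

// The tensor's type tag selects the conversion routine via a switch/case.
static to_f32_fn to_f32_for_type(tensor_type t) {
    switch (t) {
        case TYPE_Q4_0: return dequant_q4_0;
        case TYPE_Q8_0: return dequant_q8_0;
        case TYPE_F16:  return conv_f16;
        default:        return NULL;   // f32 needs no conversion
    }
}
```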
Could you elaborate on "There is no separate dequantization pass before compute"? Maybe that will help clarify things as well.
It just means what you said: the dequantization only happens when needed.
I see, so dequantization is done in a separate kernel before the matmul. Even if we offload all layers to the GPU, so the CPU-to-GPU movement is saved, the GPU DRAM-to-SRAM traffic is not saved, right? Because the weights get decompressed in a separate kernel.
I guess so. This is a separate launch for GGML_OP_GET_ROWS: https://github.com/ggerganov/llama.cpp/blob/381ee195721d8e747ee31a60c0751822b3072f02/ggml-cuda.cu#L6013
Would fusing them help with performance? Probably. But there are more ops than just get-rows and matmul: https://github.com/ggerganov/llama.cpp/blob/381ee195721d8e747ee31a60c0751822b3072f02/ggml-cuda.cu#L10052 This is the price for partial offload and for supporting this many model architectures and compute backends.
I suspect that simply tracing the linear parts of the execution graph and generating code for them as a single kernel would improve performance. I haven't looked deeply, but MLC-LLM looks like it takes that approach.
Right now, I'm a happy user of llama.cpp. There are projects with better performance, but I can't stand their APIs.
Interesting idea from this conversation: would running a separate "optimization pass" through the graph that fuses some ops into a single "super op" help with performance? I could be totally off here; this whole thing is new to me. The idea comes from my previous experience optimizing a bytecode interpreter: simply fusing common adjacent instruction sequences into super-instructions does help performance, and executing the compute graph looks like an interpreter to me. For a bytecode interpreter, fusing helps by reducing branch (mis)prediction cost; for a GPU, it could instead reduce synchronization cost between device and host, since prediction cost is negligible when each op takes so long to execute anyway. Moreover, fusing ops could allow the compiler to optimize better.
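As a purely hypothetical sketch of that idea (invented types and names, not ggml's actual API): a peephole pass over a topologically ordered node list could fold a dequantize node into the matmul that consumes it, so a fused kernel reads the quantized blocks directly.

```c
#include <stddef.h>

enum op_kind { OP_DEQUANTIZE, OP_MATMUL, OP_FUSED_DEQ_MATMUL, OP_OTHER };

struct node {
    enum op_kind op;
    struct node *src0;   // first input node (weights)
    struct node *src1;   // second input node (NULL if unused)
};

// One peephole pass over a topologically ordered node array:
// MATMUL fed by a DEQUANTIZE becomes a single fused op.
static void fuse_dequant_matmul(struct node **nodes, size_t n) {
    for (size_t i = 0; i < n; i++) {
        struct node *mm = nodes[i];
        if (mm->op != OP_MATMUL) continue;
        if (mm->src0 && mm->src0->op == OP_DEQUANTIZE) {
            mm->op   = OP_FUSED_DEQ_MATMUL;  // fused kernel reads quantized blocks directly
            mm->src0 = mm->src0->src0;       // skip the intermediate f32 tensor
            // The now-unused dequantize node can be dropped by a later dead-code pass.
        }
    }
}
```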
This should totally move to discussion lol.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
There is no such datatype as INT6, INT5, or INT3, as seen in the specs?