Open Ther-nullptr opened 3 months ago
We do have a more optimal GEMV path for inference with batch size of 1, but otherwise your thought process here is sound. It should be possible, and I would suggest following along with a potential FLUTE integration in #1293.
Feature request
A fused CUDA kernel that combines the main-weight dequantization step with the matrix multiplication, in order to reduce on-chip/off-chip data movement.
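The idea can be sketched in NumPy (a minimal sketch with a hypothetical block-wise absmax int8 format for illustration; bitsandbytes' actual NF4/FP4 formats are more involved). The "fused" path dequantizes one block at a time inside the GEMV loop instead of materializing the full 16-bit weight first:

```python
import numpy as np

BLOCK = 64  # hypothetical quantization block size

def quantize(W, block=BLOCK):
    # Block-wise absmax quantization to int8 (illustrative only).
    Wf = W.reshape(-1, block)
    absmax = np.abs(Wf).max(axis=1, keepdims=True)
    q = np.round(Wf / absmax * 127).astype(np.int8)
    return q, absmax

def dequantize(q, absmax):
    return (q.astype(np.float32) / 127.0) * absmax

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)
q, absmax = quantize(W)

# Unfused path: materialize the full dequantized weight, then multiply.
W_dq = dequantize(q, absmax).reshape(W.shape)
y_unfused = W_dq @ x

# "Fused" path: dequantize one block at a time inside the GEMV loop,
# never writing the full dequantized weight back to memory.
y_fused = np.zeros(W.shape[0], dtype=np.float32)
blocks_per_row = W.shape[1] // BLOCK
qr = q.reshape(W.shape[0], blocks_per_row, BLOCK)
ar = absmax.reshape(W.shape[0], blocks_per_row, 1)
for b in range(blocks_per_row):
    tile = (qr[:, b].astype(np.float32) / 127.0) * ar[:, b]
    y_fused += tile @ x[b * BLOCK:(b + 1) * BLOCK]

assert np.allclose(y_unfused, y_fused, atol=1e-4)
```

Both paths compute the same result; in a real CUDA kernel the per-block dequantization would happen in registers or shared memory, so the 16-bit weight never touches global memory.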
Motivation
I used profiling tools to analyze the breakdown of QLoRA:
Notice that the dequantization of the main weight takes nearly 30%–50% as long as the main matrix multiplication itself. Analyzing the computing process:
So is it possible to fuse the kernels so that they act like this:
This way we only need to launch one kernel, and we save one 16-bit weight store and one 16-bit weight load.
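The saving can be checked with back-of-envelope arithmetic (the layer size and the 4-bit packing are assumptions; this counts weight traffic only and ignores activations and the small per-block absmax metadata):

```python
# Hypothetical 4096x4096 linear layer with 4-bit quantized weights.
out_features, in_features = 4096, 4096
n = out_features * in_features

bytes_4bit = n // 2   # packed 4-bit weight read
bytes_16bit = n * 2   # full fp16/bf16 weight

# Unfused: read 4-bit weights, write the 16-bit result, read it back for GEMM.
unfused = bytes_4bit + bytes_16bit + bytes_16bit
# Fused: read 4-bit weights once; dequantize on-chip during the GEMM.
fused = bytes_4bit

print(unfused / fused)  # → 9.0
```

Under these assumptions the fused path moves 9x less weight data, which is why the dequantization overhead shows up so prominently in the profile.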
Your contribution
I have just observed this, and I want to ask whether this idea is feasible.