artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

[Questions]: How to implement NF4/NF2 matmul kernel function? #286

Open llCurious opened 10 months ago

llCurious commented 10 months ago

Hi @TimDettmers,

The paper shows that you quantize the weights to 2/4 bits using the NF format. I wonder how you handle the input activations (denoted as x). Is x also quantized to 2/4 bits?

If not, do you dequantize the quantized weights back to a float format and perform the matmul with float kernels? That seems to slow down inference despite the lower memory footprint.

neoweasley commented 4 months ago

The paper "QLORA: Efficient Finetuning of Quantized LLMs" says "In practice, this means whenever a QLORA weight tensor is used, we dequantize the tensor to BFloat16, and then perform a matrix multiplication in 16-bit." As far as i'm concerned, NF16 and Double Quantization may slow down the training and inference efficiency, which means this work sacrifices time and performance for memory. However, this paper reports "24hours of finetuning on a single GPU", and I wonder how.

XA23i commented 4 days ago

In uniform quantization we can do x_q = round(x / s) + offset and \hat{x} = (x_q - offset) * s. However, in NF4 quantization we need to find the nearest quantization level for each x, and I am wondering how to implement that efficiently.
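
One way to make the nearest-level search cheap: the NF4 codebook is fixed and sorted, so instead of an argmin over all 16 levels per element you can bucketize against the 15 midpoints between adjacent levels. A rough PyTorch sketch (the level values are the NF4 table from the paper/bitsandbytes; block handling and tie-breaking are simplified assumptions):

```python
import torch

# NF4 codebook from the QLoRA paper / bitsandbytes, normalized to [-1, 1].
NF4_LEVELS = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 0.33791524171829224,
    0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0,
])

def quantize_nf4(x: torch.Tensor, block_size: int = 64):
    # Blockwise absmax scaling, then nearest-level lookup via bucketize:
    # the 15 midpoints between adjacent levels are the decision boundaries,
    # so torch.bucketize returns the index of the closest NF4 level directly.
    xb = x.reshape(-1, block_size)
    absmax = xb.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    xn = xb / absmax                                      # normalize into [-1, 1]
    midpoints = (NF4_LEVELS[1:] + NF4_LEVELS[:-1]) / 2    # 15 boundaries
    q_idx = torch.bucketize(xn, midpoints)                # nearest level index, 0..15
    return q_idx.to(torch.uint8), absmax.squeeze(1)
```

Dequantization is then just `NF4_LEVELS[q_idx] * absmax` per block, which is what a fused kernel would do on the fly before the 16-bit matmul.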