Hi folks, I'm curious why the dequantization step is necessary when finetuning the LoRA weights. Could we further quantize the input X to 4-bit and also learn the LoRA weights in 4-bit, instead of the 16-bit precision presented in the paper? That way we would never need to dequantize any weights, which should save computational resources.
Or is there a catch here: does NVIDIA hardware even support 4-bit matrix multiplication?
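To make the question concrete, here is a minimal sketch of the forward pass I'm asking about. The `dequantize_4bit` helper and the toy codebook are made up for illustration (a stand-in for the real NF4 code and blockwise absmax scaling), not the actual bitsandbytes implementation. The point is the middle line: the 4-bit codes are expanded back to bf16 before the matmul runs.

```python
import torch

def dequantize_4bit(w_q, absmax, codebook):
    # Hypothetical helper: map 4-bit integer codes back to bf16 values
    # via a codebook lookup, then rescale by the stored absmax.
    return codebook[w_q.long()] * absmax

# Toy shapes for illustration
d_out, d_in, r = 8, 16, 2
codebook = torch.linspace(-1, 1, 16, dtype=torch.bfloat16)  # stand-in for the NF4 code
w_q = torch.randint(0, 16, (d_out, d_in))                   # frozen 4-bit codes (stored as ints)
absmax = torch.rand(d_out, 1, dtype=torch.bfloat16)         # per-row scale (simplified)
lora_A = torch.randn(r, d_in, dtype=torch.bfloat16)         # trainable LoRA weights in 16-bit
lora_B = torch.zeros(d_out, r, dtype=torch.bfloat16)
x = torch.randn(4, d_in, dtype=torch.bfloat16)

w = dequantize_4bit(w_q, absmax, codebook)   # 4-bit -> bf16 before the matmul
y = x @ w.t() + (x @ lora_A.t()) @ lora_B.t()
```

My question is essentially whether the `x @ w.t()` product (and the LoRA path) could be done directly on the 4-bit representations, skipping the dequantize line entirely.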