Hi folks, I'm curious why the dequantization step is necessary when finetuning the LoRA weights. Could we further quantize the input X to 4-bit and also learn the LoRA weights in 4-bit, instead of the 16-bit precision presented in the paper? That way we would never need to dequantize any weights, which should save computational resources.
Or is there a catch here: does NVIDIA hardware even support 4-bit matrix multiplication?
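To make the question concrete, here is a minimal sketch of the forward pass I'm asking about. The `dequantize_4bit` helper and the toy codebook are made up for illustration (a stand-in for the real NF4 code and blockwise absmax scaling), not the actual bitsandbytes implementation. The point is the middle line: the 4-bit codes are expanded back to bf16 before the matmul runs.

```python
import torch

def dequantize_4bit(w_q, absmax, codebook):
    # Hypothetical helper: map 4-bit integer codes back to bf16 values
    # via a codebook lookup, then rescale by the stored absmax.
    return codebook[w_q.long()] * absmax

# Toy shapes for illustration
d_out, d_in, r = 8, 16, 2
codebook = torch.linspace(-1, 1, 16, dtype=torch.bfloat16)  # stand-in for the NF4 code
w_q = torch.randint(0, 16, (d_out, d_in))                   # frozen 4-bit codes (stored as ints)
absmax = torch.rand(d_out, 1, dtype=torch.bfloat16)         # per-row scale (simplified)
lora_A = torch.randn(r, d_in, dtype=torch.bfloat16)         # trainable LoRA weights in 16-bit
lora_B = torch.zeros(d_out, r, dtype=torch.bfloat16)
x = torch.randn(4, d_in, dtype=torch.bfloat16)

w = dequantize_4bit(w_q, absmax, codebook)   # 4-bit -> bf16 before the matmul
y = x @ w.t() + (x @ lora_A.t()) @ lora_B.t()
```

My question is essentially whether the `x @ w.t()` product (and the LoRA path) could be done directly on the 4-bit representations, skipping the dequantize line entirely.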