bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Add Ascend NPU support for nf4 quant #1422

Open statelesshz opened 1 day ago

statelesshz commented 1 day ago

What does this PR do?

This PR adds Ascend NPU support for NF4 quantization/dequantization and makes it possible to do QLoRA fine-tuning of LLMs using transformers, peft, and trl.

You may notice that the NF4 quantization method is currently implemented in PyTorch. This is an interim measure: the high-performance version implemented in AscendC is still in progress 😞. Meanwhile, many in the Ascend NPU community have expressed keen interest in using QLoRA to fine-tune large language models (LLMs) as soon as possible.
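To illustrate what a pure-PyTorch NF4 path involves, here is a minimal sketch of blockwise NF4 quant/dequant: scale each block by its absmax, snap each value to the nearest of the 16 NF4 code values, and invert by lookup plus rescale. This is my own simplified illustration, not the code from this PR; the code table is the 16 NF4 values from the QLoRA paper, and real kernels would also pack two 4-bit indices per byte.

```python
import torch

# The 16 NF4 code values (normal-distribution quantiles normalized to
# [-1, 1]), as defined in the QLoRA paper.
NF4_CODE = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def quantize_nf4(x: torch.Tensor, blocksize: int = 64):
    """Blockwise NF4 quantization: per-block absmax scaling, then
    nearest-code lookup. Returns 4-bit indices and per-block scales."""
    flat = x.reshape(-1, blocksize)
    absmax = flat.abs().max(dim=1, keepdim=True).values
    normed = flat / absmax                       # values now in [-1, 1]
    # Index of the nearest NF4 code value for every element.
    idx = (normed.unsqueeze(-1) - NF4_CODE).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax

def dequantize_nf4(idx: torch.Tensor, absmax: torch.Tensor, shape):
    """Inverse: look up code values and rescale by the block absmax."""
    return (NF4_CODE[idx.long()] * absmax).reshape(shape)

w = torch.randn(4, 64)
idx, absmax = quantize_nf4(w)
w_hat = dequantize_nf4(idx, absmax, w.shape)
```

Because every operation here is plain tensor math, the same code runs unmodified on any device PyTorch supports, which is what makes this interim path viable on NPU while the AscendC kernels are finished.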

Related PR: https://github.com/huggingface/transformers/pull/31512

Collaborators

@SlightwindSec @Ginray @MatrixPlayer

cc @Titus-von-Koeller

statelesshz commented 1 day ago

[asciicast demo recording]

Following this blog, I ran an end-to-end QLoRA fine-tuning test on llama2-7b-hf in my environment with an NPU device, and it works 🤗.

Here is the script I used.

baymax591 commented 11 hours ago

Thanks a lot for sharing this PR and the video demo! With the demo, I was able to successfully run NF4 quant/dequant on the NPU with ease. The detailed explanation in the video really helped me understand the process and key steps. Looking forward to more updates in the future. Great work!

baymax591 commented 11 hours ago

I hope this PR can be merged soon, as it provides valuable improvements. cc @Titus-von-Koeller