bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Fp4 - 2x slower than fp16 #465

Closed NeonBohdan closed 1 year ago

NeonBohdan commented 1 year ago

I have CUDA 11.8, an RTX A5000, and bitsandbytes==0.39.0. When shifting from fp16 LoRA fine-tuning to int8 or fp4, I see an immediate 2x performance drop.

Is this expected behaviour, or do I have a problem on my side? Looks related to https://github.com/TimDettmers/bitsandbytes/issues/6 and similar to https://github.com/TimDettmers/bitsandbytes/issues/347
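
For reference, this is roughly the setup I'm comparing (a minimal sketch using the transformers integration; the model name is just a placeholder, not my exact script):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # placeholder 7B model

# fp16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
)

# 4-bit (fp4) variant that shows the ~2x slowdown
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",            # or "nf4"
    bnb_4bit_compute_dtype=torch.float16, # compute still happens in fp16
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
)
```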

Oxi84 commented 1 year ago

For 8-bit this is expected with LoRA (and I guess with QLoRA as well). It ran slower for some fine-tunes, but in fp16 I could only fit a single batch.

Do you get the same slowdown for both? For me, 8-bit is 4-5x slower and 4-bit only 1.2-2x slower.

NeonBohdan commented 1 year ago

Thanks. For int8 I see a 2-3x drop. For nf4 it's a 20-100% drop on a 7B model, depending on batch size. For a 3B model the drop is bigger.

But a 7B model can fit in 24 GiB with bs=1 and seq_len=256, and nf4 helps increase seq_len further. Also, maybe a 13-30B model will have a lower performance drop with QLoRA and bs=1.
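
In case it helps, a rough sketch of the nf4 + LoRA (QLoRA-style) setup I mean, assuming the peft library; the target modules and hyperparameters are illustrative, not my exact config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder 7B model

# nf4 quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on top of the frozen 4-bit weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # illustrative for LLaMA-style blocks
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Then train with bs=1 and seq_len=256 to stay within ~24 GiB.
```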