The short version is that there are no hardware-accelerated "nf4" data types on GPUs. They only have FP32, FP16, BF16 (on some), FP64 (on some), and FP8 (on some). So the code has to emulate nf4 on hardware that is primarily fp32/fp16.
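A minimal sketch of what that emulation means in practice (the shapes are arbitrary, and this assumes a CUDA GPU with bitsandbytes >= 0.41): every 4-bit matmul has to expand the packed weights back to fp16 before the hardware matmul units can touch them, which is extra work the fp16 path never pays.

```python
# Illustrative sketch, not the library's internal kernel path:
# nf4 weights are stored as packed 4-bit blocks plus per-block scales,
# and must be dequantized to fp16 before the GPU can run the matmul.
import torch
import bitsandbytes.functional as F

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")  # weight
x = torch.randn(16, 4096, dtype=torch.float16, device="cuda")    # activations

# Quantize the weight to nf4 (packed 4-bit storage + quantization state).
w_nf4, quant_state = F.quantize_4bit(w, quant_type="nf4")

# fp16 path: a single hardware-accelerated matmul.
y_fp16 = x @ w.t()

# nf4 path: dequantize back to fp16 first, then run the same fp16 matmul.
y_nf4 = x @ F.dequantize_4bit(w_nf4, quant_state).t()
```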
Curious to ask if you've implemented 4-bit or nf4 acceleration yet?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hello @TimDettmers, Thanks for your great work.
Any update on this issue? I've also noticed on my end that 8-bit and 4-bit inference are much slower than fp16.
bitsandbytes 0.41.0
I tried to quantize the Llama-2-13b-hf model using bitsandbytes, but I found that int4 inference performance is lower than fp16 inference, whether on an A100 or a 3090. Can you tell me why, and how to troubleshoot this problem? Thanks.
Version 0.41.1 (in TGI 1.0.1)
Test Command
docker run -d --runtime=nvidia --gpus all --shm-size 16g -p 23811:80 -v /data/model:/data/model ghcr.io/huggingface/text-generation-inference:1.0.1 --model /data/model --num-shard 2 --max-total-tokens 4096 --max-input-length 3072 --quantize bitsandbytes (or bitsandbytes-nf4)
Test Result
![benchmark results](https://github.com/TimDettmers/bitsandbytes/assets/11667149/862176ee-6b4f-48c0-a582-94f469bdfc83)
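For anyone trying to reproduce this outside of TGI, here's a minimal sketch of the equivalent 4-bit load through transformers (the version requirement and config values are assumptions for illustration, not taken from this thread):

```python
# Sketch only: assumes transformers >= 4.30 with bitsandbytes installed.
# The model path mirrors the Llama-2-13b-hf setup reported above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # analogous to --quantize bitsandbytes-nf4
    bnb_4bit_compute_dtype=torch.float16, # matmuls still run in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```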