TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

low performance with Llama-2-13b-hf #712

Closed: PYNing closed this issue 6 months ago

PYNing commented 10 months ago

I tried to quantize the Llama-2-13b-hf model using bitsandbytes, but I found that int4 inference is slower than fp16 inference, on both A100 and 3090 GPUs. Can you tell me why, and how to troubleshoot this problem? Thanks.

Version 0.41.1 (in TGI 1.0.1)

Test Command: docker run -d --runtime=nvidia --gpus all --shm-size 16g -p 23811:80 -v /data/model:/data/model ghcr.io/huggingface/text-generation-inference:1.0.1 --model-id /data/model --num-shard 2 --max-total-tokens 4096 --max-input-length 3072 --quantize bitsandbytes (or bitsandbytes-nf4 or bitsandbytes-fp4)

Test Result: (benchmark screenshot attached to the original issue; not reproduced here)
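
For reference, loading the same checkpoint with NF4 quantization outside of TGI looks roughly like this (a minimal sketch, not anything TGI-specific; the fp16 compute dtype is an assumption):

```python
# Rough sketch: load the checkpoint with 4-bit NF4 quantization directly via
# transformers + bitsandbytes, which is approximately what
# --quantize bitsandbytes-nf4 enables inside TGI.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/data/model"  # path mounted in the docker command above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # or "fp4"
    bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)
```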

Qubitium commented 10 months ago

Short version: there is no hardware-accelerated "nf4" type on these GPUs. They only have FP32, FP16, BF16 (some), FP64 (some), and FP8 (some), so the code has to emulate NF4 on what is primarily FP32/FP16 hardware.
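
To illustrate what that emulation means (a simplified sketch, not the actual bitsandbytes kernel, and the codebook below is a stand-in rather than the real NF4 table): every forward pass has to expand the packed 4-bit codes through a codebook and per-block scales back into a regular dtype before the hardware matmul can run.

```python
import torch

# Stand-in 16-entry codebook; the real NF4 table is defined in the
# QLoRA paper and the bitsandbytes source.
CODEBOOK = torch.linspace(-1.0, 1.0, 16, dtype=torch.float16)

def emulated_nf4_linear(x, codes, absmax, block_size=64):
    """Simplified NF4-style linear layer (illustration only).

    x:      (batch, in_features) activations
    codes:  (out_features, in_features) int indices into the codebook (0..15)
    absmax: per-block scales from block-wise quantization
    """
    w = CODEBOOK[codes]                               # dequantize: codebook lookup
    w = w.view(-1, block_size) * absmax.view(-1, 1)   # undo per-block scaling
    w = w.view(codes.shape).to(x.dtype)               # back to (out, in)
    return x @ w.t()                                  # the only hardware-fast step

# Example shapes: a 4096x4096 layer stored as 4-bit codes plus per-block scales.
out_f, in_f, block = 4096, 4096, 64
codes = torch.randint(0, 16, (out_f, in_f))
absmax = torch.rand(out_f * in_f // block, dtype=torch.float16)
x = torch.randn(1, in_f)  # float32 here so the sketch also runs on CPU
y = emulated_nf4_linear(x, codes, absmax, block)
```

In the real library the lookup and rescale are fused into CUDA kernels, but it is still extra work on top of the plain fp16/fp32 matmul, which is the overhead described above.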

junzhang-zj commented 9 months ago

Curious to ask: have you implemented 4-bit or NF4 acceleration yet?

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

vonchenplus commented 2 months ago

Hello @TimDettmers, thanks for your great work.

Any update on this issue? I've also noticed on my end that 8-bit and 4-bit inference are much slower than fp16.

bitsandbytes 0.41.0
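
One way to put a number on "much slower" is to time fp16 and NF4 generation side by side. A rough sketch, where the checkpoint path, prompt, and token count are placeholders:

```python
# Hedged benchmark sketch: compare fp16 vs 4-bit NF4 generation latency for the
# same prompt on a single GPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/data/model"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

def time_generate(model, n_tokens=128):
    """Greedy-decode n_tokens and return wall-clock seconds."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# fp16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
print("fp16:", time_generate(model_fp16), "s")
del model_fp16
torch.cuda.empty_cache()

# 4-bit NF4 via bitsandbytes
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
print("nf4 :", time_generate(model_4bit), "s")
```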