TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

low performance with Llama-2-13b-hf #712

Closed: PYNing closed this issue 6 months ago

PYNing commented 10 months ago

I tried to quantize the Llama-2-13b-hf model using bitsandbytes, but I found that int4 inference is slower than fp16 inference, on both A100 and 3090 GPUs. Can you tell me why, and how to troubleshoot this problem? Thanks.

Version 0.41.1 (in TGI 1.0.1)

Test Command: docker run -d --runtime=nvidia --gpus all --shm-size 16g -p 23811:80 -v /data/model:/data/model ghcr.io/huggingface/text-generation-inference:1.0.1 --model-id /data/model --num-shard 2 --max-total-tokens 4096 --max-input-length 3072 --quantize bitsandbytes (or bitsandbytes-nf4 or bitsandbytes-fp4)

Test Result: (benchmark screenshot attached to the original issue; not reproduced here)
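
For reference, loading the same checkpoint with NF4 quantization outside of TGI looks roughly like this (a minimal sketch, not anything TGI-specific; the fp16 compute dtype is an assumption):

```python
# Rough sketch: load the checkpoint with 4-bit NF4 quantization directly via
# transformers + bitsandbytes, which is approximately what
# --quantize bitsandbytes-nf4 enables inside TGI.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/data/model"  # path mounted in the docker command above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # or "fp4"
    bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)
```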

Qubitium commented 10 months ago

Short version: there is no hardware-accelerated "nf4" type on these GPUs. They only have FP32, FP16, BF16 (some), FP64 (some), and FP8 (some), so the code has to emulate NF4 on what is primarily FP32/FP16 hardware.
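
To illustrate what that emulation means (a simplified sketch, not the actual bitsandbytes kernel, and the codebook below is a stand-in rather than the real NF4 table): every forward pass has to expand the packed 4-bit codes through a codebook and per-block scales back into a regular dtype before the hardware matmul can run.

```python
import torch

# Stand-in 16-entry codebook; the real NF4 table is defined in the
# QLoRA paper and the bitsandbytes source.
CODEBOOK = torch.linspace(-1.0, 1.0, 16, dtype=torch.float16)

def emulated_nf4_linear(x, codes, absmax, block_size=64):
    """Simplified NF4-style linear layer (illustration only).

    x:      (batch, in_features) activations
    codes:  (out_features, in_features) int indices into the codebook (0..15)
    absmax: per-block scales from block-wise quantization
    """
    w = CODEBOOK[codes]                               # dequantize: codebook lookup
    w = w.view(-1, block_size) * absmax.view(-1, 1)   # undo per-block scaling
    w = w.view(codes.shape).to(x.dtype)               # back to (out, in)
    return x @ w.t()                                  # the only hardware-fast step

# Example shapes: a 4096x4096 layer stored as 4-bit codes plus per-block scales.
out_f, in_f, block = 4096, 4096, 64
codes = torch.randint(0, 16, (out_f, in_f))
absmax = torch.rand(out_f * in_f // block, dtype=torch.float16)
x = torch.randn(1, in_f)  # float32 here so the sketch also runs on CPU
y = emulated_nf4_linear(x, codes, absmax, block)
```

In the real library the lookup and rescale are fused into CUDA kernels, but it is still extra work on top of the plain fp16/fp32 matmul, which is the overhead described above.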

junzhang-zj commented 9 months ago

Curious to ask: have you implemented 4-bit or NF4 acceleration yet?

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

vonchenplus commented 2 months ago

Hello @TimDettmers, thanks for your great work.

Any update on this issue? I've also noticed on my end that 8-bit and 4-bit inference are much slower than fp16.

bitsandbytes 0.41.0
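
One way to put a number on "much slower" is to time fp16 and NF4 generation side by side. A rough sketch, where the checkpoint path, prompt, and token count are placeholders:

```python
# Hedged benchmark sketch: compare fp16 vs 4-bit NF4 generation latency for the
# same prompt on a single GPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/data/model"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

def time_generate(model, n_tokens=128):
    """Greedy-decode n_tokens and return wall-clock seconds."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# fp16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
print("fp16:", time_generate(model_fp16), "s")
del model_fp16
torch.cuda.empty_cache()

# 4-bit NF4 via bitsandbytes
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
print("nf4 :", time_generate(model_4bit), "s")
```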