yangjianxin1 opened 1 year ago
I think Tim is working on the 4bit inference kernel which hopefully will be available in the coming weeks
During inference, will the model also convert between fp16 and NF4? If so, does the inference kernel make up for that overhead? BTW, maybe you can check out https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-inference-4bit. In my tests, it is even faster than fp16 on an A10.
Yes, this one is pretty fast, around 2x faster in 4-bit than fp16.
But a faster QLoRA would be better, since it supports most available models. With GPTQ you can pretty much only run GPT-like models; RoBERTa or T5, for example, are not supported.
Any updates on the 4bit inference kernel? @JianbangZ I am also experiencing issues with inference speed.
We have written a small script to convert models trained with QLoRA to CTranslate2 to speed up inference, available here: https://github.com/Actable-AI/llm-utils/blob/main/qlora2ct2/convert_qlora2_ct2.py
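For reference, once the QLoRA adapter has been merged back into the base model, the conversion step itself typically goes through CTranslate2's stock converter CLI. This is only a sketch of that command line with placeholder paths, not the linked script itself:

```python
# Sketch of the CTranslate2 conversion step. The directory names below are
# placeholders; the linked convert_qlora2_ct2.py handles merging the QLoRA
# adapter into the base model before this step.
def ct2_convert_cmd(merged_model_dir, output_dir, quantization="int8_float16"):
    """Build the ct2-transformers-converter command for a merged HF model."""
    return [
        "ct2-transformers-converter",
        "--model", merged_model_dir,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]

cmd = ct2_convert_cmd("merged-mt0-xxl", "mt0-xxl-ct2")
print(" ".join(cmd))
```

The `int8_float16` quantization mode matches what is discussed later in this thread; CTranslate2 also accepts e.g. `int8` or `float16` here.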
@trunghlt thanks for sharing - do you have any benchmarks that compare the speed improvement with CTranslate2?
For us, it's a 4x-6x speed-up.
thanks @trunghlt what type of hardware are you running? A100s? also, are you using int8 for inference? which models have you converted successfully?
We have successfully converted mt0-xxl on RTX6000
What absolute numbers do you see? How many tokens/sec or ms/token? I have an RTX6000 as well and I have numbers for ggml and GPTQ; I would like a comparison.
CTranslate2 covers much more than ggml and GPTQ because it supports many different model architectures, not just decoder-only ones. It also supports batching, so inference is much better optimized. When I tested with mt0-xxl, which takes >40 GB if loaded in full precision, after converting to CT2 with "int8_float16" quantization it only takes <20 GB and inference is much faster. I will provide inference-time benchmarks soon.
@JianbangZ With my setup, converting the mt0-xxl (13B) model to CT2 (using int8_float16 quantization) and then running inference with compute_type = int8_float16 on 300 samples on an RTX6000, here are some numbers:
@JianbangZ @vgoklani If you further set batch_size = 4 (instead of doing inference with batch = 1) and beam_size = 5 with the same settings as above, inference is 2x faster compared to batch_size = 1: latency_per_output_token = 16.65 ms/tok, output_tok_per_sec = 67.64 tok/sec.
thanks @trannhatquy how good is the quantization? Do you have a ppl comparison? I'm planning to use MPT-7B and eventually MPT-30B
@vgoklani I only compared on my own tasks (named entity recognition, ABSA, and relation extraction): the performance (accuracy, F1, recall) of the quantized CT2 model of mt0-xxl is just 0.5% to 1.5% lower than the full mt0-xxl model, which is negligible given the inference-speed gains. CTranslate2 also supports MPT models.
We used a Jetson Orin 16G and found that 4-bit is slower than 8-bit:
8-bit: 2.29 tokens/sec
4-bit: 0.73 tokens/sec
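For context, the same numbers can be expressed as per-token latency. This is just arithmetic on the figures reported above, not a new measurement:

```python
def ms_per_token(tok_per_sec):
    """Convert throughput (tokens/sec) to per-token latency (ms/token)."""
    return 1000.0 / tok_per_sec

# The Jetson Orin 16G numbers reported above:
print(f"8-bit: {ms_per_token(2.29):.0f} ms/token")  # ~437 ms/token
print(f"4-bit: {ms_per_token(0.73):.0f} ms/token")  # ~1370 ms/token
print(f"8-bit is {2.29 / 0.73:.1f}x faster here")   # ~3.1x
```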
How does this deal with the mixed precision problem of merging?
Generally the base model is in 4-bit and the LoRA adapter in 16-bit. If you merge your LoRA adapter into it, its weights are lowered to 4-bit, which means a loss of performance compared to keeping a separate adapter.
Actually, QLoRA does not really save the model in 4-bit (as far as I know). It only uses 4-bit during training, and when you merge a 16-bit adapter into a 16-bit base model, the precision is not mixed, so I don't think that's a problem.
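To make the earlier precision concern concrete: here is a toy uniform absmax 4-bit quantizer (a deliberate simplification of NF4, which uses a non-uniform codebook, with made-up random weights). It shows that a typical LoRA update is an order of magnitude smaller than the 4-bit rounding error, so an adapter merged into a base that is then stored in 4-bit gets largely rounded away:

```python
import random

def quantize_4bit(ws):
    """Toy absmax 4-bit quantization: snap each weight to one of 16 levels,
    then dequantize. (Real NF4 uses a non-uniform codebook; same idea.)"""
    scale = max(abs(w) for w in ws) / 7
    return [max(-8, min(7, round(w / scale))) * scale for w in ws]

random.seed(0)
W = [random.gauss(0, 1) for _ in range(4096)]      # stand-in base weights
BA = [random.gauss(0, 0.01) for _ in range(4096)]  # stand-in LoRA update, much smaller

# Merging first pushes the adapter through the 4-bit quantizer:
merged = quantize_4bit([w + d for w, d in zip(W, BA)])
# Keeping the adapter separate applies it at full precision on top:
separate = [qw + d for qw, d in zip(quantize_4bit(W), BA)]

quant_err = sum(abs(w - qw) for w, qw in zip(W, quantize_4bit(W))) / len(W)
adapter_mag = sum(abs(d) for d in BA) / len(BA)
print(f"mean 4-bit rounding error: {quant_err:.4f}")
print(f"mean adapter contribution: {adapter_mag:.4f}")
# The adapter signal sits well below the 4-bit rounding noise floor,
# so merging before 4-bit storage mostly rounds it away.
```

This supports keeping the adapter separate (or merging into a 16-bit copy of the base, as the reply above describes) rather than merging into 4-bit weights.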
Thanks for the excellent work. When I use 4-bit for inference, it's very slow, even slower than 8-bit inference. Do you plan to address this? Thanks~