yangjianxin1 opened 1 year ago
I think Tim is working on the 4bit inference kernel which hopefully will be available in the coming weeks
During inference, will the model also convert between fp16 and NF4? If so, does the inference kernel make up for that overhead? BTW, maybe you can check out https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-inference-4bit. In my tests, it is even faster than fp16 on an A10.
Yes, this one is pretty fast, around 2x faster in 4-bit than fp16.
But a faster QLoRA would be better, since it supports most available models. With GPTQ you can pretty much only run GPT-like models; RoBERTa or T5, for example, are not supported.
Any updates on the 4bit inference kernel? @JianbangZ I am also experiencing issues with inference speed.
We have written a small script to convert models trained with QLoRA to CTranslate2 to speed up inference, available here: https://github.com/Actable-AI/llm-utils/blob/main/qlora2ct2/convert_qlora2_ct2.py
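For reference, once the QLoRA adapter has been merged back into the base model, the conversion step itself typically goes through CTranslate2's stock converter CLI. This is only a sketch of that command line with placeholder paths, not the linked script itself:

```python
# Sketch of the CTranslate2 conversion step. The directory names below are
# placeholders; the linked convert_qlora2_ct2.py handles merging the QLoRA
# adapter into the base model before this step.
def ct2_convert_cmd(merged_model_dir, output_dir, quantization="int8_float16"):
    """Build the ct2-transformers-converter command for a merged HF model."""
    return [
        "ct2-transformers-converter",
        "--model", merged_model_dir,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]

cmd = ct2_convert_cmd("merged-mt0-xxl", "mt0-xxl-ct2")
print(" ".join(cmd))
```

The `int8_float16` quantization mode matches what is discussed later in this thread; CTranslate2 also accepts e.g. `int8` or `float16` here.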
@trunghlt thanks for sharing - do you have any benchmarks that compare the speed improvement with CTranslate2?
For us, it's a 4x-6x speed-up.
thanks @trunghlt what type of hardware are you running? A100s? also, are you using int8 for inference? which models have you converted successfully?
We have successfully converted mt0-xxl on RTX6000
What absolute numbers do you see? How many tokens/sec or ms/token? I have an RTX6000 as well and I have numbers for ggml and GPTQ; I would like a comparison.
CTranslate2 covers much more than ggml and GPTQ because it supports many different model architectures, not just decoder-only ones. It also supports batching, so inference is much better optimized. When I tested with mt0-xxl, which takes >40 GB if loaded in full precision, after converting to CT2 with "int8_float16" quantization it only takes <20 GB and inference is much faster. I will provide inference-time benchmarks soon.
@JianbangZ With my setup, converting the mt0-xxl (13B) model to CT2 (using int8_float16 quantization) and then running inference with compute_type = int8_float16 on 300 samples on an RTX6000, here are some numbers:
@JianbangZ @vgoklani If you further set batch_size = 4 (instead of doing inference with batch = 1) and beam_size = 5 with the same settings as above, inference is 2x faster compared to batch_size = 1: latency_per_output_token = 16.65 ms/tok, output_tok_per_sec = 67.64 tok/sec.
thanks @trannhatquy how good is the quantization? Do you have a ppl comparison? I'm planning to use MPT-7B and eventually MPT-30B
@vgoklani I only compared on my own tasks (named entity recognition, ABSA, and relation extraction): the performance (accuracy, F1, recall) of the quantized CT2 model of mt0-xxl is just 0.5% to 1.5% lower than the full mt0-xxl model, which is negligible given the inference-speed gains. CTranslate2 also supports MPT models.
We used a Jetson Orin 16G and found that 4-bit is slower than 8-bit:
8-bit: 2.29 tokens/sec
4-bit: 0.73 tokens/sec
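For context, the same numbers can be expressed as per-token latency. This is just arithmetic on the figures reported above, not a new measurement:

```python
def ms_per_token(tok_per_sec):
    """Convert throughput (tokens/sec) to per-token latency (ms/token)."""
    return 1000.0 / tok_per_sec

# The Jetson Orin 16G numbers reported above:
print(f"8-bit: {ms_per_token(2.29):.0f} ms/token")  # ~437 ms/token
print(f"4-bit: {ms_per_token(0.73):.0f} ms/token")  # ~1370 ms/token
print(f"8-bit is {2.29 / 0.73:.1f}x faster here")   # ~3.1x
```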
How does this deal with the mixed precision problem of merging?
Generally the base model is in 4-bit and the LoRA adapter in 16-bit. If you merge your LoRA adapter into it, its weights are lowered to 4-bit, which means a loss of performance compared to keeping a separate adapter.
Actually, QLoRA does not really save the model in 4-bit (as far as I know). It only uses 4-bit during training, and when you merge a 16-bit adapter into a 16-bit base model, the precision is not mixed, so I don't think that's a problem.
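To make the earlier precision concern concrete: here is a toy uniform absmax 4-bit quantizer (a deliberate simplification of NF4, which uses a non-uniform codebook, with made-up random weights). It shows that a typical LoRA update is an order of magnitude smaller than the 4-bit rounding error, so an adapter merged into a base that is then stored in 4-bit gets largely rounded away:

```python
import random

def quantize_4bit(ws):
    """Toy absmax 4-bit quantization: snap each weight to one of 16 levels,
    then dequantize. (Real NF4 uses a non-uniform codebook; same idea.)"""
    scale = max(abs(w) for w in ws) / 7
    return [max(-8, min(7, round(w / scale))) * scale for w in ws]

random.seed(0)
W = [random.gauss(0, 1) for _ in range(4096)]      # stand-in base weights
BA = [random.gauss(0, 0.01) for _ in range(4096)]  # stand-in LoRA update, much smaller

# Merging first pushes the adapter through the 4-bit quantizer:
merged = quantize_4bit([w + d for w, d in zip(W, BA)])
# Keeping the adapter separate applies it at full precision on top:
separate = [qw + d for qw, d in zip(quantize_4bit(W), BA)]

quant_err = sum(abs(w - qw) for w, qw in zip(W, quantize_4bit(W))) / len(W)
adapter_mag = sum(abs(d) for d in BA) / len(BA)
print(f"mean 4-bit rounding error: {quant_err:.4f}")
print(f"mean adapter contribution: {adapter_mag:.4f}")
# The adapter signal sits well below the 4-bit rounding noise floor,
# so merging before 4-bit storage mostly rounds it away.
```

This supports keeping the adapter separate (or merging into a 16-bit copy of the base, as the reply above describes) rather than merging into 4-bit weights.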
Thanks for the excellent work. When I use 4-bit for inference, it's very slow, even slower than 8-bit inference. Do you plan to address this? Thanks~