Thank you for reporting this. In my setup, the fp16 speed is closer to the nf4 speed if you run with more tokens (to smooth out the variance), but the main problem is that, for some reason, the register pressure is higher on A100 than on other GPUs. I never directly benchmarked on A100s, and this was unexpected. The high register pressure leads to an occupancy of 40%, which basically leads to a slowdown of 2.5x.
I will need to create a work-around. A simple work-around with __launch_bounds__ does not seem to help. As such, this will be a more complicated fix.
I have also seen a slowdown in my tests using bitsandbytes' 4-bit and 8-bit quantization on an A100 80G (bnb: 0.41.0, CUDA Version: 11.7).
open_llama_3B + LoRA on A100 (HF, 1 beam, float16): ~23 t/s
open_llama_3B + LoRA on A100 (HF, 1 beam, bitsandbytes 4bit 0.41.0): ~16 t/s
open_llama_3B + LoRA on A100 (HF, 1 beam, bitsandbytes 8bit 0.41.0): ~7 t/s
And these are the numbers after running @ChenMnZ's script:
for openlm-research/open_llama_3b
fp16 speed: 56.32390258485641 token/s
nf4 speed: 25.252466712656336 token/s

for yahma/llama-7b-hf
fp16 speed: 44.824235393537535 token/s
nf4 speed: 20.757886320825538 token/s
Note: I am getting similar numbers for bnb 0.40.2 as well.
@filipemesquita @ChenMnZ I was wondering if you have achieved a proper speedup with 4bit? It still bothers me a lot.
@ChenMnZ Hi, have you found any quantization methods to speed up the inference time for Llama 2 (7B) model, as this nf4 inference speed is comparatively low?
@Rahu218 Yes, mlc-llm can compile the quantized model and achieve nearly a 2x speedup. For more details, you can refer to my recent work OmniQuant and see this file. However, there are some problems with mlc-llm that prevent it from running 3-bit models successfully, but you can try 4-bit quantization yourself.
AWQ can also achieve a significant speedup.
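For reference, loading a pre-quantized AWQ checkpoint might look roughly like the sketch below. This assumes the AutoAWQ package; the checkpoint id and prompt are only illustrative and are not taken from this thread.

```python
# Minimal sketch (not from this thread): run generation from an AWQ-quantized
# checkpoint using the AutoAWQ package. The model id and prompt are examples.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/Llama-2-7B-Chat-AWQ"  # illustrative pre-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# fuse_layers=True enables the fused kernels that give most of the speedup
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```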
Thanks for the reply @ChenMnZ. I am currently working on speeding up the inference time for my Llama 2 (7B) model with bitsandbytes quantization; the code is provided below.
Can you guide me on how to use the OmniQuant technique in my case to lower the inference time? I am running the model on Google Colab Pro.
The original inference time was 35 s; the inference time after quantization with the code below is 60 s.
Code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain.llms import HuggingFacePipeline

name = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # for open-ended generation

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    num_return_sequences=1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    device_map="auto",  # finds GPU
    max_length=2000,
    top_k=10,
    top_p=0.9,
    temperature=0.8,
    batch_size=1,
)

llm = HuggingFacePipeline(pipeline=generation_pipe)
```
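To compare the 35 s baseline against the 60 s quantized run on equal footing, a quick timing of a single call might look like this. The prompt is illustrative and the call style assumes the langchain version used above.

```python
# Hedged sketch: time one generation with the pipeline built above,
# so fp16 and nf4 configurations can be compared on the same prompt.
import time

prompt = "Summarize the plot of Hamlet in three sentences."  # example prompt
start = time.time()
result = llm(prompt)  # langchain LLM call on the HuggingFacePipeline wrapper
print(result)
print(f"inference time: {time.time() - start:.1f} s")
```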
I am also seeing slower speeds with 4-bit vs. FP16 using OpenNMT-py / Mistral at batch_size=1.
NF4
[2023-11-22 14:19:34,537 INFO] Loading checkpoint from mistral-7B/mistral-sft_step_1000.pt
[2023-11-22 14:19:38,534 INFO] bnb_NF4 compression of layer ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
[2023-11-22 14:19:39,276 INFO] Loading data into the model
[2023-11-22 14:19:55,469 INFO] Total translation time (s): 14.0
[2023-11-22 14:19:55,469 INFO] Average translation time (ms): 7004.4
[2023-11-22 14:19:55,469 INFO] Tokens per second: 36.5
Time w/o python interpreter load/terminate: 20.941147565841675
FP16
[2023-11-22 14:17:20,412 INFO] Loading checkpoint from mistral-7B/mistral-sft_step_1000.pt
[2023-11-22 14:17:24,415 INFO] Loading data into the model
[2023-11-22 14:17:37,064 INFO] Total translation time (s): 10.5
[2023-11-22 14:17:37,064 INFO] Average translation time (ms): 5269.4
[2023-11-22 14:17:37,064 INFO] Tokens per second: 48.6
Time w/o python interpreter load/terminate: 16.660045623779297
So it is difficult to understand where the supposed 4x speedup comes from.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
cc @matthewdouglas for visibility
I conducted an inference speed test on LLaMA-7B using bitsandbytes-0.40 on an A100-80G. I found that the speed of `nf4` has been significantly improved compared to QLoRA. However, the speed of `nf4` is still slower than `fp16`.
Specifically, I evaluated the speed with the following code:
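The exact script is not preserved in this excerpt; the following is a minimal sketch of such a tokens-per-second comparison, assuming the Hugging Face transformers generate API. The model id, prompt, and token count are illustrative.

```python
# Hedged sketch of an fp16 vs. nf4 tokens-per-second benchmark (illustrative,
# not the original script from this issue).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "yahma/llama-7b-hf"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

def measure_speed(model, prompt="The quick brown fox", max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / (time.time() - start)

# fp16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16 speed: {measure_speed(model_fp16)} token/s")
del model_fp16
torch.cuda.empty_cache()

# nf4 (4-bit) variant
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model_nf4 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
print(f"nf4 speed: {measure_speed(model_nf4)} token/s")
```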
the output is:

The above results show that `nf4` is only approximately 0.6x the speed of `fp16`. I would like to know how to achieve the claimed 3.4x speedup mentioned in this link.