bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

[Question] How to obtain the 4-bit inference speedup? #611

Closed ChenMnZ closed 9 months ago

ChenMnZ commented 1 year ago

I tested the inference speed of LLaMA-7B with bitsandbytes 0.40 on an A100-80G. I found that the speed of NF4 has improved significantly compared to QLoRA; however, NF4 is still slower than fp16.

Specifically, I evaluated the speed with the following code:

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MAX_NEW_TOKENS = 128
model_name = 'path/to/llama-7b'

text = 'Hamburg is in which country?\n'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

# Leave ~2 GB of headroom per GPU.
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f'{free_in_GB - 2}GB'
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}

# fp16 baseline
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float16,
    max_memory=max_memory,
)
time_1 = time.time()
num = 0
for i in range(5):
    generated_ids = model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    num += len(generated_ids[0])
print(f"fp16 speed: {num / (time.time() - time_1)} token/s")

# 4-bit NF4 with double quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    max_memory=max_memory,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
)
time_1 = time.time()
num = 0
for i in range(5):
    generated_ids = model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    num += len(generated_ids[0])
print(f"nf4 speed: {num / (time.time() - time_1)} token/s")

The output is:

fp16 speed: 36.28083223507306 token/s
nf4 speed: 21.665383178407968 token/s

The above results show that nf4 is only approximately 0.6x the speed of fp16. I would like to know how to achieve the claimed 3.4x speedup as mentioned in this link.

TimDettmers commented 1 year ago

Thank you for reporting this. In my setup, the fp16 speed is closer to the nf4 speed if you run with more tokens (to smooth out the variance), but the main problem is that, for some reason, the register pressure is higher on A100 than on other GPUs. I never directly benchmarked on A100s, and this was unexpected. The high register pressure leads to an occupancy of 40%, which basically leads to a slowdown of 2.5x.

I will need to create a work-around. A simple work-around with __launch_bounds__ does not seem to help. As such, this will be a more complicated fix.
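For context, averaging over a longer generation and synchronizing the GPU before reading the clock makes this kind of comparison less noisy. Below is a minimal benchmarking sketch along those lines; the warmup pass, the runs/max_new_tokens values, and the helper name bench are illustrative additions, not from this thread:

import time
import torch

def bench(model, input_ids, max_new_tokens=512, runs=3):
    # One warmup pass so CUDA initialization and kernel setup do not skew the timing.
    model.generate(input_ids, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    start = time.time()
    total_new_tokens = 0
    for _ in range(runs):
        out = model.generate(input_ids, max_new_tokens=max_new_tokens)
        total_new_tokens += out.shape[1] - input_ids.shape[1]  # count generated tokens only
    torch.cuda.synchronize()
    return total_new_tokens / (time.time() - start)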

filipemesquita commented 1 year ago

I have also seen a slowdown in my tests using bitsandbytes' 4-bit and 8-bit quantization on an A100 80G (bnb: 0.41.0, CUDA Version: 11.7).

open_llama_3B + LoRA on A100 (HF, 1 beam, float16): ~23 t/s
open_llama_3B + LoRA on A100 (HF, 1 beam, bitsandbytes 4bit 0.41.0): ~16 t/s
open_llama_3B + LoRA on A100 (HF, 1 beam, bitsandbytes 8bit 0.41.0): ~7 t/s

[Screenshot of benchmark output, 2023-07-28]

And these are the numbers after running @ChenMnZ's script:

For openlm-research/open_llama_3b:
fp16 speed: 56.32390258485641 token/s
nf4 speed: 25.252466712656336 token/s

For yahma/llama-7b-hf:
fp16 speed: 44.824235393537535 token/s
nf4 speed: 20.757886320825538 token/s

Note: I am getting similar numbers for bnb 0.40.2 as well.

junzhang-zj commented 1 year ago

@filipemesquita @ChenMnZ I was wondering if you have achieved a proper speedup with 4-bit? It still bothers me a lot.

Rahu218 commented 1 year ago

@ChenMnZ Hi, have you found any quantization methods to speed up the inference time for the Llama 2 (7B) model, since this nf4 inference speed is comparatively low?

ChenMnZ commented 1 year ago

@Rahu218 Yes, mlc-llm can compile the quantized model and achieve nearly a 2x speedup. For more details, you can refer to my recent work OmniQuant and see this file. However, there are some problems with mlc-llm that prevent it from running 3-bit models successfully, but you can try 4-bit quantization yourself.

AWQ can also achieve a significant speedup.
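For reference, a pre-quantized AWQ checkpoint can be loaded directly through transformers once the autoawq package is installed. A minimal sketch, assuming an illustrative checkpoint id such as TheBloke/Llama-2-7B-AWQ:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative AWQ-quantized checkpoint; any AWQ Llama checkpoint should work the same way.
model_id = "TheBloke/Llama-2-7B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hamburg is in which country?\n", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))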

Rahu218 commented 1 year ago

Thanks for the reply @ChenMnZ. I am currently working on speeding up the inference time for my Llama 2 (7B) model with bitsandbytes quantization; the code is provided below.

Can you help me by explaining how to use the OmniQuant technique in my case to lower the inference time? I am running the model on Google Colab Pro.

The original inference time was 35 sec; the inference time after quantization, with the code below, is 60 sec.

Code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain.llms import HuggingFacePipeline

name = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # for open-ended generation

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    num_return_sequences=1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    device_map="auto",  # finds GPU
    max_length=2000,
    top_k=10,
    top_p=0.9,
    temperature=0.8,
    batch_size=1,
)

llm = HuggingFacePipeline(pipeline=generation_pipe)
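To compare the 35 sec and 60 sec numbers fairly, it helps to time only the generation call and synchronize the GPU around it. A small sketch under those assumptions (the prompt string is just an example):

import time
import torch

prompt = "Explain what 4-bit quantization does to a language model."

torch.cuda.synchronize()
start = time.time()
result = generation_pipe(prompt)  # the pipeline defined above
torch.cuda.synchronize()

print(result[0]["generated_text"])
print(f"generation time: {time.time() - start:.1f} s")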

vince62s commented 11 months ago

I am also seeing slower speeds with 4-bit vs. FP16 using OpenNMT-py / Mistral at batch_size=1.

NF4
[2023-11-22 14:19:34,537 INFO] Loading checkpoint from mistral-7B/mistral-sft_step_1000.pt
[2023-11-22 14:19:38,534 INFO] bnb_NF4 compression of layer ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
[2023-11-22 14:19:39,276 INFO] Loading data into the model
[2023-11-22 14:19:55,469 INFO] Total translation time (s): 14.0
[2023-11-22 14:19:55,469 INFO] Average translation time (ms): 7004.4
[2023-11-22 14:19:55,469 INFO] Tokens per second: 36.5
Time w/o python interpreter load/terminate:  20.941147565841675

FP16
[2023-11-22 14:17:20,412 INFO] Loading checkpoint from mistral-7B/mistral-sft_step_1000.pt
[2023-11-22 14:17:24,415 INFO] Loading data into the model
[2023-11-22 14:17:37,064 INFO] Total translation time (s): 10.5
[2023-11-22 14:17:37,064 INFO] Average translation time (ms): 5269.4
[2023-11-22 14:17:37,064 INFO] Tokens per second: 48.6
Time w/o python interpreter load/terminate:  16.660045623779297

So it is difficult to understand the supposed 4x speedup.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Titus-von-Koeller commented 2 months ago

cc @matthewdouglas for visibility