Closed MarkSchmidty closed 10 months ago
Mark, the first author of the paper you cited is the author of the bitsandbytes repository.
I'm aware. It's been 3 months since publication with no mention of 4bit in this repo and no open issue.
@TimDettmers If I'm not mistaken, you indicated in another thread that int4 support was in the works -- is that correct?
- How complicated is it to add 4-bit quantization? It sounds like just a change of a couple of parameters?
- Does 4-bit also improve speed? From the Nvidia post, it seems that on Turing it should. Not sure about M1.
With bitsandbytes, INT8 has a dramatic slowdown compared to FP16, and you can expect that to be even worse with INT4. There's something going on with bitsandbytes that makes it pretty slow. You would expect INT8 and INT4 to actually be faster than FP16 inference. https://picture.iczhiku.com/weixin/weixin159185416907010.png
In my testing, running a 2.7B-parameter language model on my RTX 2060 laptop, I can confirm that 8-bit with bitsandbytes is dramatically slower than FP16. With the 2.7B model loaded entirely in VRAM using bitsandbytes, its speed was around 4 tokens/s, while with FP16 and 26/33 GPU layers it was close to 6 tokens/s.
@TimDettmers Maybe there is something in the code that can be improved? I wish HF would just put a team behind you and implement the results in HF accelerate.
In theory, performance would increase linearly as precision is decreased, but I don't know what exactly makes bitsandbytes so much slower. On my configuration (RTX 3090), for the same language model and prompt I get about 0.33x inference performance with INT8 and bitsandbytes relative to the FP16 Hugging Face implementation.
For comparison, llama.cpp running inference in 4bit mode is about 3x faster than in 16bit mode for the same model.
This is closer to what I would expect.
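For anyone who wants to sanity-check these numbers, here is a minimal, hedged timing sketch. The model name facebook/opt-2.7b is just a stand-in for the 2.7B model mentioned above, and it assumes transformers with accelerate and bitsandbytes installed:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-2.7b"  # placeholder; substitute the model you are testing
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

def tokens_per_second(model, new_tokens=128):
    # warm-up run so CUDA kernels and caches are initialized
    model.generate(**prompt, max_new_tokens=8)
    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**prompt, max_new_tokens=new_tokens)
    torch.cuda.synchronize()
    generated = out.shape[1] - prompt["input_ids"].shape[1]
    return generated / (time.time() - start)

# FP16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
print("fp16 tok/s:", tokens_per_second(model_fp16))
del model_fp16
torch.cuda.empty_cache()

# LLM.int8() via bitsandbytes
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"
)
print("int8 tok/s:", tokens_per_second(model_int8))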
Super excited for bnb int4. Any news on when it'll be out?
Just so you guys know, they have added 4-bit inference, but only for batch size 1, which is awesome!
https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0
However, like some on this thread, I still face similar issues with slower inference. It seems FP16 is still faster. But I'm using NLI models.
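For reference, this is roughly how the new 4-bit path is enabled through transformers on top of bitsandbytes 0.40.0. A minimal sketch: huggyllama/llama-7b is a placeholder model, and NF4 is just one of the supported quant types.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "huggyllama/llama-7b"  # placeholder model

# NF4 weights with FP16 compute, as introduced with bitsandbytes 0.40.0 / QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", quantization_config=bnb_config
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))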
@TimDettmers thanks for the great int4 :) Would it be possible to add int4 variants to https://github.com/TimDettmers/bitsandbytes/blob/main/benchmarking/switchback/speed_benchmark.py?
What is the correct way to split a batch into several batch_size=1 requests? Just launching them in a loop, or launching them in parallel on several CUDA streams?
Is an inner dim of 4096 required, or do smaller inner dims like 512/2048 also see speed-ups?
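Regarding the inner-dim question, one quick way to check is a micro-benchmark of bnb.nn.Linear4bit against an FP16 nn.Linear at batch size 1 (the case the 4-bit kernel targets). This is only a sketch, under the assumption that bitsandbytes >= 0.40 exposes Linear4bit with the compute_dtype and quant_type keywords; I can't say whether CUDA streams would help for the batching question.

import time
import torch
import bitsandbytes as bnb

@torch.no_grad()
def bench(layer, x, iters=100):
    # warm-up, then time `iters` forward passes
    for _ in range(10):
        layer(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1e3  # ms per call

batch = 1  # the fast 4-bit kernel currently only covers batch size 1
for dim in (512, 2048, 4096):
    x = torch.randn(batch, dim, dtype=torch.float16, device="cuda")

    fp16 = torch.nn.Linear(dim, dim, bias=False).half().cuda()

    nf4 = bnb.nn.Linear4bit(
        dim, dim, bias=False, compute_dtype=torch.float16, quant_type="nf4"
    ).cuda()  # weights are quantized when the module is moved to the GPU

    print(f"dim={dim}: fp16 {bench(fp16, x):.3f} ms, nf4 {bench(nf4, x):.3f} ms")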
@TimDettmers Thanks for the very easy-to-use 4-bit inference. Very friendly!
I tried to reproduce the 3.4x speedup on an A100 using a LLaMA 7B model, as described in the release notes.
However, like some on this thread, I also see slower inference. It seems FP16 is faster.
Experiments on the LAMBADA validation dataset are below.
- The speed of LLM.int8 is 15% of FP16.
- The speed of LLM.FP4 is 56% of FP16.
- The speed of LLM.NF4 is 56% of FP16.
Maybe I didn't use 4-bit in the right way. The code is below. Looking forward to your reply. Many thanks!
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import BitsAndBytesConfig
from transformers.models.llama import LlamaForCausalLM, LlamaTokenizer


class Evaluator:
    def __init__(self, dataset, tokenizer, device, pad_token: str = None):
        self.dataset = dataset
        self.device = device
        if pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        else:
            tokenizer.pad_token = pad_token  # e.g. "[PAD]" for llama
        tokenizer.padding_side = "left"

        # tokenize the dataset: the last token of each example becomes the label
        def tokenize_function(examples):
            prompts = tokenizer(
                examples["text"], return_tensors="pt", padding=True
            )
            return {
                "input_ids": prompts["input_ids"][:, :-1],
                "attention_mask": prompts["attention_mask"][:, :-1],
                "labels": prompts["input_ids"][:, -1],
            }

        self.dataset = self.dataset.map(tokenize_function, batched=True)
        self.dataset.set_format(
            type="torch", columns=["input_ids", "attention_mask", "labels"]
        )

    @torch.no_grad()
    def evaluate(self, model, batch_size: int = 1):
        model.eval()
        total, hit = 0, 0
        dataloader = DataLoader(self.dataset, batch_size)
        tbar = tqdm(dataloader, desc="acc")
        for batch in tbar:
            input_ids = batch["input_ids"].to(self.device)
            attention_mask = batch["attention_mask"].to(self.device)
            label = batch["labels"]
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            last_token_logits = outputs.logits[:, -1, :]  # [batch_size, vocab_size]
            pred = last_token_logits.argmax(dim=-1).cpu()
            hit += (pred == label).sum().item()
            total += label.size(0)
            acc = hit / total
            tbar.set_description(f"acc: {acc:.3f}")
        return acc


model_path = "decapoda-research/llama-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(model_path)
val_dataset = load_dataset("lambada", split="validation[:1000]")
val_evaluator = Evaluator(val_dataset, tokenizer, "cuda", pad_token="[PAD]")

# Load one of the following three variants at a time.

# FP16
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.float16
)

# INT8
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", load_in_8bit=True
)

# FP4
fp4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=False,  # double quant would save an extra ~0.4 bits per parameter
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["lm_head"],
)
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", quantization_config=fp4_config
)

acc = val_evaluator.evaluate(model, batch_size=1)
print(f"quantized model accuracy: {acc}")
@Shuai-Xie I would like to ask whether llm_int8_skip_modules=["lm_head"] still works for 4-bit?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Will there also be support for int4, in addition to fp4 and nf4?
(Quoted reply: @Shuai-Xie's benchmark comment above, reporting LLM.int8 at 15% and FP4/NF4 at 56% of FP16 speed, along with the same evaluation code.)
Hi, have you found the issue? Why is the int8 model so much slower than float16? Can you share your fix?
In theory, performance would linearly increase as precision is decreased, but I don't know what makes bitsandbytes much slower exactly. On my configuration (RTX3090), for the same language model and prompt I get about 0.33x inference performance with INT8 and bitsandbytes relatively to the FP16 Huggingface implementation.
Hi, I ran into the same issue. Have you found out why the int8 model is so slow? Do you have any solutions?
https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147 "Our analysis is extensive, spanning 5 models (BLOOM, BLOOMZ, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision."
According to the case for 4-bit precision paper, there are essentially only upsides to 4-bit quantization: performance per bit goes up, speed (in some cases) goes up, and performance per parameter stays about the same, just as with 8-bit.
There are already bleeding-edge 4-bit quantization efforts such as GPTQ for LLaMA.
Unfortunately, these efforts are only academic PoCs and are not useful for inference.
Bitsandbytes is already used by many projects for its 8-bit implementation. By adding 4-bit, these and other projects could see the same benefits they do now, except doubly so: smaller models, more model accessibility, and better performance.
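To make the "smaller models, more accessibility" point concrete, here is the rough weight-memory arithmetic for a 7B-parameter model (ignoring activations and the KV cache; the ~0.5 bit per parameter for 4-bit quantization constants is an approximation):

params = 7e9
# 4-bit includes roughly 0.5 extra bits/parameter for block-wise quantization constants
for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4.5)]:
    print(f"{name}: {params * bits / 8 / 1024**3:.1f} GiB")
# fp16: 13.0 GiB, int8: 6.5 GiB, 4-bit: ~3.7 GiB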