bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

enhancement: Add 4-bit quantization / inference support #181

Closed · MarkSchmidty closed this issue 10 months ago

MarkSchmidty commented 1 year ago

https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147

> "Our analysis is extensive, spanning 5 models (BLOOM, BLOOMZ, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision."


According to the "case for 4-bit precision" paper, there are essentially only upsides to 4-bit quantization: performance per bit goes up, speed (in some cases) goes up, and performance per parameter stays about the same, just as with 8-bit.

There are already bleeding-edge 4-bit quantization efforts such as GPTQ for LLaMA:

GPTQ is currently the SOTA one-shot quantization method for LLMs. GPTQ supports amazingly low 3-bit and 4-bit weight quantization, and it can be applied to LLaMA. I've confirmed that this works well with LLaMA 7B. I haven't tested the memory usage (n-bit CUDA kernel), but I think it should work.

Model (LLaMA-7B)   Bits   Group size   Wikitext2   PTB     C4
FP16               16     -            5.67        8.79    7.05
RTN                4      -            6.28        9.68    7.70
GPTQ               4      64           6.16        9.66    7.52
RTN                3      -            25.66       61.25   28.19
GPTQ               3      64           12.24       16.77   9.55

code: https://github.com/qwopqwop200/GPTQ-for-LLaMa
From: https://github.com/oobabooga/text-generation-webui/issues/177

Unfortunately, these efforts are currently only academic proofs of concept and are not yet usable for practical inference.
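For readers less familiar with the terminology in the table above: the RTN baseline is plain group-wise round-to-nearest quantization. Below is a minimal, hypothetical sketch of symmetric absmax RTN for a 4-bit weight matrix in PyTorch (my own illustration, not code from GPTQ or this repository; the function names and the group size of 64 are arbitrary):

import torch

def rtn_quantize_4bit(weight: torch.Tensor, group_size: int = 64):
    # Symmetric absmax round-to-nearest: each group of `group_size` weights
    # shares one floating-point scale; values are rounded to integers in [-8, 7].
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), min=-8, max=7).to(torch.int8)
    return q, scale

def rtn_dequantize(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, s = rtn_quantize_4bit(w)
w_hat = rtn_dequantize(q, s, w.shape)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())

GPTQ improves on this baseline by adjusting the remaining weights to compensate for each rounding error, which is why its 3-bit and 4-bit perplexities in the table are lower than RTN's.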


Bitsandbytes is already used by many projects for its 8-bit implementation. By adding 4-bit support, these and other projects could see the same benefits they get today, only doubled: smaller models, broader model accessibility, and better performance.
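As a rough illustration of the "smaller models" point, here is a back-of-the-envelope sketch that counts weight memory only and ignores activations, the KV cache, and quantization-constant overhead:

def weight_memory_gib(n_params: float, bits: int) -> float:
    # bytes = parameters * bits / 8; GiB = bytes / 1024**3
    return n_params * bits / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights, 7B params: ~{weight_memory_gib(7e9, bits):.1f} GiB")
# -> roughly 13.0 GiB at 16-bit, 6.5 GiB at 8-bit, 3.3 GiB at 4-bit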

MarkSchmidty commented 1 year ago

> Mark, the first author of the paper you cited is the author of the bitsandbytes repository.

I'm aware. It's been 3 months since publication with no mention of 4bit in this repo and no open issue.

dustydecapod commented 1 year ago

@TimDettmers If I'm not mistaken, you indicated in another thread that int4 support was in the works -- is that correct?

vackosar commented 1 year ago

  • How complicated is it to add 4-bit quantization? It sounds like just a change of a couple of parameters?
  • Does 4-bit also improve speed? From the NVIDIA post, it seems that on Turing it should. Not sure about the M1.

Dampfinchen commented 1 year ago

With bitsandbytes, INT8 has a dramatic slowdown compared to FP16, and you can expect that to be even worse with INT4. Something in bitsandbytes makes it pretty slow; you would expect INT8 and INT4 to actually be faster than FP16 inference. https://picture.iczhiku.com/weixin/weixin159185416907010.png

In my testing, running a 2.7B-parameter language model on my RTX 2060 laptop, I can confirm that 8-bit via bitsandbytes is dramatically slower than FP16. With the 2.7B model loaded entirely in VRAM using bitsandbytes, its speed was around 4 tokens/s, while with FP16 and 26/33 GPU layers it was close to 6 tokens/s.

@TimDettmers Maybe there is something in the code that can be improved? I wish HF would just put a team behind you and implement the results in HF accelerate.
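For anyone who wants to reproduce this kind of tokens/s comparison, a minimal sketch with transformers + bitsandbytes might look like the following (the model name, prompt, and generation length are placeholders; a real benchmark should add warm-up runs and average over several prompts):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-2.7b"  # placeholder ~2.7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

def tokens_per_second(model, max_new_tokens=128):
    model.eval()
    with torch.no_grad():
        start = time.time()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / (time.time() - start)

# FP16 baseline
fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
print("FP16:", tokens_per_second(fp16), "tokens/s")
del fp16
torch.cuda.empty_cache()

# 8-bit via bitsandbytes
int8 = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
print("INT8:", tokens_per_second(int8), "tokens/s")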

BugReporterZ commented 1 year ago

In theory, performance would increase roughly linearly as precision is decreased, but I don't know exactly what makes bitsandbytes much slower. On my configuration (RTX 3090), for the same language model and prompt I get about 0.33x inference performance with INT8 and bitsandbytes relative to the FP16 Hugging Face implementation.


MarkSchmidty commented 1 year ago

For comparison, llama.cpp running inference in 4-bit mode is about 3x faster than in 16-bit mode for the same model.

This is closer to what I would expect.

practical-dreamer commented 1 year ago

Super excited for bnb int4. Any news on when it'll be out?

Martins6 commented 1 year ago

Just so you all know, they have added 4-bit inference, but only for batch size 1, which is awesome!

https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0

However, like others on this thread, I still see slower inference; FP16 still seems faster. But I'm using NLI models.
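For reference, a minimal sketch of what enabling the new 4-bit path through transformers looks like (assuming transformers >= 4.30 and bitsandbytes >= 0.40.0; the model name here is a placeholder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # or "fp4"
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants (~0.4 bits/param saved)
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", quantization_config=bnb_config, device_map="auto"
)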

vadimkantorov commented 1 year ago

@TimDettmers thanks for the great int4 :) It would be awesome if the int4 variants got added to https://github.com/TimDettmers/bitsandbytes/blob/main/benchmarking/switchback/speed_benchmark.py

What would be the correct way of splitting a batch into several batch_size=1 requests? Just launching them in a loop, or launching them in parallel on several CUDA streams?

Is an inner dim of 4096 required, or do smaller inner dims like 512/2048 also enjoy speed-ups?
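On the first question above, one straightforward reading of "splitting the batch into several batch_size=1 requests" is a plain Python loop over the batch, sketched below (my own sketch; whether overlapping the per-sample calls on separate CUDA streams would help is exactly the open question):

import torch

@torch.no_grad()
def forward_one_by_one(model, input_ids: torch.Tensor, attention_mask: torch.Tensor):
    # Run each row of the padded batch as its own batch_size=1 forward pass.
    logits = []
    for i in range(input_ids.shape[0]):
        out = model(input_ids=input_ids[i : i + 1], attention_mask=attention_mask[i : i + 1])
        logits.append(out.logits)
    return torch.cat(logits, dim=0)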

Shuai-Xie commented 1 year ago

@TimDettmers Thanks for the very easy-to-use 4-bit inference. Very friendly!

I tried to reproduce the 3.4x speedup on an A100 using a LLaMA 7B model, as described in the release notes excerpted below.

[image: excerpt from the bitsandbytes release notes]

However, like others on this thread, I see slower inference; FP16 seems faster.

Experiments on the LAMBADA validation dataset are below.

[image: LAMBADA evaluation results]

Maybe I am not using 4-bit in the right way. The code is below. Looking forward to your reply. Many thanks!

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import BitsAndBytesConfig
from transformers.models.llama import LlamaForCausalLM, LlamaTokenizer

class Evaluator:
    def __init__(self, dataset, tokenizer, device, pad_token: str = None):
        self.dataset = dataset
        self.device = device
        if pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        else:
            tokenizer.pad_token = pad_token  # e.g. "[PAD]" for llama
        tokenizer.padding_side = "left"

        # tokenize the dataset
        def tokenize_function(examples):
            prompts = tokenizer(
                examples["text"], return_tensors="pt", padding=True
            )
            return {
                "input_ids": prompts["input_ids"][:, :-1],  # all tokens except the last
                "attention_mask": prompts["attention_mask"][:, :-1],
                "labels": prompts["input_ids"][:, -1],  # the last token is the prediction target
            }

        self.dataset = self.dataset.map(tokenize_function, batched=True)
        self.dataset.set_format(
            type="torch", columns=["input_ids", "attention_mask", "labels"]
        )

    @torch.no_grad()
    def evaluate(self, model, batch_size: int = 1):
        model.eval()
        total, hit = 0, 0
        dataloader = DataLoader(self.dataset, batch_size)
        tbar = tqdm(dataloader, desc="acc")
        for batch in tbar:
            input_ids = batch["input_ids"].to(self.device)
            attention_mask = batch["attention_mask"].to(self.device)
            label = batch["labels"]  # torch.Tensor
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            last_token_logits = outputs.logits[:, -1, :]  # logits shape: [batch, seq_len, vocab_size]
            pred = last_token_logits.argmax(dim=-1).cpu()
            hit += (pred == label).sum().item()
            total += label.size(0)
            acc = hit / total
            tbar.set_description(f"acc: {acc:.3f}")
        return acc

model_path = "decapoda-research/llama-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_path)
val_dataset = load_dataset("lambada", split="validation[:1000]")
val_evaluator = Evaluator(val_dataset, tokenizer, "cuda", pad_token="[PAD]")

# Load ONE of the following (each assignment below overwrites `model`):

# FP16
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.float16
)

# INT8
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", load_in_8bit=True
)

# FP4
fp4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=False,  # True would save an additional ~0.4 bits per parameter
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["lm_head"],
)
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", quantization_config=fp4_config
)

acc = val_evaluator.evaluate(model, batch_size=1)
print(f"quantized model accuracy: {acc}")
junzhang-zj commented 1 year ago

@Shuai-Xie I would like to ask whether llm_int8_skip_modules=["lm_head"] still works for 4-bit?

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

puyuanOT commented 10 months ago

Will there also be support for int4, in addition to fp4 and nf4?

ganliqiang commented 7 months ago

> @TimDettmers Thanks for the very easy-to-use 4-bit inference. Very friendly!
>
> I tried to reproduce the 3.4x speedup on an A100 using a LLaMA 7B model, as described in the release notes. However, like others on this thread, I see slower inference; FP16 seems faster.
>
> Experiments on the LAMBADA validation dataset:
>
>   • The speed of LLM.int8 is 15% of FP16.
>   • The speed of LLM.FP4 is 56% of FP16.
>   • The speed of LLM.NF4 is 56% of FP16.
>
> Maybe I am not using 4-bit in the right way. The code is below.

> (The quoted evaluation code is identical to the code in @Shuai-Xie's comment above.)

Hi, have you found the issue? Why is the int8 model so much slower than float16? Can you share your fix?

ganliqiang commented 7 months ago

> In theory, performance would increase roughly linearly as precision is decreased, but I don't know exactly what makes bitsandbytes much slower. On my configuration (RTX 3090), for the same language model and prompt I get about 0.33x inference performance with INT8 and bitsandbytes relative to the FP16 Hugging Face implementation.

Hi, I am encountering the same issue. Have you found out why the int8 model is so slow? Do you have any solutions?