bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index

Quantization of T5 failed: int8 model costs more inference time and memory #1262

Open Worromots opened 5 months ago

Worromots commented 5 months ago

System Info

GPU: A100-80G, CUDA 12.1
bitsandbytes 0.43.2.dev0
diffusers 0.29.1
lion-pytorch 0.2.2
torch 2.0.1
torch-tb-profiler 0.4.3
torchvision 0.16.1+cu121
xformers 0.0.22
transformers 4.31.0

Reproduction

load code

    import torch
    from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    model_name = 'google/flan-t5-xxl'
    quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=5.1)

    # fp16 baseline
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                                  cache_dir=cache_dir,
                                                  torch_dtype=torch.float16).to(device).eval()

    # int8 quantized model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                                  cache_dir=cache_dir,
                                                  #   torch_dtype=torch.float16,
                                                  quantization_config=quantization_config)
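As a sanity check (not part of the original report), one can verify that the 8-bit path was actually taken after the quantized load. A minimal sketch, assuming the int8 `model` from above; `Linear8bitLt` is the bitsandbytes 8-bit linear layer and `get_memory_footprint()` is the transformers helper:

    import torch
    import bitsandbytes as bnb

    # count how many Linear layers were replaced by the bitsandbytes int8 layer
    n_int8 = sum(isinstance(m, bnb.nn.Linear8bitLt) for m in model.modules())
    n_fp = sum(isinstance(m, torch.nn.Linear) and not isinstance(m, bnb.nn.Linear8bitLt)
               for m in model.modules())
    print(f"{n_int8} Linear8bitLt layers, {n_fp} plain nn.Linear layers")

    # parameter/buffer footprint as reported by transformers
    print(f"model footprint: {model.get_memory_footprint() / 2**20:.0f} MiB")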
inference code

    text_tokens_and_mask = self.tokenizer(
        texts,
        max_length=self.model_max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors='pt'
    )

    text_tokens_and_mask['input_ids'] = text_tokens_and_mask['input_ids']
    text_tokens_and_mask['attention_mask'] = text_tokens_and_mask['attention_mask']

    self.prof.step()
    with torch.no_grad():
        text_encoder_embs = self.model(
            input_ids=text_tokens_and_mask['input_ids'].to(self.device),
            attention_mask=text_tokens_and_mask['attention_mask'].to(self.device),
        )['last_hidden_state'].detach()
    return text_encoder_embs, text_tokens_and_mask['attention_mask'].to(self.device)

Expected behavior

Load in int8 with BitsAndBytesConfig: 17443 MiB, T5 encoder time: 96 ms
Load in float16: 11759 MiB, T5 encoder time: 21 ms

The int8 model costs more inference time and memory than the fp16 model.
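
For anyone trying to reproduce these numbers, a minimal timing sketch (not the exact code used above): it assumes a `model` and `tokenizer` loaded as in the reproduction, times only the encoder forward pass with CUDA events, and includes a few warm-up iterations so one-time kernel setup does not skew the comparison:

    import torch

    def time_encoder(encoder, tokenizer, texts, device='cuda', iters=20):
        batch = tokenizer(texts, padding='max_length',
                          max_length=tokenizer.model_max_length,
                          truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            for _ in range(5):                      # warm-up iterations
                encoder(**batch)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        with torch.no_grad():
            for _ in range(iters):
                encoder(**batch)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters      # ms per forward pass

    # e.g. time_encoder(model.get_encoder(), tokenizer, ['a prompt'] * 8)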

[screenshot: torch profiler trace of the quantized model]

The torch profiler trace shows that the quantized model is using an int-matmul kernel.
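
The same kernel-level breakdown can be obtained without TensorBoard via the profiler's summary table. A sketch, assuming the `encoder` and `batch` from the timing snippet above:

    import torch
    from torch.profiler import profile, ProfilerActivity

    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        encoder(**batch)

    # list the most expensive CUDA kernels; for the int8 model the
    # bitsandbytes int8 matmul kernels should appear near the top
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))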

weibaozi commented 1 month ago

Hi, did you solve the issue?

Titus-von-Koeller commented 1 month ago

cc @matthewdouglas