Open Worromots opened 5 months ago
System Info

A100-80G, CUDA 12.1
bitsandbytes 0.43.2.dev0
diffusers 0.29.1
lion-pytorch 0.2.2
torch 2.0.1
torch-tb-profiler 0.4.3
torchvision 0.16.1+cu121
xformers 0.0.22
transformers 4.31.0
Reproduction

load code (the int8 quantized load and the float16 load used for comparison):

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model_name = 'google/flan-t5-xxl'

# int8 load via bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=5.1)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    # torch_dtype=torch.float16,
    quantization_config=quantization_config,
)

# float16 load
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    torch_dtype=torch.float16,
).to(device).eval()
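As a side note, a minimal sketch for checking the memory numbers right after each load is given below; it is an assumed measurement helper, not part of the original report, and it assumes `model` is the instance created above.

import torch

# Sketch: report memory after loading (assumes `model` is one of the
# models created above). get_memory_footprint() counts the parameters
# and buffers tracked by transformers; memory_allocated() is what
# PyTorch currently holds on the GPU.
print(f"model footprint: {model.get_memory_footprint() / 2**20:.0f} MiB")
print(f"CUDA allocated : {torch.cuda.memory_allocated() / 2**20:.0f} MiB")

Note that numbers taken from nvidia-smi also include the CUDA context and the caching allocator's reserved memory, so they will be higher than memory_allocated().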
inference code:

text_tokens_and_mask = self.tokenizer(
    texts,
    max_length=self.model_max_length,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt',
)
self.prof.step()
with torch.no_grad():
    text_encoder_embs = self.model(
        input_ids=text_tokens_and_mask['input_ids'].to(self.device),
        attention_mask=text_tokens_and_mask['attention_mask'].to(self.device),
    )['last_hidden_state'].detach()
return text_encoder_embs, text_tokens_and_mask['attention_mask'].to(self.device)
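For reference, a hedged sketch of how the per-forward latency could be measured with CUDA events follows; the warmup loop, the standalone `encoder` variable, and the input names are assumptions, not the reporter's actual harness.

import torch

# Sketch: time one encoder forward with CUDA events so the GPU work is
# finished before the timer stops. Assumes `encoder` is the T5 encoder
# (e.g. model.get_encoder()) and `input_ids` / `attention_mask` are
# already on the GPU.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(3):  # warmup iterations (assumed)
        encoder(input_ids=input_ids, attention_mask=attention_mask)
    torch.cuda.synchronize()
    start.record()
    encoder(input_ids=input_ids, attention_mask=attention_mask)
    end.record()
torch.cuda.synchronize()
print(f"T5 encoder forward: {start.elapsed_time(end):.1f} ms")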
Expected behavior

17443 MiB when loaded in int8 with BitsAndBytesConfig; T5 encoder forward time: 96 ms.
11759 MiB when loaded in float16; T5 encoder forward time: 21 ms.

The int8 model costs more inference time and more memory than the fp16 model. The torch profiler shows that the quantized model is using the int matmul kernel.
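For the kernel observation, a profiling sketch like the one below would list the CUDA kernels actually launched during one forward pass; the profiler activities and table settings are assumptions, not the reporter's exact torch-tb-profiler setup.

import torch
from torch.profiler import profile, ProfilerActivity

# Sketch: capture the CUDA kernels of a single encoder forward and print
# the most expensive ones; on the int8 model the table should be dominated
# by int8 GEMM kernels rather than fp16 GEMMs. Assumes `encoder`,
# `input_ids`, and `attention_mask` as in the timing sketch above.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        encoder(input_ids=input_ids, attention_mask=attention_mask)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))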
Hi, did you solve the issue?
cc @matthewdouglas