Hi, thanks for your interest and your response!
Do you mean inference in a quantized format, or training with QLoRA, which is not supported for Imp yet?
If you mean inference, check whether use_cache
is enabled, and check whether the model stops generating after a </s>
token is produced.
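For the </s> check, a minimal sketch like the following might help (assuming AutoTokenizer works for this checkpoint with trust_remote_code, and that output_ids is the 1-D sequence returned by model.generate, as in the snippets below):

from transformers import AutoTokenizer

# Load the tokenizer for the same checkpoint and verify that generation
# actually stopped at the EOS (</s>) token.
tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1-3b", trust_remote_code=True)

last_token = output_ids[-1].item()
print("EOS token id:", tokenizer.eos_token_id)
print("last generated token id:", last_token)
print("stopped at EOS:", last_token == tokenizer.eos_token_id)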
Hi @ParadoxZW, thanks for the reply.
I am currently using it for inference. Here is what I have tried:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Create model (note: torch_dtype here is float16, while the 4-bit compute
# dtype above is bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
and then, when calling the forward pass, I set `use_cache` to True:
# Generate with the KV cache enabled; [0] takes the first returned sequence
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True,
)[0]
When I load the 4-bit model, the latency is about 1.2 seconds.
But if I load the normal model, it takes 0.2 seconds for the same image.
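For reference, here is a rough sketch of how the latency could be measured for a fair comparison (the warm-up pass and torch.cuda.synchronize() calls are my own additions, assuming a CUDA device; measure_latency is a hypothetical helper):

import time
import torch

def measure_latency(model, input_ids, image_tensor, n_runs=5):
    # Warm-up run so CUDA kernels and allocator caches are initialized
    model.generate(input_ids, max_new_tokens=100, images=image_tensor, use_cache=True)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(input_ids, max_new_tokens=100, images=image_tensor, use_cache=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

print(f"average latency: {measure_latency(model, input_ids, image_tensor):.2f} s")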
I will check further on the token generation.
If you can point me to the file I need to check, I can give it a shot at fixing it if you think it's an issue.
I'm open to helping and contributing.
Thanks
I checked for the EOS token; I don't see it generated in the output.
So I am not sure why a 4-bit quantized model would take more time for inference than the base model.
Are we missing anything? @ParadoxZW, what else can I check?
You can check the output to see whether the quantized model generates much longer outputs than the base model.
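A quick way to compare could look like this (a sketch; output_ids_4bit and output_ids_base are hypothetical names for the sequences returned by generate in the two settings, and the prompt is assumed to be included at the start of each sequence, as in the model card example):

# Number of newly generated tokens for each model; a much longer output
# from the quantized model would explain part of the latency gap.
prompt_len = input_ids.shape[1]
print("4-bit new tokens:", output_ids_4bit.shape[-1] - prompt_len)
print("fp16  new tokens:", output_ids_base.shape[-1] - prompt_len)

# Decoded text, with the prompt stripped off
print("4-bit output:", tokenizer.decode(output_ids_4bit[prompt_len:], skip_special_tokens=True).strip())
print("fp16  output:", tokenizer.decode(output_ids_base[prompt_len:], skip_special_tokens=True).strip())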
@abhijitherekar We have investigated the quantization methods recently. According to existing studies, using int8 or other 4-bit quantization strategies will indeed slow down inference. For more explanation, see the paper "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale".
If you do find any bugs in our code, please let us know.
Hi @MIL-VLG, thanks for the response.
To summarize my understanding: this int8/4-bit quantization makes already slower models go even slower, but this doesn't happen with larger models like LLaVA. Is that right?
Please share your thoughts. Thanks
Hi, I am trying to see if I can load a quantized version of this model.
When I load it in 4-bit, the model size is smaller but the latency increases significantly.
I'm not sure whether any changes need to be made to support quantization.
Please let me know.
I can also help by creating an MR to make the quantized model better.
Thanks