Open · shaojun opened this issue 1 month ago
flash-attn is needed to reproduce the reported GPU memory usage. Also, make sure to use attn_implementation="flash_attention_2" when loading the model:
import torch
from transformers import Qwen2VLForConditionalGeneration

# Load the GPTQ-Int8 checkpoint with FlashAttention-2 enabled
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
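For completeness, a minimal generation sketch following the usual model-card usage pattern (the image path and prompt below are placeholders, and it reuses the model loaded above):

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},  # placeholder path
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and gather the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output_text[0])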
@kq-chen thanks. But I noticed this in the README: "FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16." What does that mean? Does using flash attention disable loading quantized models?
Sorry for the confusion. FlashAttention-2 is compatible with quantized models (the attention computation itself is performed in fp16).
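A quick sanity check (sketch only; _attn_implementation is an internal transformers attribute and may change between versions):

# Confirm the compute dtype is fp16 and FlashAttention-2 is active on the quantized model
print(model.dtype)                        # expected: torch.float16
print(model.config._attn_implementation)  # expected: "flash_attention_2" (internal attribute)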
@kq-chen flash-attention seems to support only a limited set of devices; among consumer-level GPUs, only the RTX 3090 and RTX 4090 are listed: github.com/Dao-AILab/flash-attention?tab=readme-ov-file#nvidia-cuda-support. Does this mean the Int8-quantized Qwen2-VL 7B cannot run on any lower GPU card (RTX 4080 with 16G memory, etc.)?
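In the meantime, I assume the model can still be loaded without flash-attn by falling back to PyTorch's built-in SDPA attention (a sketch, not verified on a 16G card; it may use more memory than FlashAttention-2):

import torch
from transformers import Qwen2VLForConditionalGeneration

# Fallback: PyTorch scaled-dot-product attention when flash-attn is unavailable
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # assumed fallback; "eager" also works but is slower
    device_map="auto",
)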
The input image resolution is also a reason for the high GPU consumption. Use the following code to bound the image size, and the model will consume less GPU memory.
from transformers import AutoProcessor

# Limit each image to 256-1280 visual tokens (one token per 28x28 pixel patch)
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
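If I remember the model card correctly, the same limits can also be set per image inside the message itself, which helps when only some images in a conversation are large (sketch; the path is a placeholder):

# Per-image resolution cap, handled by qwen_vl_utils.process_vision_info
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",  # placeholder path
                "min_pixels": 256 * 28 * 28,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]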
I tried to load my custom GPTQ-quantized Qwen2-VL model but got an error. I am using Windows. I have explained the issue here: https://github.com/QwenLM/Qwen2-VL/issues/520
Any updates? It still seems that AWQ and GPTQ are not working.
Hi,
I basically followed https://www.modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8 and thought 24G of GPU memory would be enough for the model, but I got an error when trying to run the python script. The error message, my nvidia-smi output, and part of my pip list are attached. I can run the Qwen2-VL-7B-Instruct-GPTQ-Int4 model smoothly. Thanks.