QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

CUDA out of memory for Qwen2-VL-7B-Instruct-GPTQ-Int8 on RTX3090 24G #297

Open shaojun opened 1 month ago

shaojun commented 1 month ago

Hi,
I basically followed https://www.modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8
and thought 24 GB of GPU memory would be enough for the model (screenshot omitted).

But when I ran the Python script, I got the following error:

...
...
  File "/home/shao/miniconda3/envs/qwen2-vl/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 377, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/home/shao/miniconda3/envs/qwen2-vl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shao/miniconda3/envs/qwen2-vl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shao/miniconda3/envs/qwen2-vl/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 350, in forward
    attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.10 GiB. GPU 0 has a total capacity of 23.69 GiB of which 5.83 GiB is free. Including non-PyTorch memory, this process has 17.62 GiB memory in use. Of the allocated memory 17.05 GiB is allocated by PyTorch, and 274.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
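
For reference, the allocator hint at the end of the error can be set as an environment variable before torch is imported; a minimal sketch (this only mitigates fragmentation, it does not reduce total memory use):

import os

# Must be set before the CUDA caching allocator is initialized,
# i.e. before torch allocates anything on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch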

this is my nvidia-smi:

nvidia-smi
Sun Sep 29 16:59:59 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0  On |                  N/A |
| 30%   32C    P8              30W / 350W |    239MiB / 24576MiB |     14%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1530      G   /usr/lib/xorg/Xorg                           35MiB |
|    0   N/A  N/A      3742      G   /usr/lib/xorg/Xorg                           98MiB |
|    0   N/A  N/A      4067      G   /usr/bin/gnome-shell                         80MiB |
+---------------------------------------------------------------------------------------+

and part of pip list:

Package                  Version
------------------------ -----------
accelerate               0.34.2
auto_gptq                0.7.1
decord                   0.6.0
optimum                  1.22.0
tokenizers               0.19.1
torch                    2.4.1
torchvision              0.19.1
tqdm                     4.66.5
transformers             4.45.0.dev0

And I can run the Qwen2-VL-7B-Instruct-GPTQ-Int4 model without any issue.

thanks.

kq-chen commented 1 month ago

flash-attn is needed to reproduce the reported GPU memory usage.

Also, make sure to use attn_implementation="flash_attention_2" when loading the model:

import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
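
For completeness, a minimal sketch of running generation with the model loaded above, following the usage pattern from the model card. The image path is a placeholder, and it assumes qwen_vl_utils is installed (pip install qwen-vl-utils):

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},  # placeholder path
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt, preprocess the image, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output_text)
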
shaojun commented 1 month ago

@kq-chen thanks.

But I noticed in the README:

FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16.

What does it mean? Does using flash attention prevent loading quantized models?

kq-chen commented 1 month ago

Sorry for the confusion. FlashAttention-2 is compatible with quantized models (the attention computation itself is performed in fp16).

shaojun commented 1 month ago

@kq-chen flash attention seems to support only a limited set of devices, and among consumer-level GPUs only the RTX 3090 and RTX 4090: github.com/Dao-AILab/flash-attention?tab=readme-ov-file#nvidia-cuda-support
Does this mean Qwen2-VL 7B with Int8 quantization cannot run on any lower GPU (e.g. an RTX 4080 with 16 GB of memory)? A quick capability check is sketched below.
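
A minimal sketch of that check, assuming the flash-attention requirement of an Ampere-or-newer GPU (compute capability 8.0+):

import torch

major, minor = torch.cuda.get_device_capability(0)
# FlashAttention-2 requires compute capability >= 8.0 (Ampere, Ada, Hopper).
print("compute capability:", (major, minor), "FA2-capable:", (major, minor) >= (8, 0))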

HemanthoDarwin-RabertKennedy-ZS0532 commented 4 weeks ago

The input image resolution also contributes to high GPU memory consumption. Use the following code to limit the image size, and the model will consume less GPU memory:

from transformers import AutoProcessor

# Limit the per-image visual token budget via the pixel range.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
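
For context: each visual token in Qwen2-VL corresponds to a 28x28 pixel patch, so max_pixels = 1280 * 28 * 28 caps an image at roughly 1280 visual tokens, which directly bounds the memory used by the vision encoder and by attention over image tokens.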
bhavyajoshi-mahindra commented 1 week ago

I tried to load my custom GPTQ-quantized Qwen2-VL model but got an error. I am on Windows. I have explained the issue here: https://github.com/QwenLM/Qwen2-VL/issues/520

lebronjamesking commented 1 week ago

Any updates? It seems AWQ and GPTQ are still not working.