QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Apache License 2.0

Bug in beam_search: memory allocation #334

Open greeksharifa opened 2 weeks ago

greeksharifa commented 2 weeks ago

If I run this code:

import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    cache_dir="/model/Qwen/"
)
(...)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "/data/NExTQA/NExTVideo/total/9996338863.mp4",
                # "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
(...)
outputs = model.generate(**inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True, do_sample=True, num_beams=2)

Then an OutOfMemoryError occurs:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 538976288.13 GiB. GPU 1 has a total capacity of 47.54 GiB of which 43.55 GiB is free. Process 741857 has 3.97 GiB memory in use. Of the allocated memory 3.19 GiB is allocated by PyTorch, and 478.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This happens on 6 x A6000 GPUs.

The "Tried to allocate 538976288.13 GiB." figure in the error message is an absurd number. (538976288 is 0x20202020, four ASCII space bytes, which suggests the requested allocation size comes from uninitialized or garbage data.)

P.S. The following code works on our GPU server (it runs on just 1 NVIDIA A6000):

outputs = model.generate(**inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True)
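For reference, here is a minimal sketch of pinning the model to a single GPU instead of sharding it with device_map="auto"; the explicit device map is an assumption for illustration, not something from the original report:

import torch
from transformers import Qwen2VLForConditionalGeneration

# Assumed single-GPU workaround sketch: {"": 0} maps the entire module
# tree onto GPU 0 instead of sharding the weights across devices.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": 0},
    cache_dir="/model/Qwen/",
)
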
HemanthoDarwin-RabertKennedy-ZS0532 commented 13 hours ago

Use image-resolution limits to minimize GPU usage during inference. The code for this optimization is as follows:

from transformers import AutoProcessor

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
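
A minimal end-to-end sketch of how this processor is applied to the video message above, following the usage pattern from the Qwen2-VL README (process_vision_info comes from the qwen_vl_utils package used in that pattern; treat the exact wiring as an assumption for this issue):

from qwen_vl_utils import process_vision_info

# Build the chat prompt and extract the (resolution-limited) visual inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True)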