QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Apache License 2.0

Bug in beam_search: memory allocation #334

Open greeksharifa opened 2 weeks ago

greeksharifa commented 2 weeks ago

If I run this code:

import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    cache_dir="/model/Qwen/"
)
(...)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "/data/NExTQA/NExTVideo/total/9996338863.mp4",
                # "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
(...)
outputs = model.generate(**inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True, do_sample=True, num_beams=2)

Then an OutOfMemoryError occurs:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 538976288.13 GiB. GPU 1 has a total capacity of 47.54 GiB of which 43.55 GiB is free. Process 741857 has 3.97 GiB memory in use. Of the allocated memory 3.19 GiB is allocated by PyTorch, and 478.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This happens on 6 x A6000 GPUs.

The "Tried to allocate 538976288.13 GiB." figure in the error message is an absurd number. (538976288 is 0x20202020, four ASCII space bytes, which suggests the requested allocation size comes from uninitialized or garbage data.)

P.S. The following code works on our GPU server (it runs on just 1 NVIDIA A6000):

outputs = model.generate(**inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True)
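For reference, here is a minimal sketch of pinning the model to a single GPU instead of sharding it with device_map="auto"; the explicit device map is an assumption for illustration, not something from the original report:

import torch
from transformers import Qwen2VLForConditionalGeneration

# Assumed single-GPU workaround sketch: {"": 0} maps the entire module
# tree onto GPU 0 instead of sharding the weights across devices.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map={"": 0},
    cache_dir="/model/Qwen/",
)
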
HemanthoDarwin-RabertKennedy-ZS0532 commented 13 hours ago

Use image-resolution limits to minimize GPU usage during inference. The code for this optimization is as follows:

from transformers import AutoProcessor

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
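
A minimal end-to-end sketch of how this processor is applied to the video message above, following the usage pattern from the Qwen2-VL README (process_vision_info comes from the qwen_vl_utils package used in that pattern; treat the exact wiring as an assumption for this issue):

from qwen_vl_utils import process_vision_info

# Build the chat prompt and extract the (resolution-limited) visual inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True)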