NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen2-VL Batch Bug #2495

Open LugerW-A opened 4 days ago

LugerW-A commented 4 days ago

System Info

x86, TensorRT-LLM 0.16.0

Who can help?

No response

Information

Tasks

Reproduction

Qwen2-VL examples

Expected behavior

Does Qwen2-VL support batch prompts? When the input is a batch, only the first result is returned correctly; the rest are all empty.

```python
print(input_ids.shape)
print(prompt_table.shape)
print(prompt_tasks)
outputs = self.model.generate(
    input_ids,
    input_position_ids=None,
    mrope_params=mrope_params,
    sampling_config=None,
    prompt_table=prompt_table,
    prompt_tasks=prompt_tasks,
    max_new_tokens=max_new_tokens,
    end_id=end_id,
    pad_id=self.model.tokenizer.pad_token_id
    if self.model.tokenizer.pad_token_id is not None
    else self.model.tokenizer.all_special_ids[0],
    top_k=self.args.top_k,
    top_p=self.args.top_p,
    temperature=self.args.temperature,
    repetition_penalty=self.args.repetition_penalty,
    num_beams=self.args.num_beams,
    output_sequence_lengths=True,
    return_dict=True,
)
```
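Side note on reading the result: with `return_dict=True` and `output_sequence_lengths=True`, each batch entry generally has its own sequence length, so each row should be trimmed individually before decoding. A minimal sketch of that trimming, using plain Python lists as stand-ins for the returned tensors (the names `output_ids` / `sequence_lengths` and the toy values are assumptions for illustration, not real model output):

```python
# Trim batched generate() outputs: keep only the generated tokens of each
# batch entry / beam, dropping the shared prompt prefix and trailing padding.
def trim_outputs(output_ids, sequence_lengths, input_len):
    """output_ids: [batch][beam][seq] token ids; sequence_lengths: [batch][beam]."""
    trimmed = []
    for beam_ids, beam_lens in zip(output_ids, sequence_lengths):
        trimmed.append(
            [ids[input_len:length] for ids, length in zip(beam_ids, beam_lens)]
        )
    return trimmed

# Toy batch of 2, beam width 1, prompt length 3, padded to length 8.
output_ids = [
    [[1, 2, 3, 10, 11, 12, 0, 0]],  # entry 0: three generated tokens, then padding
    [[1, 2, 3, 20, 0, 0, 0, 0]],    # entry 1: one generated token
]
sequence_lengths = [[6], [4]]

print(trim_outputs(output_ids, sequence_lengths, input_len=3))
# → [[[10, 11, 12]], [[20]]]
```

If the second entry here came back as nothing but padding or eos tokens after the prompt, the trimmed result would be empty, which matches the symptom described above.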

actual behavior

The input_ids only differ in the first dimension, but the results after the first are incorrect (empty).

additional notes

none

sunnyqgg commented 3 days ago

Hi @LugerW-A, batch inference is supported, but you need to follow the batching process provided by the official Qwen2-VL repo (see https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file for more info), for example:

```python
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "xxx/image1.jpg"},
            {"type": "text", "text": "Describe this picture?"},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "xxxx/image2.jpg"},
            {"type": "text", "text": "Describe this picture? And what kind of colour does it contain?"},
        ],
    }
]
messages = [messages1, messages2]

texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
```
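The key detail in the snippet above is `padding=True`: the two prompts tokenize to different lengths, so the processor pads every sequence to the longest one so the batch forms a rectangular tensor, with an attention mask marking the real tokens. A toy sketch of that mechanism in plain Python (the `pad_id` value and pad-on-the-right direction here are illustrative assumptions; the real processor follows the tokenizer's configuration):

```python
# Pad a batch of token-id sequences to the length of the longest one,
# returning the padded ids and a matching 1/0 attention mask.
def pad_batch(sequences, pad_id=0):
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(ids)   # → [[5, 6, 7], [8, 9, 0]]
print(mask)  # → [[1, 1, 1], [1, 1, 0]]
```

If padding is skipped and the second sequence's extra positions are treated as real input (or the mask is wrong), generation for that entry can degenerate into immediate eos tokens, which is consistent with the empty second output reported in this thread.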

sun2011yao commented 19 hours ago

@sunnyqgg Hi, following the batching approach above, the second output is still empty. Did you print the second output result when you ran it? When I run it here, the second output is all eos_token_id.

sunnyqgg commented 17 hours ago

Hi @sun2011yao, do you specify --batch_size when running with a batch size greater than 1?

sun2011yao commented 17 hours ago

> Hi @sun2011yao, do you specify --batch_size when running with a batch size greater than 1?

Yes, I run the command as follows:

```shell
python3 run.py \
    --hf_model_dir ./${MODEL_NAME} \
    --batch_size 2 \
    --image_path ./pics/demo.jpeg \
    --run_profiling \
    --max_new_tokens 50 \
    --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
    --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/
```

sunnyqgg commented 16 hours ago

Hi, if you hard-code messages = [messages1, messages2] as the default, as above, please don't also pass --image_path ./pics/demo.jpeg, otherwise it won't work. I'll add multi-batch support by allowing multiple values for --image_path later.

sun2011yao commented 15 hours ago

> Hi, if you hard-code messages = [messages1, messages2] as the default, as above, please don't also pass --image_path ./pics/demo.jpeg, otherwise it won't work. I'll add multi-batch support by allowing multiple values for --image_path later.

Hi, I removed --image_path, but the second result is still empty:

```
[['The image shows a woman sitting on a sandy beach with a dog. The dog is wearing a colorful harness and is sitting on its hind legs, giving a high-five to the woman. The woman is wearing a plaid shirt and is smiling. The'], ['']]
```

Can you get the correct results there?