NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Batch inference using Llava ModelRunner is much slower than single inference #1227


spoonbobo commented 6 months ago

### System Info

GPU: A10G

### Who can help?

@kaiyux

### Reproduction

  1. Build the engine with `batch_size=4`, following https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava (see the note on `max_multimodal_len` after these steps):
     ```bash
     trtllm-build \
         --checkpoint_dir models/trt_${MODEL_NAME}/fp16/1-gpu \
         --output_dir trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu \
         --gpt_attention_plugin float16 \
         --gemm_plugin float16 \
         --max_batch_size 4 \
         --max_input_len 924 \
         --max_output_len 100 \
         --max_multimodal_len 2304
     ```
  2. Generate results using `run.py`:

     ```python
     self.model = ModelRunner.from_dir(self.args.llm_engine_dir,
                                       rank=tensorrt_llm.mpi_rank(),
                                       debug_mode=True)
     self.model_config = self.model.session._model_config
     ic(self.model_config)

     output_ids = self.model.generate(
         input_ids.to("cpu"), sampling_config=None, prompt_table_path='prompt_table.npy',
         max_new_tokens=max_new_tokens, end_id=end_id, pad_id=self.tokenizer.pad_token_id,
         top_k=self.args.top_k, num_beams=self.args.num_beams,
         output_sequence_lengths=False, return_dict=False)
     ```
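
(A note on the build settings in step 1: as I understand it, `--max_multimodal_len` is `max_batch_size` × the number of visual features per image; assuming the LLaVA-1.5 vision encoder emits 576 visual tokens per image, 4 × 576 = 2304.)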



### Expected behavior

It should take only around 1 second to complete; a single inference takes around `0.30s`.

### Actual behavior

It takes significantly longer to complete a batch of size 4: `self.model.generate` took `43` seconds.

### Additional notes

Please advise if there are any recommendations. Could there be an issue in `tensorrt_llm.runtime.ModelRunner`, or a potential `cudaStreamSynchronize` problem, when doing this kind of batch inference?
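
One way to rule out asynchronous-timing artifacts before suspecting the runtime is to synchronize the device on both sides of the call and time only `generate`. A minimal sketch, reusing the names from step 2 above (`self.model`, `input_ids`, `max_new_tokens`, `end_id` are assumed to be defined as in that snippet):

```python
import time

import torch

# Ensure all previously queued GPU work has finished before starting the timer,
# so the measured interval covers only this generate() call.
torch.cuda.synchronize()
start = time.perf_counter()

output_ids = self.model.generate(
    input_ids.to("cpu"), sampling_config=None, prompt_table_path='prompt_table.npy',
    max_new_tokens=max_new_tokens, end_id=end_id, pad_id=self.tokenizer.pad_token_id,
    top_k=self.args.top_k, num_beams=self.args.num_beams,
    output_sequence_lengths=False, return_dict=False)

# Wait for the generation kernels to complete before reading the clock.
torch.cuda.synchronize()
print(f"batched generate took {time.perf_counter() - start:.2f}s")
```
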
lss15151161 commented 2 months ago

Hi, have you tried using different inputs to form a batch for inference? I use different inputs, and the results are wrong; only the longest prompt's result is correct.
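
For what it's worth, the `examples/run.py` script appears to pass the batch to `ModelRunner.generate` as a list of un-padded per-request 1-D `input_ids` tensors, letting the runner pad internally, and then trims each output row by the returned sequence lengths. A rough sketch along those lines (`runner`, `tokenizer`, and `prompts` are placeholders, and the exact `generate` kwargs may differ across TensorRT-LLM versions):

```python
import torch

# Placeholder inputs; in practice these come from your own pipeline.
prompts = ["a short prompt", "a much longer prompt about something else entirely"]

# One un-padded 1-D tensor of token ids per request; the runner pads the batch.
batch_input_ids = [
    torch.tensor(tokenizer.encode(p), dtype=torch.int32)
    for p in prompts
]

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=100,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
    output_sequence_lengths=True,
    return_dict=True)

# Trim each row to its reported length before decoding, so padding added for
# the shorter prompts does not leak into the decoded text.
output_ids = outputs['output_ids']      # [batch, num_beams, max_seq_len]
seq_lens = outputs['sequence_lengths']  # [batch, num_beams]
for b in range(output_ids.shape[0]):
    tokens = output_ids[b, 0, :seq_lens[b, 0]].tolist()
    print(tokenizer.decode(tokens, skip_special_tokens=True))
```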