TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
### Expected behavior
It should only take around 1 second to complete; a single inference takes around `0.30s`.
### actual behavior
It took significantly longer to complete a batch of size 4: `self.model.generate` took `43` seconds instead.
### additional notes
Please advise if there are any recommendations. Could there be an issue in `tensorrt_llm.runtime.ModelRunner`, or a potential `cudaStreamSynchronize` problem when doing this kind of batch inference?
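One way to rule out a measurement problem is to make sure queued GPU work is finished before the timer stops. A minimal sketch of such a timing helper, assuming nothing about the model (the `torch.cuda.synchronize()` calls are left as comments since they only apply when CUDA is in use; `sum` stands in for `self.model.generate`):

```python
import time

def timed(fn, *args, **kwargs):
    """Time a callable. With CUDA, uncomment the synchronize calls so
    asynchronously queued kernels are fully included in the window."""
    # torch.cuda.synchronize()  # flush any pending GPU work first
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    # torch.cuda.synchronize()  # wait for fn's GPU work to finish
    elapsed = time.perf_counter() - start
    return result, elapsed

# CPU stand-in for a model call:
result, elapsed = timed(sum, range(1_000_000))
```

Without the synchronization, a short wall-clock time can simply mean the work was queued, and the cost surfaces at the next synchronizing call instead.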
Hi, did you try forming a batch from different inputs and running inference? I use different inputs, and the result is wrong; only the longest prompt's result is right.
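The symptom above (only the longest prompt decoding correctly) is typical when unequal-length prompts are batched without correct padding or without passing the true sequence lengths. A minimal sketch of padding token-ID lists to a common length while keeping each sequence's real length — `pad_id` and `pad_batch` are illustrative names, not TensorRT-LLM API:

```python
def pad_batch(batch, pad_id):
    """Right-pad each token-ID list to the longest sequence in the batch.
    Returns the padded batch plus each sequence's true length, which the
    runtime needs so padding positions are ignored during decoding."""
    max_len = max(len(seq) for seq in batch)
    lengths = [len(seq) for seq in batch]
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    return padded, lengths

padded, lengths = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
# padded  -> [[5, 6, 7], [8, 9, 0]]
# lengths -> [3, 2]
```

If the runtime only receives the padded tensor and treats every row as full length, the padded rows are effectively corrupted prompts, which matches "only the longest prompt's result is right".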
### System Info
GPU: A10G

### Who can help?
@kaiyux
### Information

### Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

### Reproduction
In `run.py`:

```python
output_ids = self.model.generate(
    input_ids.to("cpu"),
    sampling_config=None,
    prompt_table_path='prompt_table.npy',
    max_new_tokens=max_new_tokens,
    end_id=end_id,
    pad_id=self.tokenizer.pad_token_id,
    top_k=self.args.top_k,
    num_beams=self.args.num_beams,
    output_sequence_lengths=False,
    return_dict=False)
```
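Note that the call above passes `output_sequence_lengths=False`, so the caller never learns how many tokens each batch row actually produced. If per-row lengths are returned instead, the padded tail of shorter rows can be cut off before decoding. A minimal sketch of that post-processing step, in pure Python with illustrative names (`trim_outputs` is not a TensorRT-LLM function):

```python
def trim_outputs(output_ids, sequence_lengths):
    """Cut each generated row back to its reported length so padding
    tokens from shorter prompts in the batch are not decoded as text."""
    return [row[:length] for row, length in zip(output_ids, sequence_lengths)]

trimmed = trim_outputs([[5, 6, 7, 0, 0], [8, 9, 1, 2, 3]], [3, 5])
# -> [[5, 6, 7], [8, 9, 1, 2, 3]]
```

Decoding the untrimmed rows would feed the pad tokens of shorter sequences into the tokenizer, which is another way batched results can look wrong for everything but the longest prompt.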