NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Batch inference using Llava ModelRunner is much slower than single inference #1227


spoonbobo commented 6 months ago

### System Info

GPU: A10G

### Who can help?

@kaiyux

### Reproduction

  1. Build the engine with `batch_size=4`, following https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava (see the note on `max_multimodal_len` after these steps):
     ```bash
     trtllm-build \
         --checkpoint_dir models/trt_${MODEL_NAME}/fp16/1-gpu \
         --output_dir trt_engines/${MODEL_NAME}/int4_weightonly/1-gpu \
         --gpt_attention_plugin float16 \
         --gemm_plugin float16 \
         --max_batch_size 4 \
         --max_input_len 924 \
         --max_output_len 100 \
         --max_multimodal_len 2304
     ```
  2. Generate results using `run.py`:

     ```python
     self.model = ModelRunner.from_dir(self.args.llm_engine_dir,
                                       rank=tensorrt_llm.mpi_rank(),
                                       debug_mode=True)
     self.model_config = self.model.session._model_config
     ic(self.model_config)

     output_ids = self.model.generate(
         input_ids.to("cpu"), sampling_config=None, prompt_table_path='prompt_table.npy',
         max_new_tokens=max_new_tokens, end_id=end_id, pad_id=self.tokenizer.pad_token_id,
         top_k=self.args.top_k, num_beams=self.args.num_beams,
         output_sequence_lengths=False, return_dict=False)
     ```
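
(A note on the build settings in step 1: as I understand it, `--max_multimodal_len` is `max_batch_size` × the number of visual features per image; assuming the LLaVA-1.5 vision encoder emits 576 visual tokens per image, 4 × 576 = 2304.)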



### Expected behavior

It should take only around 1 second to complete; a single inference takes around `0.30s`.

### Actual behavior

It takes significantly longer to complete a batch of size 4: `self.model.generate` took `43` seconds.

### Additional notes

Please advise if there are any recommendations. Could there be an issue in `tensorrt_llm.runtime.ModelRunner`, or a potential `cudaStreamSynchronize` problem, when doing this kind of batch inference?
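
One way to rule out asynchronous-timing artifacts before suspecting the runtime is to synchronize the device on both sides of the call and time only `generate`. A minimal sketch, reusing the names from step 2 above (`self.model`, `input_ids`, `max_new_tokens`, `end_id` are assumed to be defined as in that snippet):

```python
import time

import torch

# Ensure all previously queued GPU work has finished before starting the timer,
# so the measured interval covers only this generate() call.
torch.cuda.synchronize()
start = time.perf_counter()

output_ids = self.model.generate(
    input_ids.to("cpu"), sampling_config=None, prompt_table_path='prompt_table.npy',
    max_new_tokens=max_new_tokens, end_id=end_id, pad_id=self.tokenizer.pad_token_id,
    top_k=self.args.top_k, num_beams=self.args.num_beams,
    output_sequence_lengths=False, return_dict=False)

# Wait for the generation kernels to complete before reading the clock.
torch.cuda.synchronize()
print(f"batched generate took {time.perf_counter() - start:.2f}s")
```
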
lss15151161 commented 2 months ago

Hi, have you tried using different inputs to form a batch for inference? I use different inputs, and the results are wrong; only the longest prompt's result is correct.
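
For what it's worth, the `examples/run.py` script appears to pass the batch to `ModelRunner.generate` as a list of un-padded per-request 1-D `input_ids` tensors, letting the runner pad internally, and then trims each output row by the returned sequence lengths. A rough sketch along those lines (`runner`, `tokenizer`, and `prompts` are placeholders, and the exact `generate` kwargs may differ across TensorRT-LLM versions):

```python
import torch

# Placeholder inputs; in practice these come from your own pipeline.
prompts = ["a short prompt", "a much longer prompt about something else entirely"]

# One un-padded 1-D tensor of token ids per request; the runner pads the batch.
batch_input_ids = [
    torch.tensor(tokenizer.encode(p), dtype=torch.int32)
    for p in prompts
]

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=100,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
    output_sequence_lengths=True,
    return_dict=True)

# Trim each row to its reported length before decoding, so padding added for
# the shorter prompts does not leak into the decoded text.
output_ids = outputs['output_ids']      # [batch, num_beams, max_seq_len]
seq_lens = outputs['sequence_lengths']  # [batch, num_beams]
for b in range(output_ids.shape[0]):
    tokens = output_ids[b, 0, :seq_lens[b, 0]].tolist()
    print(tokenizer.decode(tokens, skip_special_tokens=True))
```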