NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

The llava model's batch inference results differ from batch=1 #1844

Open lss15151161 opened 3 days ago

lss15151161 commented 3 days ago

System info

GPU: A100
tensorrt 9.3.0.post12.dev1
tensorrt-llm 0.9.0
torch 2.2.2
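For reference, the versions above can be confirmed like this (assuming a pip-based install; the package names are as published on PyPI):

pip3 show tensorrt tensorrt_llm torch | grep -E '^(Name|Version)'
# tensorrt_llm also exposes its version at runtime
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"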

Reproduction

export MODEL_NAME="llava-1.5-7b-hf"
git clone https://huggingface.co/llava-hf/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
python ../llama/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --gemm_plugin float16 \
    --use_fused_mlp \
    --max_batch_size 16 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 9216 # 16 (max_batch_size) * 576 (num_visual_features)
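Note that max_multimodal_len must be sized for the whole batch: with 576 visual features per image for LLaVA-1.5, max_multimodal_len = 16 (max_batch_size) * 576 (num_visual_features) = 9216. An engine built with the batch-1 value (576) would be too small for --batch_size 16.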

python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type llava # or "--model_type vila" for VILA

python run.py \
    --max_new_tokens 20 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --decoder_llm \
    --input_text "Question: which city is this? Answer:" \
    --batch_size 16
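With greedy decoding, the output for a given prompt should not depend on how the batch is composed, so a simple sanity check (a sketch; the output file names out_bs1.txt / out_bs16.txt are arbitrary) is to run the same command at both batch sizes and diff the generated text:

python run.py \
    --max_new_tokens 20 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --decoder_llm \
    --input_text "Question: which city is this? Answer:" \
    --batch_size 1 > out_bs1.txt

python run.py \
    --max_new_tokens 20 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --decoder_llm \
    --input_text "Question: which city is this? Answer:" \
    --batch_size 16 > out_bs16.txt

# with greedy decoding, any divergence here points at the batched path itself
diff out_bs1.txt out_bs16.txt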

If I use the same data to form a batch, the result looks like this:

[screenshot: batched outputs for identical prompts]

And if I use two different prompts to form a batch, the result looks like this:

[screenshots: batched outputs for two different prompts]

The image used is: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png

TheCodeWrangler commented 2 days ago

I saw similar results with Llama 3. Mine was resolved when I disabled 'use_custom_all_reduce' at engine compilation time.
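For reference, a rebuild with the custom all-reduce kernel disabled would look roughly like this (a sketch: a --use_custom_all_reduce flag existed in trtllm-build around these versions, but check trtllm-build --help for yours; note also that custom all-reduce is a multi-GPU/tensor-parallel optimization, so it may not apply to a 1-gpu engine):

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --gemm_plugin float16 \
    --use_fused_mlp \
    --use_custom_all_reduce disable \
    --max_batch_size 16 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 9216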

hijkzzz commented 2 days ago

Could you try the latest version, TensorRT-LLM 0.11+? https://nvidia.github.io/TensorRT-LLM/installation/linux.html
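A sketch of the upgrade path, following the linked Linux installation guide (verify the exact command there for your CUDA/OS combination; the NVIDIA PyPI index URL below is taken on the assumption it matches that guide):

pip3 install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com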