NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
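For orientation, a minimal sketch of that high-level Python API. This is hedged: the LLM entry point has moved between releases, and the model name below is only a placeholder, not part of this issue.

# Minimal sketch of the high-level API (entry points vary across releases;
# the model name is a placeholder).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds or loads an engine
sampling = SamplingParams(max_tokens=32)

for output in llm.generate(["What does TensorRT-LLM do?"], sampling):
    print(output.outputs[0].text)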

❗ Phi3-Visual: Incorrect outputs #2195

Closed: eoastafurov closed this issue 4 weeks ago

eoastafurov commented 2 months ago

System Info

Hello TensorRT-LLM team! 👋 I'm facing an issue where the inference output does not contain the expected "Singapore" text. Below are the details of my setup and steps to reproduce the issue.

🔧 System Information:

Who can help?

No response

Reproduction

🐳 Dockerfile:

FROM nvidia/cuda:12.5.1-devel-ubuntu22.04

# Python, OpenMPI (required by the tensorrt_llm wheel via mpi4py), and
# git-lfs for pulling model weights.
RUN apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

COPY requirements.txt .
RUN pip3 install -r requirements.txt

# torchvision and flash-attn are used by the Phi-3-vision remote modeling
# and preprocessing code.
RUN pip3 install torchvision
RUN pip3 install flash-attn --no-build-isolation

📋 requirements.txt:

--extra-index-url https://pypi.nvidia.com
tensorrt_llm==0.13.0.dev2024090300
datasets~=2.14.5
evaluate~=0.4.1
rouge_score~=0.1.2
einops~=0.7.0
tiktoken==0.6.0

⚙️ Commands to Reproduce the Environment:

# Clone Phi weights
git-lfs clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

# Build image
docker build -f Dockerfile.bugreport -t tensorrtlm-bugreport:latest .

# Run container
docker run -v "$(pwd)":/models --gpus '"device=2"' -it tensorrtlm-bugreport:latest bash

📜 Pip Freeze Inside Container:

absl-py==2.1.0
accelerate==0.34.0
aenum==3.1.15
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
async-timeout==4.0.3
attrs==24.2.0
build==1.2.1
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
click-option-group==0.5.6
cloudpickle==3.0.0
colored==2.2.4
coloredlogs==15.0.1
cuda-python==12.6.0
datasets==2.14.7
diffusers==0.30.2
dill==0.3.7
einops==0.7.0
evaluate==0.4.2
filelock==3.15.4
flash-attn==2.6.3
frozenlist==1.4.1
fsspec==2023.10.0
h5py==3.10.0
huggingface-hub==0.24.6
humanfriendly==10.0
idna==3.8
importlib_metadata==8.4.0
janus==1.0.0
Jinja2==3.1.4
joblib==1.4.2
lark==1.2.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mpi4py==4.0.0
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.3
ninja==1.11.1.1
nltk==3.9.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-modelopt==0.15.1
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
onnx==1.16.2
optimum==1.21.4
packaging==24.1
pandas==2.2.2
pillow==10.3.0
polygraphy==0.49.9
protobuf==5.28.0
psutil==6.0.0
PuLP==2.9.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pynvml==11.5.3
pyproject_hooks==1.1.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
regex==2024.7.24
requests==2.32.3
rich==13.8.0
rouge-score==0.1.2
safetensors==0.4.4
scipy==1.14.1
sentencepiece==0.2.0
six==1.16.0
StrEnum==0.4.15
sympy==1.13.2
tensorrt==10.3.0
tensorrt-cu12==10.3.0
tensorrt-cu12-bindings==10.3.0
tensorrt-cu12-libs==10.3.0
tensorrt-llm==0.13.0.dev2024090300
tiktoken==0.6.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.4.1
torchvision==0.19.1
tqdm==4.66.5
transformers==4.42.4
triton==3.0.0
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
xxhash==3.5.0
yarl==1.9.11
zipp==3.20.1

📝 Steps to Reproduce the Bug Inside the Container:

# 1. Convert the HF checkpoint to TensorRT-LLM checkpoint format
python3 /models/convert_checkpoint_phi.py \
    --model_dir /models/Phi-3-vision-128k-instruct \
    --output_dir /models/trtllm-checkpoint \
    --dtype float16

# 2. Build the LLM engine (max_seq_len 4608 leaves 4608 - 4096 = 512 tokens
#    of generation budget on top of the 4096-token input)
trtllm-build \
    --checkpoint_dir /models/trtllm-checkpoint \
    --output_dir /models/trtllm-engine \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --max_seq_len 4608 \
    --max_multimodal_len 4096

# 3. Build the vision encoder engine
python3 /models/build_visual_engine.py \
    --model_type phi-3-vision \
    --model_path /models/Phi-3-vision-128k-instruct \
    --output_dir /models/trt-vision-engine \
    --max_batch_size 1

# 4. Run multimodal inference
python3 /models/run.py \
    --hf_model_dir /models/Phi-3-vision-128k-instruct \
    --visual_engine_dir /models/trt-vision-engine \
    --llm_engine_dir /models/trtllm-engine \
    --image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png
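Not part of the original report: a hypothetical smoke test that wraps the last command above and checks the output for the expected token, which can be handy for bisecting builds.

# Hypothetical smoke test: rerun the reproduction command and assert that
# the expected city name appears in the output.
import subprocess

cmd = [
    "python3", "/models/run.py",
    "--hf_model_dir", "/models/Phi-3-vision-128k-instruct",
    "--visual_engine_dir", "/models/trt-vision-engine",
    "--llm_engine_dir", "/models/trtllm-engine",
    "--image_path=https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print("PASS" if "Singapore" in result.stdout else "FAIL: 'Singapore' missing")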

Expected behavior

Outputs should contain "Singapore"

Actual behavior

❗ Unexpected Output of run.py (Missing "Singapore"):

[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024090300
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024090300 found in the config file, assuming engine(s) built by new builder API.
[09/05/2024-09:34:10] [TRT-LLM] [W] Only the cogvlm is supported in C++ session for now, fallback to Python session.
[09/05/2024-09:34:10] [TRT-LLM] [I] Loading engine from /models/deletme_vision_engine/model.engine
[09/05/2024-09:34:10] [TRT-LLM] [I] Creating session from engine /models/deletme_vision_engine/model.engine
[09/05/2024-09:34:10] [TRT] [I] Loaded engine size: 827 MiB
[09/05/2024-09:34:10] [TRT] [I] [MS] Running engine with multi stream info
[09/05/2024-09:34:10] [TRT] [I] [MS] Number of aux streams is 1
[09/05/2024-09:34:10] [TRT] [I] [MS] Number of total worker streams is 2
[09/05/2024-09:34:10] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[09/05/2024-09:34:10] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1418, now: CPU 0, GPU 2243 (MiB)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[09/05/2024-09:34:13] [TRT-LLM] [W] Implicitly setting Phi3Config.original_max_position_embeddings = 4096
[09/05/2024-09:34:13] [TRT-LLM] [W] Implicitly setting Phi3Config.longrope_scaling_short_factors = [1.05, 1.05, 1.05, 1.1, 1.1, 1.1, 1.2500000000000002, 1.2500000000000002, 1.4000000000000004, 1.4500000000000004, 1.5500000000000005, 1.8500000000000008, 1.9000000000000008, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.000000000000001, 2.1000000000000005, 2.1000000000000005, 2.2, 2.3499999999999996, 2.3499999999999996, 2.3499999999999996, 2.3499999999999996, 2.3999999999999995, 2.3999999999999995, 2.6499999999999986, 2.6999999999999984, 2.8999999999999977, 2.9499999999999975, 3.049999999999997, 3.049999999999997, 3.049999999999997]
[09/05/2024-09:34:13] [TRT-LLM] [W] Implicitly setting Phi3Config.longrope_scaling_long_factors = [1.0299999713897705, 1.0499999523162842, 1.0499999523162842, 1.0799999237060547, 1.2299998998641968, 1.2299998998641968, 1.2999999523162842, 1.4499999284744263, 1.5999999046325684, 1.6499998569488525, 1.8999998569488525, 2.859999895095825, 3.68999981880188, 5.419999599456787, 5.489999771118164, 5.489999771118164, 9.09000015258789, 11.579999923706055, 15.65999984741211, 15.769999504089355, 15.789999961853027, 18.360000610351562, 21.989999771118164, 23.079999923706055, 30.009998321533203, 32.35000228881836, 32.590003967285156, 35.56000518798828, 39.95000457763672, 53.840003967285156, 56.20000457763672, 57.95000457763672, 59.29000473022461, 59.77000427246094, 59.920005798339844, 61.190006256103516, 61.96000671386719, 62.50000762939453, 63.3700065612793, 63.48000717163086, 63.48000717163086, 63.66000747680664, 63.850006103515625, 64.08000946044922, 64.760009765625, 64.80001068115234, 64.81001281738281, 64.81001281738281]
[09/05/2024-09:34:13] [TRT-LLM] [I] Set dtype to float16.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set gemm_plugin to float16.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set identity_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set nccl_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set lookup_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set lora_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set moe_plugin to auto.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set context_fmha to True.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set paged_kv_cache to True.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set remove_input_padding to True.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set reduce_fusion to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set enable_xqa to True.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set tokens_per_block to 64.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set multiple_profiles to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set paged_state to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set streamingllm to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set manage_weights to False.
[09/05/2024-09:34:13] [TRT-LLM] [I] Set use_fused_mlp to True.
[09/05/2024-09:34:13] [TRT] [I] Loaded engine size: 7389 MiB
[09/05/2024-09:34:14] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 9627 (MiB)
[09/05/2024-09:34:15] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 9627 (MiB)
[09/05/2024-09:34:15] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
[09/05/2024-09:34:15] [TRT-LLM] [I] Load engine takes: 4.397858619689941 sec
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:220: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[09/05/2024-09:34:17] [TRT-LLM] [I] ---------------------------------------------------------
[09/05/2024-09:34:17] [TRT-LLM] [I] 
[Q] Which city is this?
[09/05/2024-09:34:17] [TRT-LLM] [I] 
[A]: ["I'm sorry, but I cannot provide specific location details such as city names."]
[09/05/2024-09:34:17] [TRT-LLM] [I] Generated 17 tokens
[09/05/2024-09:34:17] [TRT-LLM] [I] ---------------------------------------------------------

Additional notes

❓ Issue Description:

The model should correctly infer that the image depicts "Singapore," specifically by recognizing the Merlion in the image. However, instead of "Singapore," the model outputs a generic response indicating an inability to provide location details.

lfr-0531 commented 2 months ago

From the log, TensorRT-LLM works fine. This may be a model accuracy issue. Can you get the expected outputs by running the inference code from the HF model page?

eoastafurov commented 2 months ago

Quoting the earlier question: "Can you get the expected outputs by running the inference code from the HF model page?"

By running the HF example I get the output: "The image shows a cityscape with a prominent building that resembles the Marina Bay Sands hotel in Singapore. The presence of the Merlion statue and the unique architectural style of the building suggest that this is Singapore."
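For reference, a sketch of that HF baseline, adapted from memory of the Phi-3-vision model card (the exact snippet may differ; consult the model card):

# Reference HF inference for the same image and question.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

url = ("https://storage.googleapis.com/sfr-vision-language-research/"
       "LAVIS/assets/merlion.png")
image = Image.open(requests.get(url, stream=True).raw)

messages = [{"role": "user", "content": "<|image_1|>\nWhich city is this?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs, max_new_tokens=100,
    eos_token_id=processor.tokenizer.eos_token_id)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])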

eoastafurov commented 2 months ago

Any updates?

Somasundaram-Palaniappan commented 2 months ago

Same issue. @eoastafurov Thanks for reporting

eoastafurov commented 2 months ago

@byshiue @kaiyux

I believe this issue might be related to image preprocessing and/or the ptuning_setup_phi3 function (link: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/multimodal_model_runner.py#L838). It seems the LLM is functioning correctly but isn't able to see the image correctly.

I tested the same example image with the input text:

""Can you describe in detail what you observe in this image? Is it white noise, a completely black screen, or something else? Please provide as much information as possible about it.""

The output I received was: "The image provided appears to be a solid color with no discernible patterns, text, or objects. It is not white noise, as white noise would typically have a grainy texture and a random distribution of light and dark areas. It is also not a completely black screen, as there is no visible content or display on the screen. The color of the image is a uniform, light shade, possibly white or a very light gray, with no variation across the entire image. Without additional context or variations in the image, it is not possible to provide a more detailed description."

Can someone confirm if they've run the example from the documentation and received correct outputs?
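A hypothetical first diagnostic for the preprocessing hypothesis above (not from the thread): dump statistics of the pixel tensor the HF processor produces for the same image, so they can be compared against whatever multimodal_model_runner.py ends up feeding the vision engine.

# Hypothetical diagnostic: inspect the HF processor's pixel tensor for the
# Merlion image; degenerate stats would point at the preprocessing path.
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "/models/Phi-3-vision-128k-instruct", trust_remote_code=True)
url = ("https://storage.googleapis.com/sfr-vision-language-research/"
       "LAVIS/assets/merlion.png")
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor("<|image_1|>\nWhich city is this?", [image],
                   return_tensors="pt")
pv = inputs["pixel_values"]
print("shape:", tuple(pv.shape), "dtype:", pv.dtype)
print("min/max/mean:", pv.min().item(), pv.max().item(), pv.mean().item())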

Somasundaram-Palaniappan commented 2 months ago

Same observations.

lfr-0531 commented 2 months ago

Sorry for the late response. I can reproduce this issue and we are fixing it.

Somasundaram-Palaniappan commented 2 months ago

Any updates?

lfr-0531 commented 2 months ago

This is actually a TensorRT issue; it is still a work in progress.

Somasundaram-Palaniappan commented 1 month ago

Hi, just checking to see if there is any good news.

eoastafurov commented 1 month ago

Any updates?

lfr-0531 commented 1 month ago

Similar to this issue: https://github.com/NVIDIA/TensorRT-LLM/issues/2369#issuecomment-2435782359

We found that there are some compatibility issues between Hugging Face and torch.onnx, and we are also trying to find a workaround to resolve this from the TRT-LLM side.
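For context, the vision encoder engine is produced by exporting the HF vision tower through torch.onnx before TensorRT consumes it, which is the step where such incompatibilities tend to surface. Below is a toy illustration of that export path only; the stand-in module is not the real encoder, and the actual flow in build_visual_engine.py may differ.

# Toy illustration of the torch.onnx export step (assumed flow; stand-in
# module, not the real Phi-3 vision tower).
import torch

class ToyVisionEncoder(torch.nn.Module):
    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Stand-in for a CLIP-style tower: patchify into a token sequence.
        return pixel_values.flatten(2).transpose(1, 2)

dummy = torch.randn(1, 3, 336, 336)
torch.onnx.export(
    ToyVisionEncoder(), dummy, "visual_encoder.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
)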

symphonylyh commented 4 weeks ago

See https://github.com/NVIDIA/TensorRT-LLM/issues/2369#issuecomment-2455888795. Closing for now; next week's main branch update will contain the workaround fix.