NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Bug] Error while converting multimodal Phi 3 Vision model to TRT-LLM checkpoints #2019

Open monoclex opened 1 month ago

monoclex commented 1 month ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

When trying to convert microsoft/Phi-3-vision-128k-instruct to a TRT-LLM checkpoint following the instructions in examples/multimodal, I run into an undefined-symbol error, which I believe results from a version mismatch somewhere in the chain of dependencies.

Below is the set of steps I followed to get to this point:

# Install buildx since the GCP provisioned machines don't have it by default
sudo apt-get install -y docker-buildx-plugin

# Steps from https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/docs/source/installation.md
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout rel # Use a stable version to avoid errors, 0.11.0 at the time of writing
git submodule update --init --recursive
git lfs install
git lfs pull

# Build TensorRT-LLM
make -C docker build

# Run TensorRT-LLM
make -C docker run

# At this point we're inside the container

# Go to the multimodal vision example
cd examples/multimodal

# Download the model
export MODEL_NAME="Phi-3-vision-128k-instruct"
git clone https://huggingface.co/microsoft/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}

sudo apt-get install -y git-lfs
(cd tmp/hf_models/${MODEL_NAME} && git lfs fetch)

# Install dependencies
pip install -r ../gpt/requirements.txt

export MODEL_NAME="Phi-3-vision-128k-instruct"
python ../gpt/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16
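Before running the converter, a quick way to confirm whether the installed wheel imports at all (the step that actually fails below) is a guarded import. This is a minimal diagnostic sketch, not part of the original reproduction steps:

```python
# Minimal import smoke test: an ImportError raised here reproduces the
# problem without involving the example scripts at all.
import importlib


def can_import(name: str) -> bool:
    """Return True if the module imports cleanly, False on ImportError."""
    try:
        importlib.import_module(name)
        return True
    except ImportError as exc:
        print(f"{name} failed to import: {exc}")
        return False


if __name__ == "__main__":
    print("tensorrt_llm importable:", can_import("tensorrt_llm"))
```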

Expected behavior

I expect to convert the Phi 3 vision model to a TRT-LLM checkpoint.

Actual behavior

Upon attempting to convert the model to a TRT-LLM checkpoint, I get the following error:

$ python ../gpt/convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16
Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/multimodal/../gpt/convert_checkpoint.py", line 26, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 28, in <module>
    from . import graph_rewriting as gw
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/network.py", line 27, in <module>
    from tensorrt_llm.module import Module
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 31, in <module>
    from ._utils import str_dtype_to_trt
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 30, in <module>
    from tensorrt_llm.bindings.BuildInfo import ENABLE_MULTI_DEVICE
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
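The missing symbol demangles to `c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&)`, which is exported by PyTorch's libc10, so the likely culprit is a mismatch between the torch the bindings were compiled against and the torch installed in the container. A hedged sketch for listing the relevant versions as a first step (the package names are assumptions about what the container ships):

```python
# Hedged diagnostic sketch: list the versions of the packages whose
# binary interfaces must agree for tensorrt_llm's bindings to load.
from importlib import metadata


def installed_version(dist: str) -> str:
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    # Package names are assumptions about what the container ships.
    for dist in ("torch", "tensorrt", "tensorrt_llm"):
        print(f"{dist}: {installed_version(dist)}")
```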

Additional notes

I also tried installing the requirements found in examples/phi/requirements.txt, to no avail. Attached to this issue is the output of pip freeze and apt list --installed:

pip_freeze.txt

apt_list.txt

QiJune commented 1 month ago

Hi @monoclex, I think you should try python ../phi/convert_checkpoint.py

monoclex commented 1 month ago

@QiJune I forgot to mention it, but I believe I tried that as well, similarly to no avail. If it works for you, let me know how!

lkc-fp commented 1 month ago

@byshiue Can you confirm if this is a bug? Are there any ideas to get around this in the meantime?

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.