TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
I have successfully built and started a Docker container for TensorRT-LLM, and ran convert_checkpoint.py as well as trtllm-build as follows:
docker run -it --net host --shm-size=4g --name triton_llm --ulimit memlock=-1 --ulimit stack=67108864 --gpus '"device=1"' -v ~/shared_folder/TensorRT:/opt/tritonserver/TensorRT nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir ${ENGINE_DIR} \
    --paged_kv_cache enable \
    --max_batch_size 8
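As a sanity check (a generic one, not specific to this setup; exact artifact names may vary between TensorRT-LLM versions), the build output and the installed wheel version can be inspected like this:

# List the engine artifacts produced by trtllm-build
# (typically config.json plus one rank*.engine file per GPU rank).
ls -lh ${ENGINE_DIR}

# Print the version of the tensorrt_llm wheel installed in the container.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"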
Now I was trying to test the engine using run.py from the examples directory as:
python3 /opt/tritonserver/TensorRT/TensorRT-LLM/examples/run.py --engine_dir=${ENGINE_DIR} --max_output_len 128 --tokenizer_dir /opt/tritonserver/TensorRT/model/llama-2-7b --input_text "What is ML" --streaming --streaming_interval 2 --temperature 0.7 --top_k 3 --top_p 0.9
I am facing 2 issues:
Issue 1: ImportError: cannot import name 'supports_inflight_batching' from 'tensorrt_llm._utils'
File "/opt/tritonserver/TensorRT_LLM_RB/TensorRT-LLM/examples/run.py", line 25, in
from utils import (DEFAULT_HF_MODEL_DIRS, DEFAULT_PROMPT_TEMPLATES,
File "/opt/tritonserver/TensorRT_LLM_RB/TensorRT-LLM/examples/utils.py", line 26, in
from tensorrt_llm._utils import supports_inflight_batching # noqa
ImportError: cannot import name 'supports_inflight_batching' from 'tensorrt_llm._utils' (/usr/local/lib/python3.10/dist-
packages/tensorrt_llm/_utils.py)
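A quick way to confirm whether the installed wheel exports that helper at all (a generic one-liner; if it prints False, the examples checkout is simply newer than the installed tensorrt_llm wheel):

# Prints True only if the installed wheel defines supports_inflight_batching.
python3 -c "import tensorrt_llm._utils as u; print(hasattr(u, 'supports_inflight_batching'))"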
I tried to fix this by copying the tensorrt_llm folder into the examples folder (roughly as sketched below), and it resolved this issue.
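Roughly what I did (a hypothetical reconstruction from memory; treat the paths as illustrative):

# Copy the repo's Python package sources next to run.py so they get
# imported instead of the installed wheel (reconstructed workaround).
cd /opt/tritonserver/TensorRT_LLM_RB/TensorRT-LLM
cp -r tensorrt_llm examples/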
Issue 2: After fixing issue 1, a new error occurred when I ran run.py again:
File "/opt/tritonserver/TensorRT_LLM_RB/TensorRT-LLM/examples/tensorrt_llm/_utils.py", line 31, in <module>
    from tensorrt_llm.bindings import GptJsonConfig
ModuleNotFoundError: No module named 'tensorrt_llm.bindings'
For issue 2, I am also not able to find a bindings folder in tensorrt_llm, and I am not sure what is wrong.
If convert_checkpoint.py from the examples directory works fine without causing issue 1, why is run.py throwing this error?
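For reference, a generic way to check which copy of the package a script imports, and whether the compiled bindings module is visible from there (run these from the same directory as the failing script):

# Show where the tensorrt_llm package resolves from; if this prints a path
# under examples/, the copied source tree is shadowing the installed wheel.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__file__)"

# Check whether the compiled bindings extension is importable from that copy;
# this prints None when no bindings module can be found.
python3 -c "import importlib.util; print(importlib.util.find_spec('tensorrt_llm.bindings'))"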