TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
git clone -b main https://github.com/triton-inference-server/tensorrtllm_backend.git
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
Traceback (most recent call last):
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 839, in <module>
    build(0, args)
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 783, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py", line 710, in build_rank_engine
    tensorrt_llm_llama(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 379, in forward
    hidden_states = super().forward(input_ids, position_ids, use_cache,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 260, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 117, in forward
    attention_output = self.attention(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 1103, in forward
    context, past_key_value = gpt_attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/graph_rewriting.py", line 564, in wrapper
    outs = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3548, in gpt_attention
    plug_inputs = [i.trt_tensor for i in plug_inputs]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3548, in <listcomp>
    plug_inputs = [i.trt_tensor for i in plug_inputs]
AttributeError: 'NoneType' object has no attribute 'trt_tensor'
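The failure is in the list comprehension that collects `trt_tensor` from `plug_inputs`: at least one entry is `None`, which typically means an optional input to `gpt_attention` was never created (for example because a plugin or build flag it depends on was not set). A minimal sketch of the failure mode and a quick way to locate the missing input follows; the `FakeTensor` class and the input names are illustrative, not TensorRT-LLM API:

```python
# Illustrative reproduction of the AttributeError pattern, not TensorRT-LLM code.
class FakeTensor:
    """Stand-in for a tensorrt_llm tensor wrapper exposing .trt_tensor."""
    def __init__(self, name):
        self.trt_tensor = name

# One entry is None, mimicking an optional attention input that was never built.
plug_inputs = [FakeTensor("qkv"), None, FakeTensor("sequence_length")]

try:
    tensors = [i.trt_tensor for i in plug_inputs]
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'trt_tensor'

# Find which positional input is missing before the crash:
missing = [idx for idx, i in enumerate(plug_inputs) if i is None]
print(missing)  # [1]
```

Printing the indices of the `None` entries (e.g. with a breakpoint just before line 3548 of `functional.py`) narrows down which positional argument of `gpt_attention` was left unset.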
I used the following steps to build the SQ engine:
First, build the Docker image from the main branch (commands above).
Then, build the engines using that image.
I got the error shown above when running build.py.
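For reference, the build invocation looked roughly like the following. The flag names are an assumption reconstructed from the llama example's documentation and can differ between TensorRT-LLM versions (the SmoothQuant flags in particular), so this is a sketch, not the exact command; here it is only assembled and echoed rather than executed:

```shell
# Illustrative build.py invocation for a SmoothQuant engine.
# Flag names are assumptions based on the llama example and may differ by version.
BUILD_CMD="python3 tensorrt_llm/examples/llama/build.py \
  --model_dir ./llama-7b-hf \
  --dtype float16 \
  --use_gpt_attention_plugin float16 \
  --use_gemm_plugin float16 \
  --use_smooth_quant --per_token --per_channel \
  --output_dir ./engines/llama-sq"
echo "$BUILD_CMD"
```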