System Info
NVIDIA A10 GPU, Databricks
Who can help?
No response
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Followed the steps in /examples/llama to build the engine. Inference does work when tested with ../run.py.
However, initialising from the built engine using the high-level Python API does not work.
Version in use:
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
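For context, the build and smoke test roughly follow the steps in the examples/llama README. This is a sketch, not the exact commands from this report: the model paths below are placeholders, and the flags may differ between TensorRT-LLM versions.

```shell
# Convert the Hugging Face checkpoint to a TensorRT-LLM checkpoint
# (paths are placeholders; flags follow the examples/llama README)
python convert_checkpoint.py --model_dir ./llama-hf \
    --output_dir ./tllm_ckpt --dtype float16

# Build the TensorRT engine from the converted checkpoint
trtllm-build --checkpoint_dir ./tllm_ckpt --output_dir ./engine_dir

# Direct inference with the example runner works (run from examples/llama):
python ../run.py --engine_dir ./engine_dir --tokenizer_dir ./llama-hf \
    --max_output_len 64 --input_text "Hello"
```

The failure described below occurs only when pointing the high-level Python API at the same engine directory, not with ../run.py.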
Expected behavior
The high-level Python API should load the config from the built engine and run inference.
actual behavior
The config does not load; initialising from the built engine fails.
additional notes