TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
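For illustration (not part of the original report), a minimal sketch of that high-level Python API, assuming a recent tensorrt_llm release that exposes the LLM entry point; the model path and prompt are placeholders:

    # Minimal sketch of the high-level API (assumes a recent tensorrt_llm
    # release with the LLM entry point; model path and prompt are placeholders).
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="./bloom/560M/")        # build/load an engine for the model
    params = SamplingParams(max_tokens=32)  # cap the generated length
    for output in llm.generate(["Hello, my name is"], params):
        print(output.outputs[0].text)

Converting the BLOOM 560M checkpoint with the bundled example script, however, fails with an import error: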
TensorRT-LLM/examples/bloom# python3 convert_checkpoint.py --model_dir ./bloom/560M/ \
--dtype float16 \
--output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/
Running /TensorRT-LLM/examples/bloom/convert_checkpoint.py
Traceback (most recent call last):
File "/TensorRT-LLM/examples/bloom/convert_checkpoint.py", line 26, in <module>
from tensorrt_llm.models.llama.utils import iterate_shard_files, load_state_dict #TODO: move the utils to common dir shared by models
ModuleNotFoundError: No module named 'tensorrt_llm.models.llama.utils'
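This usually indicates a version mismatch: the TODO in the failing import line says these helpers were slated to move out of tensorrt_llm.models.llama, so the installed wheel and the examples checkout disagree about where iterate_shard_files and load_state_dict live. A minimal diagnostic sketch, assuming only the standard library and an importable tensorrt_llm (the second probed path is an assumption about later releases):

    # Diagnostic sketch: check whether the installed wheel matches the
    # examples checkout. The first module path comes from the traceback;
    # the second is an assumed later location for the same helpers.
    import importlib.util
    import tensorrt_llm

    print("installed tensorrt_llm:", tensorrt_llm.__version__)
    # None here means the wheel does not ship the module the example imports.
    print(importlib.util.find_spec("tensorrt_llm.models.llama.utils"))
    # Assumed location of the same helpers in later releases.
    print(importlib.util.find_spec("tensorrt_llm.models.convert_utils"))

If the first probe prints None, reinstalling a wheel built from the same commit as the examples checkout (or checking out the examples at the tag matching tensorrt_llm.__version__) should align the two.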
The command follows the "Single GPU on BLOOM 560M" step in the example README:
python convert_checkpoint.py --model_dir ./bloom/560M/ \
    --dtype float16 \
    --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/