TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
We want to deploy https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-bnb-4bit which is 4-bit quantized version of llama-3.2-1B model. It is quantized using bitsandbytes. Can we deploy this using tensor RT-LLM backend ? If so, is there any documentation to refer?
We want to deploy https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-bnb-4bit which is 4-bit quantized version of llama-3.2-1B model. It is quantized using bitsandbytes. Can we deploy this using tensor RT-LLM backend ? If so, is there any documentation to refer?