NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Does TensorRT-LLM support serving a 4-bit quantized Unsloth Llama model? #2472

Open jayakommuru opened 1 day ago

jayakommuru commented 1 day ago

We want to deploy https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-bnb-4bit, which is a 4-bit quantized version of the Llama-3.2-1B model, quantized using bitsandbytes. Can we deploy this using the TensorRT-LLM backend? If so, is there any documentation to refer to?

Tracin commented 1 day ago

Sorry, we cannot support that for now.

jayakommuru commented 1 day ago

@Tracin is it because the model uses bitsandbytes NF4 quantization?
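For context on what NF4 is: bitsandbytes stores each weight as a 4-bit index into a fixed codebook of 16 non-uniform levels (quantiles of a standard normal, normalized to [-1, 1], from the QLoRA paper), with a per-block absmax scale. This differs from the uniform INT4 layouts that TensorRT-LLM's quantization paths expect. Below is a minimal NumPy sketch of the NF4 round trip for a single block; it is an illustration of the scheme, not bitsandbytes' actual packed-storage implementation.

```python
import numpy as np

# The 16 NF4 code values used by bitsandbytes (normal-distribution
# quantiles normalized to [-1, 1]).
NF4_CODES = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_quantize(block):
    """Quantize a 1-D block: absmax-scale into [-1, 1], then map each
    value to the index of its nearest NF4 code."""
    absmax = np.abs(block).max()
    scaled = block / absmax
    codes = np.abs(scaled[:, None] - NF4_CODES[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), absmax

def nf4_dequantize(codes, absmax):
    """Recover approximate values: codebook lookup, then rescale."""
    return NF4_CODES[codes] * absmax

rng = np.random.default_rng(0)
w = rng.normal(size=64)               # one 64-element weight block
codes, absmax = nf4_quantize(w)       # codes fit in 4 bits each
w_hat = nf4_dequantize(codes, absmax) # lossy reconstruction
print(codes.min(), codes.max())
print(float(np.abs(w - w_hat).max()))
```

Because each value snaps to the nearest code, the worst-case error per weight is bounded by half the widest gap between adjacent codes times the block's absmax, which is why NF4 works well for normally distributed weights but cannot simply be reinterpreted as uniform INT4 by another runtime.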