NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Is it possible to load a quantized model from huggingface? #2458

Open pei0033 opened 6 days ago

pei0033 commented 6 days ago

Is there any way to load a quantized model directly from huggingface and convert it to a TensorRT-LLM checkpoint (or engine) without calibration? I could find some scripts for AutoGPTQ, but I couldn't find anything for other quantization methods (like AutoAWQ, CompressedTensor or BNB).

Tracin commented 5 days ago

I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4Aint8). AWQ can be applied via nvidia-modelopt.
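
For the AutoGPTQ path, the per-model conversion scripts can ingest a pre-quantized GPTQ checkpoint. A minimal sketch, assuming a LLaMA-style model; the paths are placeholders and the exact flags can differ between TensorRT-LLM releases:

```bash
# Sketch: turn a HF model plus an AutoGPTQ checkpoint into a TensorRT-LLM
# checkpoint, then build the engine. All paths below are placeholders.
python examples/llama/convert_checkpoint.py \
    --model_dir ./Llama-2-7b-hf \
    --quant_ckpt_path ./llama-2-7b-gptq.safetensors \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --dtype float16 \
    --output_dir ./tllm_ckpt_gptq

trtllm-build --checkpoint_dir ./tllm_ckpt_gptq --output_dir ./engine_gptq
```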

lodm94 commented 4 days ago

> I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4Aint8). AWQ can be applied via nvidia-modelopt.

Even if you can quantize with AWQ using nvidia-modelopt, the results are quite different from AutoAWQ-quantized models! I am struggling to find a way to serve a single quantized model with several adapters in TRT-LLM. Currently, it seems you need to start adapter fine-tuning from a GPTQ model. If you fine-tune adapters on an AWQ or BNB base, the results are completely different once you land on TRT-LLM!
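
(For reference, attaching adapters at engine-build time looks roughly like the sketch below; the adapter path is hypothetical and the LoRA-related flags may differ between TensorRT-LLM versions.)

```bash
# Sketch: build an engine from a quantized TensorRT-LLM checkpoint with the
# LoRA plugin enabled, registering a HF PEFT adapter directory at build time.
# ./my_adapter is a hypothetical path; flag names may vary by release.
trtllm-build \
    --checkpoint_dir ./tllm_ckpt_gptq \
    --output_dir ./engine_gptq_lora \
    --gemm_plugin auto \
    --lora_plugin auto \
    --lora_dir ./my_adapter
```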

The problem is that you can't apply the convert_checkpoint.py script to an AutoAWQ checkpoint. You need to use the quantize.py script, which starts from the unquantized model and goes straight to a TRT-LLM checkpoint. Either something is broken here, or AutoAWQ and modelopt just use different quantization algorithms, leading to different results once you apply the fine-tuned adapter.
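
For completeness, the modelopt-based path referred to above looks roughly like this. A sketch with placeholder paths; note that it runs calibration on the unquantized model, which is exactly what the original question was hoping to avoid:

```bash
# Sketch: quantize an unquantized HF model to INT4-AWQ via the modelopt-based
# quantize.py, then build the engine. Paths are placeholders and flag names
# may differ between releases. This path performs calibration.
python examples/quantization/quantize.py \
    --model_dir ./Llama-2-7b-hf \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --calib_size 32 \
    --output_dir ./tllm_ckpt_awq

trtllm-build --checkpoint_dir ./tllm_ckpt_awq --output_dir ./engine_awq
```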