NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Is it possible to load a quantized model from huggingface? #2458

Open pei0033 opened 6 days ago

pei0033 commented 6 days ago

Is there any way to load a quantized model directly from huggingface and convert it to a TensorRT-LLM checkpoint (or engine) without calibration? I could find some scripts for AutoGPTQ, but I couldn't find anything for other quantization methods (like AutoAWQ, CompressedTensor or BNB).

Tracin commented 5 days ago

I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4Aint8). AWQ can be applied via nvidia-modelopt.
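
For the AutoGPTQ path, the per-model conversion scripts can ingest a pre-quantized GPTQ checkpoint. A minimal sketch, assuming a LLaMA-style model; the paths are placeholders and the exact flags can differ between TensorRT-LLM releases:

```bash
# Sketch: turn a HF model plus an AutoGPTQ checkpoint into a TensorRT-LLM
# checkpoint, then build the engine. All paths below are placeholders.
python examples/llama/convert_checkpoint.py \
    --model_dir ./Llama-2-7b-hf \
    --quant_ckpt_path ./llama-2-7b-gptq.safetensors \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --dtype float16 \
    --output_dir ./tllm_ckpt_gptq

trtllm-build --checkpoint_dir ./tllm_ckpt_gptq --output_dir ./engine_gptq
```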

lodm94 commented 4 days ago

> I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4Aint8). AWQ can be applied via nvidia-modelopt.

Even if you can quantize with AWQ using nvidia-modelopt, the results are quite different from AutoAWQ-quantized models! I am struggling to find a way to serve a single quantized model with several adapters in TRT-LLM. Currently, it seems you need to start adapter fine-tuning from a GPTQ model. If you fine-tune adapters on an AWQ or BNB base, the results are completely different once you land on TRT-LLM!
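
(For reference, attaching adapters at engine-build time looks roughly like the sketch below; the adapter path is hypothetical and the LoRA-related flags may differ between TensorRT-LLM versions.)

```bash
# Sketch: build an engine from a quantized TensorRT-LLM checkpoint with the
# LoRA plugin enabled, registering a HF PEFT adapter directory at build time.
# ./my_adapter is a hypothetical path; flag names may vary by release.
trtllm-build \
    --checkpoint_dir ./tllm_ckpt_gptq \
    --output_dir ./engine_gptq_lora \
    --gemm_plugin auto \
    --lora_plugin auto \
    --lora_dir ./my_adapter
```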

The problem is that you can't apply the convert_checkpoint.py script to an AutoAWQ checkpoint. You need to use the quantize.py script, which starts from the unquantized model and goes straight to a TRT-LLM checkpoint. Either something is broken here, or AutoAWQ and modelopt just use different quantization algorithms, leading to different results once you apply the fine-tuned adapter.
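
For completeness, the modelopt-based path referred to above looks roughly like this. A sketch with placeholder paths; note that it runs calibration on the unquantized model, which is exactly what the original question was hoping to avoid:

```bash
# Sketch: quantize an unquantized HF model to INT4-AWQ via the modelopt-based
# quantize.py, then build the engine. Paths are placeholders and flag names
# may differ between releases. This path performs calibration.
python examples/quantization/quantize.py \
    --model_dir ./Llama-2-7b-hf \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --calib_size 32 \
    --output_dir ./tllm_ckpt_awq

trtllm-build --checkpoint_dir ./tllm_ckpt_awq --output_dir ./engine_awq
```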