pei0033 opened 6 days ago
I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4A8) checkpoints. AWQ can be applied via nvidia-modelopt.
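Roughly, the modelopt flow looks like this. This is just a sketch: the model id, calibration texts, and export arguments are placeholders, and the exact modelopt API may differ between versions, so check the modelopt docs for your release.

```python
# Sketch: INT4-AWQ quantization with nvidia-modelopt, then export of a
# TensorRT-LLM checkpoint. Model id and calibration data are placeholders;
# exact function signatures may differ between modelopt versions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder HF model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tiny stand-in calibration set; real runs use a few hundred samples.
calib_texts = [
    "TensorRT-LLM builds engines from quantized checkpoints.",
    "AWQ scales are computed from activation statistics.",
]

def forward_loop(m):
    # modelopt calls this to collect the activation stats used for AWQ scaling.
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(m.device)
        m(ids)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)

export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",            # assumption: llama-family model
    dtype=torch.float16,
    export_dir="./trtllm_ckpt_int4_awq",
    inference_tensor_parallel=1,
)
```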
Even if you can quantize with AWQ using nvidia-modelopt, the results are quite different from AutoAWQ-quantized models! I am struggling to find a way to serve a single quantized model with several adapters in TRT-LLM. Currently, it seems you need to start adapter fine-tuning from a GPTQ model. If you fine-tune adapters on an AWQ or BNB model, the results are completely different once you land on TRT-LLM!
The problem is that you can't apply the convert_checkpoint.py script to an AutoAWQ checkpoint. You have to use the quantize.py script, which starts from the unquantized model and goes straight to a TRT-LLM checkpoint (see the sketch below). Either something is broken here, or AutoAWQ and modelopt just use different quantization algorithms, leading to different results once you apply the fine-tuned adapter.
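Concretely, the quantize.py route I mean is something like the following. Flag names are as I remember them from examples/quantization and may differ by release; paths and the calibration size are placeholders, so verify against the script's --help.

```python
# Sketch: invoking TensorRT-LLM's examples/quantization/quantize.py, which
# re-quantizes from the *unquantized* HF model and runs calibration itself.
# Flag names as recalled from the quantization example; verify with --help.
import subprocess

subprocess.run(
    [
        "python", "examples/quantization/quantize.py",
        "--model_dir", "meta-llama/Llama-2-7b-hf",  # unquantized HF model (placeholder)
        "--dtype", "float16",
        "--qformat", "int4_awq",
        "--awq_block_size", "128",
        "--calib_size", "512",
        "--output_dir", "./trtllm_ckpt_int4_awq",
    ],
    check=True,
)
```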
Is there any way to load a quantized model directly from Hugging Face and convert it to a TensorRT-LLM checkpoint (or engine) without calibration? I could find a script for AutoGPTQ, but I couldn't find one for other quantization methods (like AutoAWQ, compressed-tensors, or BNB).
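For context, the AutoGPTQ path I found goes through the llama example's convert_checkpoint.py with a pre-quantized checkpoint and no calibration step, roughly like the sketch below. Flag names are as I remember them from that example and may differ by version; all paths are placeholders.

```python
# Sketch: converting a pre-quantized AutoGPTQ checkpoint to a TensorRT-LLM
# checkpoint without re-running calibration. Flag names follow the llama
# example as I recall them; verify with `convert_checkpoint.py --help`.
import subprocess

subprocess.run(
    [
        "python", "examples/llama/convert_checkpoint.py",
        "--model_dir", "meta-llama/Llama-2-7b-hf",                  # HF model for config (placeholder)
        "--quant_ckpt_path", "./llama-7b-gptq/model.safetensors",   # AutoGPTQ weights (placeholder)
        "--dtype", "float16",
        "--use_weight_only",
        "--weight_only_precision", "int4_gptq",
        "--per_group",
        "--output_dir", "./trtllm_ckpt_int4_gptq",
    ],
    check=True,
)
# The resulting checkpoint is then built into an engine with `trtllm-build`.
```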