I've been using `atq.INT4_AWQ_CFG` and observing a performance drop when quantizing a Llama 70B model with tensor parallelism via `atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)`.
Quantization works well through HuggingFace with pipeline parallelism, though.
Is this expected behavior or a bug? Should we expect performance drops when using tensor parallelism with ATQ?
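
For reference, here is roughly the flow I'm following (a minimal sketch; the checkpoint name and calibration prompts are placeholders, and `atq` is the AMMO quantization module as in the snippet above):

```python
# Minimal sketch of the quantization flow described above. Assumptions:
# the AMMO quantization toolkit imported as `atq` (as used in this post),
# a placeholder model checkpoint, and dummy calibration prompts.
import torch
import ammo.torch.quantization as atq
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-70b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

calib_prompts = ["Hello, world."]  # placeholder calibration data

def calibrate_loop(model):
    # Forward a few calibration batches so AWQ can gather activation stats.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)

quant_cfg = atq.INT4_AWQ_CFG
model = atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
```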