NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Does ATQ work with tensor parallelism? #472

Open theophilegervet opened 7 months ago

theophilegervet commented 7 months ago

I've been using atq.INT4_AWQ_CFG and observing a performance drop when quantizing a Llama 70B model with tensor parallelism via atq.quantize(model, quant_cfg, forward_loop=calibrate_loop).
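For reference, a minimal sketch of that call pattern, assuming atq here refers to NVIDIA's AMMO quantization toolkit (ammo.torch.quantization) and using a placeholder calib_dataloader for the calibration data; this is illustrative, not a verified reproduction:

```python
# Minimal sketch of the quantization flow described above. Assumptions:
# "atq" is NVIDIA's AMMO toolkit (ammo.torch.quantization), `model` is the
# loaded Llama 70B, and `calib_dataloader` is a placeholder for calibration
# batches.
import torch
import ammo.torch.quantization as atq

def calibrate_loop(model):
    # Run a few forward passes so AWQ can observe activation statistics.
    with torch.no_grad():
        for batch in calib_dataloader:  # placeholder calibration data
            model(batch)

quant_cfg = atq.INT4_AWQ_CFG  # 4-bit AWQ weight-only quantization config
model = atq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
```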

Quantization works well through HuggingFace with pipeline parallelism, though.

Is this expected behavior or a bug? Should we expect a performance drop when using tensor parallelism with ATQ?

jdemouth-nvidia commented 7 months ago

Can you share command lines to reproduce the issue, please? And tell us your hardware configuration. Thanks!