NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Support int type zero-points in weight-only GEMM #1922

Open xiaonans opened 3 months ago

xiaonans commented 3 months ago

Currently, some quantized Hugging Face models, such as Qwen/Qwen2-7B-Instruct-GPTQ-Int4 and Qwen/Qwen2-1.5B-Instruct-AWQ, save their zero-points directly in the int4 data type.

However, weight_only_groupwise_quant_matmul in TensorRT-LLM only supports fp16 zero-points as input, which forces a data type conversion such as https://github.com/NVIDIA/TensorRT-LLM/blob/a96cccafcf6365c128f004f779160951f8c0801c/tensorrt_llm/models/qwen/weight.py#L104.
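For reference, a minimal sketch of the kind of conversion meant here, assuming unpacked integer zero-points (the function name, shapes, and sign convention are illustrative only, not the actual weight.py code; int32 packing and GPTQ/AWQ offset details are omitted):

```python
import torch

def fold_int_zeros_to_fp16(zeros_int: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Fold integer zero-points into fp16 offsets so that
    #   w_int * scale + zero_fp16 == (w_int - zero_int) * scale,
    # i.e. the form an fp16-zero-point GEMM path can consume directly.
    return -(zeros_int.to(torch.float16) * scales.to(torch.float16))
```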

For groupwise quantization, the memory cost of zero-points is not negligible. Could you please add int type zero-point support in the weight-only GEMM?

QiJune commented 3 months ago

@Tracin Could you please have a look? Thanks

Tracin commented 3 months ago

@xiaonans For a group_size of, say, 128, every 128 4-bit weights share one fp16 zero_point. The memory ratio of zp / weight is 16 / (128 * 4). That's about 3% if my calculation is correct, so I think it is negligible. On the other hand, dequantizing the zp in the kernel will bring overhead. Do you agree with me?
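For reference, the same back-of-the-envelope ratio as a tiny script (the function name is just for illustration):

```python
def zp_to_weight_ratio(group_size: int, weight_bits: int, zp_bits: int = 16) -> float:
    # One zero-point of zp_bits per group of group_size weights of weight_bits each.
    return zp_bits / (group_size * weight_bits)

print(zp_to_weight_ratio(128, 4))  # 0.03125 -> ~3% overhead for fp16 zp, 4-bit weights
```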

xiaonans commented 3 months ago

> @xiaonans For a group_size of, say, 128, every 128 4-bit weights share one fp16 zero_point. The memory ratio of zp / weight is 16 / (128 * 4). That's about 3% if my calculation is correct, so I think it is negligible.

If group_size=64 and 2-bit weights are used, the memory ratio of zero-point/weight is 16/(64*2), about 12.5%.

> On the other hand, dequantizing the zp in the kernel will bring overhead. Do you agree with me?

If the fpA_intB_gemm kernel could load int4 zero-points directly from global memory, the loading overhead would be reduced compared with fp16 zero-points. In memory-bound scenarios this should bring a speedup.
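Running the same calculation for this case (illustrative numbers only):

```python
group_size, weight_bits = 64, 2
print(16 / (group_size * weight_bits))  # 0.125   -> 12.5% overhead with fp16 zero-points
print(4 / (group_size * weight_bits))   # 0.03125 -> ~3% if zero-points stay int4
```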

shaoyanguo commented 2 months ago

@xiaonans How do you deploy this model (Qwen/Qwen2-1.5B-Instruct-AWQ) on TensorRT-LLM, given that TensorRT-LLM only supports fp16 transformer models? Thank you!