Open xiaonans opened 3 months ago
@Tracin Could you please have a look? Thanks
@xiaonans For a group_size of, say, 128, every 128 4-bit weights share one fp16 zero_point. The memory ratio of zp to weight is 16 / (128 * 4), about 3% if my calculation is correct. I think it is negligible. On the other hand, dequantizing the zp in the kernel will add overhead. Do you agree with me?
> @xiaonans For a group_size of, say, 128, every 128 4-bit weights share one fp16 zero_point. The memory ratio of zp to weight is 16 / (128 * 4), about 3% if my calculation is correct. I think it is negligible.

If group_size=64 and 2-bit weights are used, the memory ratio of zero-point to weight rises to about 12.5%, i.e. 16/(64*2).

> On the other hand, dequantizing the zp in the kernel will add overhead. Do you agree with me?

If the fpA_intB_gemm kernel could load int4 zero-points directly from global memory, the loading overhead would be lower than with fp16 zero-points. In memory-bound scenarios, that should bring a speedup.
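The overhead ratios discussed above follow from simple arithmetic; a minimal sketch (the helper name is made up for illustration):

```python
def zp_overhead(weight_bits: int, group_size: int, zp_bits: int = 16) -> float:
    """Memory ratio of zero-points to quantized weights,
    assuming one zero-point per group of `group_size` weights."""
    return zp_bits / (group_size * weight_bits)

# fp16 zp over int4 weights, group_size=128 -> ~3.1%
print(f"{zp_overhead(4, 128):.1%}")
# fp16 zp over int2 weights, group_size=64 -> 12.5%
print(f"{zp_overhead(2, 64):.1%}")
# int4 zp over int4 weights, group_size=128 -> ~0.8%
print(f"{zp_overhead(4, 128, zp_bits=4):.1%}")
```

Keeping the zero-points in int4 cuts the 3.1% overhead to about 0.8% for the 4-bit/128 case, which is the saving the request is after.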
@xiaonans How do you deploy this model (Qwen/Qwen2-1.5B-Instruct-AWQ · Hugging Face) on TensorRT-LLM, given that TensorRT-LLM only supports fp16 transformer models? Thank you!
Currently some quantized Hugging Face models save zero-points directly in an int4 datatype, like Qwen/Qwen2-7B-Instruct-GPTQ-Int4 and Qwen/Qwen2-1.5B-Instruct-AWQ · Hugging Face.
But weight_only_groupwise_quant_matmul in TensorRT-LLM only supports fp16 zero-points as input, forcing a data type conversion like https://github.com/NVIDIA/TensorRT-LLM/blob/a96cccafcf6365c128f004f779160951f8c0801c/tensorrt_llm/models/qwen/weight.py#L104.
For groupwise quantization, the memory cost of zero-points is not negligible. Would you please add int-type zero-point support in the weight-only GEMM?
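The conversion in question can be illustrated with a rough NumPy sketch. The function name and the exact nibble layout here are assumptions for illustration only, not the actual TensorRT-LLM loader code (real GPTQ/AWQ checkpoints may also apply an offset to the zeros):

```python
import numpy as np

def unpack_int4_zeros_to_fp16(qzeros: np.ndarray) -> np.ndarray:
    """Unpack eight 4-bit zero-points from each int32 and cast to fp16,
    the kind of offline conversion the fp16-only GEMM plugin forces.
    Hypothetical sketch; nibble order assumed little-endian within int32."""
    shifts = np.arange(0, 32, 4, dtype=np.uint32)        # 8 nibbles per int32
    nibbles = (qzeros[..., None].view(np.uint32) >> shifts) & 0xF
    return nibbles.reshape(*qzeros.shape[:-1], -1).astype(np.float16)

packed = np.array([[0x76543210]], dtype=np.int32)
print(unpack_int4_zeros_to_fp16(packed))  # [[0. 1. 2. 3. 4. 5. 6. 7.]]
```

The point of the feature request is that this unpack-and-cast step quadruples the zero-point memory footprint; a GEMM that consumes the packed int4 zeros directly would skip it.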