Thanks @byshiue! Sorry, I misunderstood what this does; I see now that there is indeed a specialized GEMM plugin for SmoothQuant being set:

```python
network.plugin_config.set_smooth_quant_gemm_plugin(dtype=args.dtype)
```
As a follow-up, my benchmark shows SmoothQuant INT8 running slower than weight-only INT8 for Llama 2, which is surprising if SQ is indeed using a specialized INT8 kernel. Is this expected, or is it due to my build options (maybe it would be faster with per-tensor quantization only?) or my hardware (I am using g5 instances with A10 GPUs)?
Could you share the details of your benchmark? Also, SQ is only faster than weight-only when the batch size is large enough.
It'd be nice if you could share the commands to reproduce the issue, indeed. That said, the result is not necessarily surprising. SmoothQuant (SQ) requires a bit of extra work (like the smoothing of activations). Both INT8 weight-only (W/O) and INT8 SQ work with INT8 weights, so if performance is limited by weight loading (or KV-cache loading), having the activations in INT8 as well (the unique runtime advantage of SQ) won't make a huge difference.
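To make the memory-bound argument concrete, here is a rough back-of-the-envelope sketch (my own illustration, not from this thread) of the bytes a single [B, K] x [K, N] GEMM has to move under INT8 weight-only (FP16 activations) versus INT8 SQ (INT8 activations). The layer shape and the simplifications are assumptions.

```python
# Back-of-the-envelope traffic model for one [B, K] x [K, N] GEMM.
# Illustration only: the shape is hypothetical, and real kernels also
# move scales, biases, and KV cache, which this ignores.
K, N = 4096, 4096  # a Llama-2-7B-like projection shape (assumed)

def bytes_moved(batch: int, act_bytes: int) -> int:
    weights = K * N * 1                  # INT8 weights in both schemes
    acts = batch * (K + N) * act_bytes   # input + output activations
    return weights + acts

for batch in (1, 8, 64, 512):
    wo = bytes_moved(batch, act_bytes=2)  # weight-only: FP16 activations
    sq = bytes_moved(batch, act_bytes=1)  # SmoothQuant: INT8 activations
    print(f"batch={batch:4d}  W/O={wo / 2**20:7.2f} MiB  "
          f"SQ={sq / 2**20:7.2f} MiB  saving={100 * (wo - sq) / wo:4.1f}%")
```

At batch 1 the 16 MiB of weights dominates and SQ saves almost nothing, which is the memory-bound case described above; only at larger batches does the activation traffic (and, once the GEMM turns compute-bound, the INT8 tensor-core math) start to pay off.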
Gotcha, thank you both! I have only done very naive benchmarking so far, e.g. just looping through a list of prompts and measuring token metrics at various batch sizes (a sketch of that loop is below). I'll check where the crossover point is at which SQ starts to outperform W/O. It makes sense that, since this is memory-bound, the optimization only helps once the batch becomes large enough.
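For reference, here is a minimal sketch of that kind of naive loop; the `generate` callable and the prompt list are placeholders for whatever runner is being benchmarked, not a TRT-LLM API:

```python
import time

def naive_benchmark(generate, prompts, batch_sizes):
    """Loop over batch sizes and report rough generation throughput.

    `generate(batch)` is a placeholder: it should run the engine on a
    list of prompts and return the number of output tokens produced.
    """
    for bs in batch_sizes:
        total_tokens = 0
        start = time.perf_counter()
        for i in range(0, len(prompts), bs):
            total_tokens += generate(prompts[i:i + bs])
        elapsed = time.perf_counter() - start
        print(f"batch={bs:3d}  {total_tokens / elapsed:8.1f} tokens/s")
```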
Thanks again, and I'll close this as resolved.
I set:

```python
def set_smooth_quant_plugins(self, dtype: str = "auto"):
    self.smooth_quant_gemm_plugin = "int8"
    self.rmsnorm_quantization_plugin = dtype
    self.layernorm_quantization_plugin = dtype
    self.quantize_per_token_plugin = True
    self.quantize_tensor_plugin = True
    return self
```
but got this error:
```
[08/16/2024-08:54:52] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to int8.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set nccl_plugin to None.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[08/16/2024-08:54:52] [TRT] [W] IElementWiseLayer with inputs QWenForCausalLM/transformer/layers/0/attention/qkv/smooth_quant_gemm/PLUGIN_V2_SmoothQuantGemm_0_output_0 and QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/broadcast_helper/expand_dims_like/expand_dims/view/SHUFFLE_0_output_0: first input has type Int8 but second input has type Half.
[08/16/2024-08:54:52] [TRT] [E] ITensor::getDimensions: Error Code 4: Internal Error (QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types Int8 and Half.)
Traceback (most recent call last):
  File "/root/anaconda3/envs/trt_llm/bin/trtllm-build", line 8, in
```
Will there be any plans to support INT8 GEMM? In the SmoothQuant paper, one of the main benefits is that by quantizing both weights and activations we can use dedicated integer kernels. However, it seems like most TRT-LLM builds only support `['float16', 'bfloat16', 'float32']` for the GEMM plugin.
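To illustrate the paper's point (my own NumPy sketch, not TRT-LLM code): weight-only INT8 dequantizes the weights and still runs a floating-point matmul, while W8A8 (SmoothQuant-style) can run an integer matmul and rescale the INT32 accumulator once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
w = rng.standard_normal((64, 32)).astype(np.float32)  # weights

def quantize(t, axis):
    """Symmetric INT8 quantization with a max-abs scale along `axis`."""
    scale = np.abs(t).max(axis=axis, keepdims=True) / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

wq, w_scale = quantize(w, axis=0)   # per-output-channel weight scales

# Weight-only INT8: dequantize weights, then a floating-point matmul.
y_wo = x @ (wq.astype(np.float32) * w_scale)

# W8A8 (SmoothQuant-style): quantize activations too, run an integer
# matmul with an INT32 accumulator, and rescale once at the end.
xq, x_scale = quantize(x, axis=1)   # per-token activation scales
acc = xq.astype(np.int32) @ wq.astype(np.int32)
y_sq = acc.astype(np.float32) * x_scale * w_scale

print(np.abs(y_wo - x @ w).max(), np.abs(y_sq - x @ w).max())
```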