@leeeizhang your ONNX graph looks correct to me. TRT will recognize the Q/DQ layers before the `MatMul` and replace them with a quantized `MatMul` (i.e., in TRT the input and weights are quantized to int8, and an int8 GEMM is used for the `MatMul`).
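For anyone who wants to confirm this on their own model, here is a minimal sketch (not from this thread; the file name is a placeholder and the TensorRT 8.x Python API usage is an assumption) that builds an engine from the Q/DQ ONNX and prints the per-layer information, which includes the precision the builder actually chose for each layer:

```python
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.INFO)

def build_and_inspect(onnx_path: str) -> None:
    builder = trt.Builder(LOGGER)
    # EXPLICIT_BATCH is required for ONNX networks on TensorRT 8.x.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX model")

    config = builder.create_builder_config()
    # Allow int8 kernels so the Q/DQ pairs can be fused into int8 GEMMs.
    config.set_flag(trt.BuilderFlag.INT8)
    # DETAILED verbosity keeps per-layer information in the built engine.
    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

    serialized = builder.build_serialized_network(network, config)
    engine = trt.Runtime(LOGGER).deserialize_cuda_engine(serialized)

    # The engine inspector reports, per layer, the tactic and tensor formats
    # (e.g. Int8 vs. Float) that the builder selected.
    inspector = engine.create_engine_inspector()
    print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

build_and_inspect("model_int8_qdq.onnx")  # placeholder path
```

In the JSON output, the quantized `MatMul` layers should show int8 input formats rather than fp32.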
LGTM! Many thanks!
I used modelopt to quantize my model into an int8 ONNX model. However, when I visualize its ONNX graph, I am not sure whether it is computing in full precision or in int8.

It seems like the input and weights are quantized to int8 for storage in GPU memory, but before the `MatMul` operations these inputs and weights are still dequantized to full precision (e.g., fp32) for computation. Correct me if I am wrong.

I also profiled its execution runtime, and the kernel names are:
- `sm80_xmma_gemm_i8f32_i8i32_f32_tn_n_tilesize128x128x64_stage3_warpsize2x2x1_tensor16x8x32_execute_kernel_trt`
- `sm80_xmma_gemm_f32f32_tf32f32_f32_nn_n_tilesize64x128x16_stage4_warpsize2x2x1_tensor16x8x8_execute_kernel_trt`
- `sm80_xmma_gemm_i8f32_i8i32_f32_tn_n_tilesize128x128x64_stage3_warpsize2x2x1_tensor16x8x32_fused`
So what do `i8f32` and `i8i32` mean? Does it mean the `int8` weights/inputs are converted into `f32` or `int32`?
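For reference, one quick way to check the first point, i.e., that the exported model really is an explicit-quantization (Q/DQ) graph, is to list what feeds each `MatMul`. This is only an illustrative sketch (the file name is a placeholder and the helper is not part of modelopt or TensorRT):

```python
import onnx

model = onnx.load("model_int8_qdq.onnx")  # placeholder path

# Map every tensor name to the node that produces it.
producers = {out: node for node in model.graph.node for out in node.output}

for node in model.graph.node:
    if node.op_type in ("MatMul", "Gemm"):
        feeders = [producers[i].op_type for i in node.input if i in producers]
        print(node.name or node.op_type, "<-", feeders)
```

Seeing `DequantizeLinear` on both `MatMul` inputs is expected for an explicit-quantization graph; as noted above, TensorRT fuses those Q/DQ pairs with the `MatMul` into an int8 GEMM at build time rather than literally dequantizing to fp32 first.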