NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to quantize Linear/LN/ReLU-like structures with int8. #4242

Open WeixiangXu opened 2 weeks ago

WeixiangXu commented 2 weeks ago

My TensorRT version is 8.6.10 on Orin.

My model has a Linear/LayerNorm/ReLU-like structure, as shown below: [image]

I added Q/DQ nodes before the MatMul node to enable INT8, as shown below: [image]
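(For reference, a minimal sketch of this kind of Q/DQ insertion, assuming NVIDIA's pytorch-quantization toolkit; the layer sizes, names, and single-batch calibration below are illustrative, not the actual model:)

```python
import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn

# Illustrative block: Linear -> LayerNorm -> ReLU. QuantLinear fake-quantizes
# its input and weight, so Q/DQ nodes land in front of the MatMul in ONNX,
# while LayerNorm and ReLU stay in higher precision.
class QDQLinearLNReLU(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fc = quant_nn.QuantLinear(dim, dim)  # default per-tensor input / per-channel weight quant
        self.ln = nn.LayerNorm(dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.ln(self.fc(x)))

model = QDQLinearLNReLU().eval()
dummy = torch.randn(1, 128, 256)

# Quick max-calibration pass so each quantizer has an amax before export.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.enable_calib()
        m.disable_quant()
with torch.no_grad():
    model(dummy)
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.load_calib_amax()
        m.disable_calib()
        m.enable_quant()

# Export the fake-quant ops as ONNX QuantizeLinear/DequantizeLinear nodes.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, dummy, "linear_ln_relu_qdq.onnx", opset_version=17)
```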

However, INT8 is slower than FP16.

I plotted the INT8 engine graph, shown below: [image]

What is the best practice for quantizing Linear/LN/ReLU-like structures? They account for about 50% of the latency in my model.

lix19937 commented 2 weeks ago

You can export the ONNX model with opset=17, which makes LayerNorm a single node.
On the other hand, running LayerNorm in INT8 usually hurts model accuracy significantly.
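(For example, a minimal export sketch; the layer sizes and file name are placeholders. With opset 17 or later, nn.LayerNorm exports as a single ONNX LayerNormalization node instead of a ReduceMean/Sub/Pow/Sqrt/Div subgraph, which TensorRT can then fuse:)

```python
import torch
import torch.nn as nn

# Placeholder model with the Linear/LayerNorm/ReLU pattern in question.
model = nn.Sequential(
    nn.Linear(256, 256),
    nn.LayerNorm(256),
    nn.ReLU(),
).eval()

dummy = torch.randn(1, 128, 256)

# opset_version=17 keeps LayerNorm as one LayerNormalization node.
torch.onnx.export(model, dummy, "linear_ln_relu.onnx", opset_version=17)
```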

lix19937 commented 2 weeks ago

You can also refer to TensorRT-LLM or FasterTransformer to implement a custom layer.

WeixiangXu commented 1 week ago

You can export the ONNX model with opset=17, which makes LayerNorm a single node. On the other hand, running LayerNorm in INT8 usually hurts model accuracy significantly.

@lix19937 Thanks for your reply!

I upgraded the opset to 17. [image]

However, INT8 with Q/DQ nodes is still slower than FP16 (INT8: 7.5 ms vs. FP16: 6 ms).

WeixiangXu commented 1 week ago

@ttyio @zerollzeng Could you please share any thoughts you might have?

lix19937 commented 1 week ago

However, INT8 with Q/DQ nodes is still slower than FP16 (INT8: 7.5 ms vs. FP16: 6 ms).

You can try testing an ONNX model that includes only Transpose + MatMul + LayerNorm + Add + ReLU, then compare the latency.
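(For example, a minimal standalone module along those lines; the shapes and names are illustrative, and the resulting ONNX can be built once in FP16 and once in INT8 with Q/DQ to isolate where the slowdown comes from:)

```python
import torch
import torch.nn as nn

# Minimal block containing only the ops in question:
# Transpose -> MatMul -> LayerNorm -> Add -> ReLU.
class TinyBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):
        x = x.transpose(1, 2)             # Transpose
        x = torch.matmul(x, self.weight)  # MatMul
        x = self.ln(x)                    # LayerNorm
        x = x + self.bias                 # Add
        return torch.relu(x)              # ReLU

model = TinyBlock().eval()
dummy = torch.randn(1, 256, 128)
torch.onnx.export(model, dummy, "tiny_block.onnx", opset_version=17)
```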