WeixiangXu opened this issue 2 weeks ago
You can export ONNX with opset 17, which makes LayerNorm export as a single node.
On the other hand, LayerNorm in the INT8 data type will usually hurt the accuracy of the model significantly.
Also, you can refer to TensorRT-LLM or FasterTransformer to see how to implement a custom layer.
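For reference, a minimal sketch of such an export, using a placeholder Linear/LN/ReLU block (the real model and shapes are not shown here):

```python
import torch
import torch.nn as nn

# Placeholder block mirroring the Linear/LN/ReLU pattern discussed in this issue.
class LinearLnRelu(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.ln = nn.LayerNorm(dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.ln(self.linear(x)))

model = LinearLnRelu().eval()
dummy = torch.randn(1, 128, 256)

# With opset_version=17, recent PyTorch exports torch.nn.LayerNorm as one
# LayerNormalization node instead of a ReduceMean/Sub/Mul/Add subgraph.
torch.onnx.export(
    model,
    dummy,
    "linear_ln_relu.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
)
```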
@lix19937 Thanks for your reply!
I upgraded the opset to 17.
However, INT8 with Q/DQ nodes is still slower than FP16 (INT8: 7.5 ms vs. FP16: 6 ms).
@ttyio @zerollzeng Could you please share any thoughts you might have?
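For reference, a minimal sketch of how the two builds can be timed with trtexec, assuming placeholder ONNX file names:

```python
import subprocess

# Hypothetical file names; substitute the actual exported models.
FP16_ONNX = "model.onnx"       # plain FP32 export
QDQ_ONNX = "model_qdq.onnx"    # export with explicit Q/DQ nodes

def run_trtexec(onnx_path, extra_flags):
    cmd = ["trtexec", f"--onnx={onnx_path}"] + extra_flags
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# FP16 baseline build.
run_trtexec(FP16_ONNX, ["--fp16"])

# Explicit-quantization build: enable INT8 and keep FP16 available so that
# layers which do not benefit from INT8 can still fall back to FP16.
run_trtexec(QDQ_ONNX, ["--int8", "--fp16"])

# Compare the "GPU Compute Time" summary (mean/median) printed at the end
# of each trtexec run.
```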
You can try to test an ONNX model that only includes Transpose + MatMul + LayerNorm + Add + ReLU, then compare the latency.
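A minimal sketch of such a reduced test model, with placeholder shapes, exported so the FP16 and Q/DQ builds can be timed in isolation:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Transpose + MatMul + LayerNorm + Add + ReLU, matching the pattern above."""
    def __init__(self, dim=256):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, dim, seq) -> transpose to (batch, seq, dim)
        y = x.transpose(1, 2)
        y = torch.matmul(y, self.weight)   # exports as MatMul
        y = self.ln(y)                     # one LayerNormalization node with opset 17
        y = y + x.transpose(1, 2)          # residual Add
        return torch.relu(y)

model = TinyBlock().eval()
dummy = torch.randn(1, 256, 128)
torch.onnx.export(model, dummy, "tiny_block.onnx", opset_version=17)

# The FP16 and INT8 (Q/DQ) builds of tiny_block.onnx can then be compared
# with trtexec, as in the earlier sketch.
```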
My TensorRT version is 8.6.10 on Orin.
My model has a Linear/LN/ReLU-like structure, as below:
I add Q/DQ nodes before the MatMul node to do INT8, as below:
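A minimal sketch of one common way to get such Q/DQ nodes in front of the MatMul, using NVIDIA's pytorch-quantization toolkit with a placeholder model and random calibration data (not the original setup):

```python
import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

# Patch torch.nn.Linear (and other supported layers) so newly built models
# use quantized versions carrying input/weight TensorQuantizers.
quant_modules.initialize()

# Placeholder block standing in for the real Linear/LN/ReLU model.
class LinearLnRelu(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # becomes quant_nn.QuantLinear
        self.ln = nn.LayerNorm(dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.ln(self.linear(x)))

model = LinearLnRelu().eval()

# Calibrate: collect amax statistics on a few (random placeholder) batches.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()

with torch.no_grad():
    for _ in range(8):
        model(torch.randn(1, 128, 256))

for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.load_calib_amax()
        m.enable_quant()
        m.disable_calib()

# Export fake-quant as ONNX QuantizeLinear/DequantizeLinear (Q/DQ) pairs,
# which land right before the MatMul of each Linear in the exported graph.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 128, 256), "model_qdq.onnx",
                  opset_version=17)
```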
However, INT8 is slower than FP16.
I drew the INT8 engine graph, as below.
What is the best practice for quantizing Linear/LN/ReLU-like structures, which take about 50% of the latency in my model?
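One way to narrow this down is a per-layer profile of the INT8 engine; a minimal sketch with trtexec, assuming a placeholder file name:

```python
import subprocess

# Hypothetical file name for the Q/DQ model; substitute the real one.
cmd = [
    "trtexec",
    "--onnx=model_qdq.onnx",
    "--int8", "--fp16",
    "--profilingVerbosity=detailed",  # keep full layer names in the engine
    "--dumpLayerInfo",                # show what each layer was fused into
    "--dumpProfile",                  # print per-layer timing
    "--separateProfileRun",           # profile in a separate run so timing is not skewed
]
subprocess.run(cmd, check=True)

# In the per-layer profile, look for Reformat layers around the Q/DQ
# boundaries and for MatMul/LayerNormalization layers that stayed in
# FP16/FP32 -- unfused reformats and fallbacks are the usual reason a
# Q/DQ engine ends up slower than the plain FP16 build.
```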