yuanjiechen opened this issue 3 months ago
Reformat layers are usually caused by data-type switching or data memory layout adjustments.
Thanks for your reply. I understand the usage of the reformat layer. But in the figure, most inputs and outputs of the reformat layers are the same INT8 NCHW32, and I wonder why it transforms NCHW32 to NCHW4 and then transforms back. Compared to the PTQ model's trex results, there are only 20+ reformat layers; is this a mistake in my Q/DQ placement?
Another question: does TRT 8.5 support channel-wise activation quantization on conv layers? For example, setting axis=(1) and giving the same number of amax values as input channels in pytorch-quantization; the model can be exported to ONNX, but an error occurs when converting it to TRT. (Actually, I put the channel-wise activation Q/DQ before the split layer.)
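Roughly what I mean, as a sketch (API details from memory; this may not match my actual code exactly):

```python
# Sketch only: per-channel *activation* quantization with pytorch-quantization.
# axis=(1) quantizes along the channel dimension of an NCHW activation,
# so amax must hold one value per input channel; shapes/values below are made up.
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

act_desc = QuantDescriptor(num_bits=8, axis=(1,))        # channel axis of NCHW activations
conv = quant_nn.QuantConv2d(32, 64, kernel_size=3, padding=1,
                            quant_desc_input=act_desc)
conv.input_quantizer.amax = torch.ones(1, 32, 1, 1)      # one amax per input channel
y = conv(torch.randn(1, 32, 64, 64))
```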
most inputs and outputs of the reformat layers are the same INT8 NCHW32, and I wonder why it transforms NCHW32 to NCHW4 and then transforms back?
INT8 NCHW32 --> NCHW4 --> NCHW32, like the following:
TensorRT just chooses the best layouts for the concat and conv ops to get the best performance.
Compared to the PTQ model's trex results, there are only 20+ reformat layers; is this a mistake in my Q/DQ placement?
Yes. Can you upload the SVG file of the PTQ engine?
@lix19937 Thanks for your reply. Here is the PTQ engine SVG figure.
@yuanjiechen could you upgrade your TRT version to the latest 10.x and retry? Thanks!
@ttyio Thanks for your reply! The latest TRT version I can find for Jetson Orin is 8.6.x.
yes, let's see if 8.6 helps here. thanks!
@ttyio Thanks for your reply! My Orin NX died after updating to 8.6 via apt, so I reflashed the disk and successfully ran JetPack 6.0. But when I build the engine from the same ONNX I used with 8.5, the engine takes much more time on TRT 8.6, almost 3~4x. I also tried another NMS method, the INMSLayer, but it hits another bug:
[08/15/2024-10:29:05] [TRT] [E] (Unnamed Layer* 1057) [Constant]: constant weights has count 1 but 0 was expected
[08/15/2024-10:29:05] [TRT] [E] [network.cpp::addNMS::1636] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/network.cpp::addNMS::1636, condition: maxOutputBoxesPerClass.getDimensions().nbDims == 0)
Here is my code related to the error (slice_box is a slice layer output with shape (1, 8400, 4), slice_class is a slice layer output with shape (1, 8400, 20)):

max_det = 300  # an integer
det_pack = self.network.add_constant([1], np.array(max_det, dtype=np.int32))
iou_pack = self.network.add_constant([1], np.array([iou_thres], dtype=np.float32))
conf_pack = self.network.add_constant([1], np.array([conf_thres], dtype=np.float32))
nms_layer = self.network.add_nms(slice_box.get_output(0), slice_class.get_output(0), det_pack.get_output(0))  # --> error
nms_layer.set_input(3, iou_pack)
nms_layer.set_input(4, conf_pack)
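From the error message it looks like maxOutputBoxesPerClass must be a 0-D (scalar) tensor, and set_input expects an ITensor rather than a layer, so I guess the change would be something like this (unverified sketch; the empty-shape constants and shapes are my assumptions):

```python
# Unverified guess based on the error (nbDims == 0 expected):
# build the constants as 0-D scalars and pass ITensors to set_input.
det_pack = self.network.add_constant((), np.array([max_det], dtype=np.int32))        # 0-D INT32 scalar
iou_pack = self.network.add_constant((), np.array([iou_thres], dtype=np.float32))    # 0-D FP32 scalar
conf_pack = self.network.add_constant((), np.array([conf_thres], dtype=np.float32))  # 0-D FP32 scalar

nms_layer = self.network.add_nms(slice_box.get_output(0),
                                 slice_class.get_output(0),
                                 det_pack.get_output(0))
nms_layer.set_input(3, iou_pack.get_output(0))   # optional IoU-threshold input
nms_layer.set_input(4, conf_pack.get_output(0))  # optional score-threshold input
```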
What are the differences between the 3 NMS implementation methods in TRT (INMSLayer, batchedNMSPlugin, EfficientNMSPlugin)?
I tested the same ONNX model on another Orin NX 8GB; it runs 2.5x faster than on the 16GB board (46 ms -> 21 ms). Even though TRT 8.6 produces about 10% fewer reformat layers than 8.5, it is still far from the PTQ result.
Refer to the PTQ engine's fusion to adjust the Q/DQ locations.
Description
I want to fine-tune a quantized YOLO model and export it to TRT. I carefully read the Q/DQ documentation and some existing issues to place Q/DQ nodes and remove unused ones; the model has 92% INT8-precision layers but still has 70 reformat layers.
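Roughly, my workflow looks like the following sketch (function names such as load_yolo_model are placeholders, not my actual code):

```python
# Sketch: insert Q/DQ nodes with pytorch-quantization, fine-tune, then export
# an ONNX with explicit QuantizeLinear/DequantizeLinear nodes for TRT.
import torch
from pytorch_quantization import quant_modules, nn as quant_nn

quant_modules.initialize()        # replace nn.Conv2d etc. with quantized versions
model = load_yolo_model()         # placeholder: build/load the YOLO model
# ... calibrate amax and fine-tune (QAT) here ...

quant_nn.TensorQuantizer.use_fb_fake_quant = True   # export Q/DQ as ONNX nodes
model.eval()
dummy = torch.randn(1, 3, 640, 640)
torch.onnx.export(model, dummy, "model.onnx", opset_version=13)
```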
Environment
TensorRT Version: 8.5.0.2
NVIDIA GPU: AGX Orin 16GB
NVIDIA Driver Version:
CUDA Version: 11.4
CUDNN Version:
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8.10
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
model.zip
Use the normal trtexec command to convert the ONNX model to an engine.
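For reference, a typical invocation (my exact flags may differ; file names here are placeholders):

```
trtexec --onnx=model.onnx --int8 --saveEngine=model.engine
```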