NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Fake quantization ONNX model parse ERROR using TensorRT7.2 #994

Closed ShiinaMitsuki closed 3 years ago

ShiinaMitsuki commented 3 years ago

Description

An error occurred while parsing a fake-quantization ONNX model with TensorRT 7.2.1.6, following the guidance of the pytorch-quantization toolbox provided in the TensorRT 7.2 release.

Error Message:

Loading ONNX file from path checkpoints/rfdn_asx4_nf64nm2inc3_calibrated_op10.onnx...
Beginning ONNX file parsing
[TensorRT] ERROR: QuantizeLinear_7_quantize_scale_node: shift weights has count 64 but 3 was expected
[TensorRT] ERROR: QuantizeLinear_7_quantize_scale_node: shift weights has count 64 but 3 was expected
[TensorRT] ERROR: QuantizeLinear_7_quantize_scale_node: shift weights has count 64 but 3 was expected
ERROR: Failed to parse the ONNX file.
In node 8 (importDequantizeLinear): INVALID_NODE: Assertion failed: K == scale.count()
Traceback (most recent call last):
  File "qaonnx2trt.py", line 65, in <module>
    with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
AttributeError: __enter__

Environment

TensorRT Version: 7.2.1.6
GPU Type: NVIDIA RTX 2070
Nvidia Driver Version: 440.33.01
CUDA Version: CUDA 10.2
CUDNN Version: CUDNN 8.0
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): 3.6.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

onnx model, code

Steps To Reproduce

Please include:

ERROR:

root@sobey:/project/tensorrt-quantize/test# PYTHONPATH=../ python qaonnx2trt.py 
Loading ONNX file from path checkpoints/rfdn_asx4_nf64nm2inc3_calibrated_op10.onnx...
Beginning ONNX file parsing
[TensorRT] ERROR: QuantizeLinear_7_quantize_scale_node: shift weights has count 64 but 3 was expected
[TensorRT] ERROR: QuantizeLinear_7_quantize_scale_node: shift weights has count 64 but 3 was expected
[TensorRT] ERROR: QuantizeLinear_7_quantize_scale_node: shift weights has count 64 but 3 was expected
ERROR: Failed to parse the ONNX file.
In node 8 (importDequantizeLinear): INVALID_NODE: Assertion failed: K == scale.count()
Traceback (most recent call last):
  File "qaonnx2trt.py", line 65, in <module>
    with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
AttributeError: __enter__
ttyio commented 3 years ago

Hello @ShiinaMitsuki , thanks for reporting. Full support for importing an ONNX model exported from the pytorch-quantization tool into ONNX-TRT will be available in the next major release. Before that, we have to use setDynamicRange to import an INT8 network from ONNX. There is a sample, DemoBERT, that uses this method: see load_onnx_weights_and_quant in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L478 and set_dynamic_range in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L113
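(For reference, a minimal sketch of the setDynamicRange idea with the TensorRT Python API; this is not the DemoBERT code itself, and the tensor names and amax values below are hypothetical placeholders standing in for values read from a pytorch-quantization checkpoint.)

import tensorrt as trt

# Hypothetical per-tensor amax values read from a pytorch-quantization checkpoint.
amax_by_tensor = {"input": 2.64, "conv1_out": 5.12}

def set_dynamic_ranges(network, amax_by_tensor):
    # Set symmetric INT8 dynamic ranges [-amax, amax] on named tensors.
    for i in range(network.num_inputs):
        tensor = network.get_input(i)
        if tensor.name in amax_by_tensor:
            amax = amax_by_tensor[tensor.name]
            tensor.set_dynamic_range(-amax, amax)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        for j in range(layer.num_outputs):
            tensor = layer.get_output(j)
            if tensor.name in amax_by_tensor:
                amax = amax_by_tensor[tensor.name]
                tensor.set_dynamic_range(-amax, amax)

# The builder config also needs the INT8 flag:
# config.set_flag(trt.BuilderFlag.INT8)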

ShiinaMitsuki commented 3 years ago

> Hello @ShiinaMitsuki , thanks for reporting. Full support for importing an ONNX model exported from the pytorch-quantization tool into ONNX-TRT will be available in the next major release. Before that, we have to use setDynamicRange to import an INT8 network from ONNX. There is a sample, DemoBERT, that uses this method: see load_onnx_weights_and_quant in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L478 and set_dynamic_range in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L113

Thanks for the reply.

How should the ONNX be exported from PyTorch after fake quantization with the pytorch-quantization package?

I followed the guidance in https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html#export-to-onnx, pasted the code snippet into torch/onnx/symbolic_opset10.py, and then exported my model using:

import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_nn.TensorQuantizer.use_fb_fake_quant = True
quant_modules.initialize()

from test.models.rfdn import RFDN_ASX4_nf64m2

model = RFDN_ASX4_nf64m2()

calibrated_model = 'checkpoints/rfdn_asx4_nf64nm2inc3_calibrated.pt'
onnx_save_path = calibrated_model.replace('.pt', '_op10.onnx')

state_dict = torch.load(calibrated_model, map_location='cpu')
model.load_state_dict(state_dict)
model.cuda()
model.eval()

dummy_input = torch.zeros(1, 3, 2160, 3840, requires_grad=False).cuda()

torch.set_grad_enabled(False)
# enable_onnx_checker needs to be disabled.
torch.onnx.export(model,
                  dummy_input,
                  onnx_save_path,
                  verbose=True,
                  input_names=['input'],
                  output_names=['output'],
                  opset_version=10,
                  enable_onnx_checker=False)

I compared my exported ONNX model with the BERT model (bert_large_v1_1_fake_quant.onnx) by printing the initializer names:

import onnx

model = onnx.load(path)  # path to the exported ONNX model
# print(onnx.helper.printable_graph(model.graph))
weights = model.graph.initializer
for w in weights:
    print(w.name)

The two are quite different.

Below is the output from my model:

643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
B1.c1_d.bias
B1.c1_d.weight
B1.c1_r.bias
B1.c1_r.weight
B1.c2_d.bias
B1.c2_d.weight
B1.c2_r.bias
B1.c2_r.weight
B1.c3_d.bias
B1.c3_d.weight
B1.c3_r.bias
B1.c3_r.weight
B1.c4.bias
B1.c4.weight
B1.c5.bias
B1.c5.weight
B1.esa.conv1.bias
B1.esa.conv1.weight
B1.esa.conv2.bias
B1.esa.conv2.weight
B1.esa.conv3.bias
B1.esa.conv3.weight
B1.esa.conv3_.bias
B1.esa.conv3_.weight
B1.esa.conv4.bias
B1.esa.conv4.weight
B1.esa.conv_f.bias
B1.esa.conv_f.weight
B1.esa.conv_max.bias
B1.esa.conv_max.weight
B2.c1_d.bias
B2.c1_d.weight
B2.c1_r.bias
B2.c1_r.weight
B2.c2_d.bias
B2.c2_d.weight
B2.c2_r.bias
B2.c2_r.weight
B2.c3_d.bias
B2.c3_d.weight
B2.c3_r.bias
B2.c3_r.weight
B2.c4.bias
B2.c4.weight
B2.c5.bias
B2.c5.weight
B2.esa.conv1.bias
B2.esa.conv1.weight
B2.esa.conv2.bias
B2.esa.conv2.weight
B2.esa.conv3.bias
B2.esa.conv3.weight
B2.esa.conv3_.bias
B2.esa.conv3_.weight
B2.esa.conv4.bias
B2.esa.conv4.weight
B2.esa.conv_f.bias
B2.esa.conv_f.weight
B2.esa.conv_max.bias
B2.esa.conv_max.weight
LR_conv3.bias
LR_conv3.weight
c3.0.bias
c3.0.weight
fea_conv.0.bias
fea_conv.0.weight
fea_conv.1.bias
fea_conv.1.weight
upsamplerx4.0.bias
upsamplerx4.0.weight

and here is part of the BERT model:

bert.embeddings.LayerNorm.bias
bert.embeddings.LayerNorm.weight
bert.embeddings.position_embeddings._weight_quantizer._amax
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings._weight_quantizer._amax
bert.embeddings.token_type_embeddings.weight
bert.embeddings.word_embeddings._weight_quantizer._amax
bert.embeddings.word_embeddings.weight
bert.encoder.final_input_quantizer._amax
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.add_local_input_quantizer._amax
bert.encoder.layer.0.attention.output.add_residual_input_quantizer._amax
...

There are no names like _quantizer._amax in my model; I don't know why.

ttyio commented 3 years ago

Hello @ShiinaMitsuki , amax is only computed during calibration/QAT training, and fake_quant is needed to train and compute amax; fb_fake_quant is only needed when exporting to ONNX. So call quant_nn.TensorQuantizer.use_fb_fake_quant = True right before exporting to ONNX instead of at the beginning. Please follow this sample for more details: https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html.
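(A minimal sketch of the intended order, assuming a model class and an already-calibrated checkpoint whose names here are placeholders: calibrate/train with the flag off, then flip it on only for the export.)

import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()                    # swap in quantized module definitions

model = MyModel()                             # placeholder: your network class
state = torch.load("calibrated.pt", map_location="cpu")   # checkpoint already contains amax
model.load_state_dict(state)
model.cuda().eval()

# Switch to the fake-quant ops that export as ONNX QuantizeLinear/DequantizeLinear
# only now, right before the export.
quant_nn.TensorQuantizer.use_fb_fake_quant = True

dummy = torch.zeros(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "model_qat.onnx",
                  opset_version=13)           # older setups in this thread used opset 10 with enable_onnx_checker=False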

JosephChenHub commented 3 years ago

Hi @ttyio , I also met the same issue, could you help me? Here is the graph (screenshot attached) and here is the log (screenshot attached).

ttyio commented 3 years ago

Hello @JosephChenHub , currently the open-sourced pytorch-quantization can generate ONNX, but the importer from ONNX with Q/DQ nodes to TRT is not ready in 7.x. Full support will be available in the next major release. Before that, we have to use setDynamicRange to import an INT8 network from ONNX.

There is a sample, DemoBERT, that uses this method: see load_onnx_weights_and_quant in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L478 and set_dynamic_range in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L113

JosephChenHub commented 3 years ago

> Hello @JosephChenHub , currently the open-sourced pytorch-quantization can generate ONNX, but the importer from ONNX with Q/DQ nodes to TRT is not ready in 7.x. Full support will be available in the next major release. Before that, we have to use setDynamicRange to import an INT8 network from ONNX.
>
> There is a sample, DemoBERT, that uses this method: see load_onnx_weights_and_quant in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L478 and set_dynamic_range in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L113

Do you mean that we can manually set the dynamic range of each tensor by reading the scale after QAT?

ttyio commented 3 years ago

> Hello @JosephChenHub , currently the open-sourced pytorch-quantization can generate ONNX, but the importer from ONNX with Q/DQ nodes to TRT is not ready in 7.x. Full support will be available in the next major release. Before that, we have to use setDynamicRange to import an INT8 network from ONNX. There is a sample, DemoBERT, that uses this method: see load_onnx_weights_and_quant in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L478 and set_dynamic_range in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L113
>
> Do you mean that we can manually set the dynamic range of each tensor by reading the scale after QAT?

Yes, we can load the amax from ONNX and set the per-tensor activation scale using setDynamicRange; the per-channel scale for weights is set automatically by TensorRT.
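(A rough sketch of reading the per-tensor activation scales back out of the exported ONNX, assuming each QuantizeLinear node's scale is a size-1 initializer; the file path is a placeholder. The resulting amax values can then be applied with set_dynamic_range(-amax, amax) as in the earlier sketch.)

import onnx
from onnx import numpy_helper

def activation_amax_from_onnx(path):
    # Collect amax = scale * 127 for each tensor feeding a QuantizeLinear node.
    model = onnx.load(path)
    inits = {init.name: numpy_helper.to_array(init) for init in model.graph.initializer}
    amax = {}
    for node in model.graph.node:
        if node.op_type == "QuantizeLinear":
            data_input, scale_input = node.input[0], node.input[1]
            scale = inits.get(scale_input)
            if scale is not None and scale.size == 1:   # per-tensor scale only
                amax[data_input] = float(scale.reshape(-1)[0]) * 127.0
    return amax

print(activation_amax_from_onnx("model_qat.onnx"))      # placeholder path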

maoxiaoming86 commented 3 years ago

> Hello @ShiinaMitsuki , amax is only computed during calibration/QAT training, and fake_quant is needed to train and compute amax; fb_fake_quant is only needed when exporting to ONNX. So call quant_nn.TensorQuantizer.use_fb_fake_quant = True right before exporting to ONNX instead of at the beginning. Please follow this sample for more details: https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html.

Hello @ShiinaMitsuki , did you produce '_quantizer._amax' successfully? I am still failing.

Ricardosuzaku commented 3 years ago

> Hello @ShiinaMitsuki , thanks for reporting. Full support for importing an ONNX model exported from the pytorch-quantization tool into ONNX-TRT will be available in the next major release. Before that, we have to use setDynamicRange to import an INT8 network from ONNX. There is a sample, DemoBERT, that uses this method: see load_onnx_weights_and_quant in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L478 and set_dynamic_range in https://github.com/NVIDIA/TensorRT/blob/release/7.2/demo/BERT/builder.py#L113

Hi @ttyio , can TensorRT 8.0 import a pytorch-quantization ONNX model? I mean, can I parse the ONNX model and build a usable TRT engine?

ttyio commented 3 years ago

@Ricardosuzaku yes.

ttyio commented 3 years ago

Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!

k9ele7en commented 3 years ago

Hi @ttyio , I upgraded TensorRT to v8.0.1.6 (GA), but the error at the top of this topic still appears; I am not sure whether my runtime is actually using v8 or something else is wrong. Can you give me some ideas? Thank you.

ttyio commented 3 years ago

Hi @k9ele7en , how did you generate the ONNX? Are you using trtexec or another tool to run the ONNX? Thanks

k9ele7en commented 3 years ago

Thanks for your response. I wrote explicit code to convert the ONNX into a TRT engine, and I already used set_flag to enable INT8 for the quantized model...

ttyio commented 3 years ago

Hello @k9ele7en , is your ONNX generated using the pytorch-quantization toolbox?

k9ele7en commented 3 years ago

I followed the ONNX export as in the example (https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/torchvision/classification_flow.py): I add quant_nn.TensorQuantizer.use_fb_fake_quant = True and quant_modules.initialize() before initializing the model. I also use opset 13 for ONNX. Do you mean I need to add the part shown in the attached screenshot into torch/onnx/symbolic_opset10.py?

(screenshot of the referenced code snippet)
ttyio commented 3 years ago

@k9ele7en , no need to change symbolic_opset10.py. Have you tried an NGC container, e.g. nvcr.io/nvidian/pytorch:21.07? Thanks

k9ele7en commented 3 years ago

I ran locally in a Conda environment with Torch 1.9 and TensorRT 8 installed. Do you think the problem comes from torch.onnx and that using an NGC container may solve it? I need some reason before trying it, you know...

ttyio commented 3 years ago

@k9ele7en , it seems some weight counts do not match in the model; could you provide a simple repro for debugging? Thanks

k9ele7en commented 3 years ago

Thanks @ttyio for giving some direction, but this is an internal project and I cannot share the explicit code. Overall, I use monkey-patching to replace layers in the original network with QuantConv2d, do PTQ, then convert to ONNX, but I get an error in the TRT conversion step... Does TRT 8.0 fully support INT8 quantized models?

ttyio commented 3 years ago

@k9ele7en , does the automatically quantized QAT model pass TRT? If so, could you compare the difference between the automatic one and the monkey-patched one? We cannot run arbitrary quantization settings on TRT; for example, the per-channel scale can only be applied on the output feature channel of convolution weights.
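(For what it's worth, a minimal sketch of the supported layout using pytorch-quantization's quantization descriptors: per-channel weight scales on axis 0, the output feature channel, and a per-tensor activation scale; the layer shapes are arbitrary.)

from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Per-channel weight scales on axis 0 (output feature channel), per-tensor activation scale.
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
input_desc = QuantDescriptor(num_bits=8)

conv = quant_nn.QuantConv2d(3, 64, kernel_size=3, padding=1,
                            quant_desc_weight=weight_desc,
                            quant_desc_input=input_desc)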

thanhnt-2658 commented 3 years ago

@ttyio "There are no names like _quantizer._amax in my model; I don't know why." — I got this problem too. Both the PTQ and QAT .pth models have _amax, but the .onnx model does not.

thanhnt-2658 commented 3 years ago

@ShiinaMitsuki @k9ele7en @maoxiaoming86 Did you guys get past this issue? I would love to know. Thank you.

Scass0807 commented 2 years ago

@thanhnt-2658 Did you figure this out?

ttyio commented 2 years ago

Sorry for the delayed response. It seems we no longer have _amax in the exported ONNX, possibly because these _amax tensors are actually unused weights. Could you use the _amax in the checkpoint? Or maybe use y_scale * 127 from the QuantizeLinear node as the _amax? Thanks
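(A small sketch of the first suggestion, assuming the checkpoint is a plain state_dict saved by pytorch-quantization; the file name is a placeholder, and the _amax entries are the per-quantizer calibration maxima.)

import torch

state_dict = torch.load("calibrated.pt", map_location="cpu")   # placeholder checkpoint
amax_by_quantizer = {name: tensor for name, tensor in state_dict.items()
                     if name.endswith("._amax")}
for name, amax in amax_by_quantizer.items():
    print(name, amax.flatten()[:4])   # per-tensor amax is a scalar, per-channel amax is a vector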

Scass0807 commented 2 years ago

> Sorry for the delayed response. It seems we no longer have _amax in the exported ONNX, possibly because these _amax tensors are actually unused weights. Could you use the _amax in the checkpoint? Or maybe use y_scale * 127 from the QuantizeLinear node as the _amax? Thanks

@ttyio do I still need to do this if I use TensorRT 8 or does it work automatically?

ttyio commented 2 years ago

@Scass0807 , for TRT 8 and later you can directly import the ONNX with Q/DQ into TRT, without manually calling setDynamicRange.
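(For completeness, a minimal TRT 8 build sketch; the ONNX file name is a placeholder. With Q/DQ nodes in the ONNX, the INT8 flag is set on the builder config and the scales come from the Q/DQ nodes themselves, so no setDynamicRange calls are needed.)

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_qat.onnx", "rb") as f:       # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)         # scales are taken from the Q/DQ nodes
serialized_engine = builder.build_serialized_network(network, config)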

maoxiaoming86 commented 2 years ago

> @Scass0807 , for TRT 8 and later you can directly import the ONNX with Q/DQ into TRT, without manually calling setDynamicRange.

Does TRT 8 support ONNX with Q/DQ from pytorch-quantization, from the original PyTorch quantization, or both?

ttyio commented 2 years ago

@maoxiaoming86 , from pytorch-quantization, thanks!

lixiaolx commented 2 years ago

Hello @ttyio , can you explain how to get the value of amax? How is this amax calculated? Is there a corresponding formula for it after QAT?