If we use https://github.com/NVIDIA-AI-IOT/cuDLA-samples/tree/main/export#option1, the generated model can also run on the GPU. However, if the Q&DQ nodes of these tensors are inconsistent, there are a lot of useless int8->fp16 and fp16->int8 data conversions in our QAT model, which slows down model inference.
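To make the concern concrete, here is a minimal sketch of my own (not code from this repo) of a quantized residual add whose two inputs own independent `TensorQuantizer`s from NVIDIA's pytorch-quantization toolkit. With different scales on the two branches (here faked via fixed `amax` values instead of calibration), the exported ONNX carries two Q/DQ pairs with different scales around the Add, which is where the extra reformat ops come from:

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

class QuantResidualAdd(torch.nn.Module):
    """Illustrative residual add with an independent input quantizer per branch.

    Because the two quantizers carry different scales, the Add cannot stay in
    int8 end to end, so the engine inserts int8->fp16->int8 conversions around it.
    """
    def __init__(self):
        super().__init__()
        # Fixed amax values stand in for two branches that calibrated differently.
        self._input0_quantizer = quant_nn.TensorQuantizer(QuantDescriptor(num_bits=8, amax=2.0))
        self._input1_quantizer = quant_nn.TensorQuantizer(QuantDescriptor(num_bits=8, amax=5.0))

    def forward(self, x, y):
        return self._input0_quantizer(x) + self._input1_quantizer(y)
```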
In quantize.py I found the following function, which is also used in qat.py. Why do we need to find the quantizer_pairs, and why do we set `major = bottleneck.cv1.conv._input_quantizer`, `bottleneck.addop._input0_quantizer = major`, and `bottleneck.addop._input1_quantizer = major`?
```python
def apply_custom_rules_to_quantizer(model: torch.nn.Module, export_onnx: Callable):
```
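For reference, here is a sketch of what those three assignments do at the module level. It assumes the YOLO-style `Bottleneck` with an `add` flag and an `addop` patched onto it (as in the snippet above); it is not the repo's actual implementation, which also pairs quantizers by inspecting the exported ONNX:

```python
import torch

def share_bottleneck_add_quantizers(model: torch.nn.Module):
    # For every residual Bottleneck, point both inputs of the add at one
    # "major" quantizer. Both Q/DQ pairs then export the same scale/zero-point,
    # so the Add can stay in int8 and the int8->fp16->int8 reformats disappear.
    for name, module in model.named_modules():
        if type(module).__name__ == "Bottleneck" and getattr(module, "add", False):
            major = module.cv1.conv._input_quantizer
            module.addop._input0_quantizer = major
            module.addop._input1_quantizer = major
```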
Thanks.