NVIDIA-AI-IOT / cuDLA-samples

YOLOv5 on Orin DLA

Why should we use apply_custom_rules_to_quantizer? #8

Open dahaiyidi opened 8 months ago

dahaiyidi commented 8 months ago

In quantize.py I find the following function, and it is used in qat.py. Why do we need to find the quantizer pairs? And why do we set `major = bottleneck.cv1.conv._input_quantizer` and then assign both `bottleneck.addop._input0_quantizer` and `bottleneck.addop._input1_quantizer` to `major`?

```python
def apply_custom_rules_to_quantizer(model: torch.nn.Module, export_onnx: Callable):
    # apply rules to graph
    export_onnx(model, "quantization-custom-rules-temp.onnx")
    pairs = find_quantizer_pairs("quantization-custom-rules-temp.onnx")
    print(pairs)
    for major, sub in pairs:
        print(f"Rules: {sub} match to {major}")
        get_attr_with_path(model, sub)._input_quantizer = get_attr_with_path(model, major)._input_quantizer  # why use the same input_quantizer??
    os.remove("quantization-custom-rules-temp.onnx")

    for name, bottleneck in model.named_modules():
        if bottleneck.__class__.__name__ == "Bottleneck":
            if bottleneck.add:
                print(f"Rules: {name}.add match to {name}.cv1")
                major = bottleneck.cv1.conv._input_quantizer
                bottleneck.addop._input0_quantizer = major
                bottleneck.addop._input1_quantizer = major
```

Thanks.

liuanqi-libra7 commented 8 months ago

If we use https://github.com/NVIDIA-AI-IOT/cuDLA-samples/tree/main/export#option1, the generated model can also run on the GPU. However, if the Q/DQ nodes of these tensors are inconsistent, the QAT model ends up with many useless int8->fp16 and fp16->int8 data conversions, which slows down model inference.
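
To make the effect concrete, here is a minimal sketch of the scale-sharing idea. It is not code from this repo and uses a hand-rolled `fake_quant` helper instead of pytorch-quantization; the scale values are arbitrary examples. The point is that when the two inputs of a residual add are quantized on different int8 grids, the add cannot be evaluated directly on the int8 values, whereas one shared scale keeps both inputs on the same grid.

```python
# Illustrative sketch only (assumed helper, not the repo's quantizer classes).
import torch

def fake_quant(x: torch.Tensor, scale: float) -> torch.Tensor:
    # symmetric int8 fake-quantization: quantize to the grid, then dequantize
    return torch.clamp(torch.round(x / scale), -128, 127) * scale

x = torch.randn(4)   # one input branch of the Bottleneck's add
y = torch.randn(4)   # the other branch (e.g. output of cv2)

# Inconsistent scales: each input lives on its own int8 grid, so the add has to
# be computed in higher precision, i.e. the int8->fp16 and fp16->int8 hops.
a = fake_quant(x, scale=0.05)
b = fake_quant(y, scale=0.02)

# Shared scale (what assigning the same _input_quantizer object enforces):
# both inputs use one grid, so the add can stay in int8.
shared = 0.05
a_shared = fake_quant(x, scale=shared)
b_shared = fake_quant(y, scale=shared)

print(a + b)
print(a_shared + b_shared)
```

Assigning the same `_input_quantizer` object to `cv1` and to both inputs of `addop` is the PyTorch-side way of guaranteeing that shared scale, so the exported ONNX carries consistent Q/DQ pairs and the runtime does not need to insert those extra precision conversions around the add.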