NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Quantization flow using TensorRT (what is recommended for CNN?) #4024

Open korkland opened 1 month ago

korkland commented 1 month ago

I have commented the following in the ModelOpt issues, but since there is more activity here, I would like to get feedback on this subject from more people.

First of all, if someone here has positive experience with quantizing CNN models with NVIDIA tools, I would appreciate it if they could share their workflow, as the examples are very limited.

I must say that I'm confused by the options that NVIDIA provides for quantization. We are targeting the Orin architecture and have our own CNN model based on RetinaNet. With the previous vendor, it was very clear: they had one tool. You would take your PyTorch model, convert it to ONNX, and use their tool for quantization, providing it with a config listing the nodes you want to quantize, calibration data, etc.

With NVIDIA, there are too many options, and we haven't found one that satisfies our needs.

There is implicit quantization, which, by the way, is deprecated as of TRT 10, so I don't think we should go in that direction. I tried it anyway, and it doesn't work out of the box. I couldn't figure out how to exclude nodes from being quantized, and I'm getting this error on parts that shouldn't be quantized. Maybe someone could help:

trtexec --onnx=orig.onnx --saveEngine=orig.trt --best

[shapeMachine.cpp::executeContinuation::905] Error Code 7: Internal Error (/interpret_2d/nms/strategy/Expand_1: ISliceLayer has out of bounds access on axis 0 Out of bounds access for slice. Instruction: CHECK_SLICE 287 0 300 1.)

And there is explicit quantization:

For that solution to be a viable option, is there a way to manually add quantizers/dequantizers in this API?

When using quant_pre_process, the engine generation failed with the following error. If someone could help, it would be appreciated:

[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:828: While parsing node number 177 [ScatterND -> "/interpret_2d/nms/strategy/ScatterND_output_0"]:
[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:831: --- Begin node ---
input: "/interpret_2d/nms/strategy/Constant_17_output_0"
input: "/interpret_2d/nms/strategy/Constant_19_output_0"
input: "/interpret_2d/nms/strategy/Reshape_3_output_0"
output: "/interpret_2d/nms/strategy/ScatterND_output_0"
name: "/interpret_2d/nms/strategy/ScatterND"
op_type: "ScatterND"
attribute {
  name: "reduction"
  s: "none"
  type: STRING
}

[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:832: --- End node ---
[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:836: ERROR: onnxOpImporters.cpp:5119 In function importScatterND:
[9] Assertion failed: !attrs.count("reduction"): Attribute reduction is not supported.
[07/24/2024-09:02:34] [E] Failed to parse onnx file

It would be appreciated if someone could clarify the advanced quantization options. Thanks!

lix19937 commented 1 month ago

[shapeMachine.cpp::executeContinuation::905] Error Code 7: Internal Error (/interpret_2d/nms/strategy/Expand_1: ISliceLayer has out of bounds access on axis 0 Out of bounds access for slice. Instruction: CHECK_SLICE 287 0 300 1.)

The slice on axis 0 goes out of bounds (index 300); check your ONNX, or use Polygraphy to optimize it.
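
For example, a constant-folding/sanitize pass with the Polygraphy CLI (file names are placeholders):

polygraphy surgeon sanitize orig.onnx --fold-constants -o orig_sanitized.onnx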

[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:832: --- End node --- [07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:836: ERROR: onnxOpImporters.cpp:5119 In function importScatterND: [9] Assertion failed: !attrs.count("reduction"): Attribute reduction is not supported.

The reduction attribute of ScatterND is not supported by the TensorRT ONNX parser. See https://docs.nvidia.com/deeplearning/tensorrt/operators/docs/Scatter.html
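
Since reduction="none" is the ONNX default, one possible workaround (a sketch, not an official recommendation; paths are placeholders) is to strip that attribute before building the engine:

    import onnx

    model = onnx.load("quantized.onnx")  # placeholder path
    for node in model.graph.node:
        if node.op_type == "ScatterND":
            # reduction="none" is the ONNX default, so dropping the attribute keeps
            # the same semantics while letting the TensorRT parser accept the node
            kept = [a for a in node.attribute
                    if not (a.name == "reduction" and a.s == b"none")]
            del node.attribute[:]
            node.attribute.extend(kept)
    onnx.save(model, "quantized_no_reduction.onnx")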

lix19937 commented 1 month ago

PyTorch quantization, which we tried to use to quantize only the backbone, but the performance was worse than FP16.

You can use QAT to improve accuracy.
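
A rough QAT sketch with ModelOpt (the loaders, optimizer, and helper names are placeholders):

    import modelopt.torch.quantization as mtq

    def forward_loop(model):
        # run a few hundred calibration batches through the model
        for images, _ in calib_loader:            # placeholder data loader
            model(images)

    # insert fake-quant (Q/DQ) modules and run calibration
    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

    # QAT: fine-tune for a few epochs with a small learning rate so the weights
    # adapt to the quantization noise
    for epoch in range(num_qat_epochs):           # placeholder epoch count
        train_one_epoch(model, train_loader, optimizer)   # placeholder helper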

korkland commented 1 month ago

PyTorch quantization, which we tried to use to quantize only the backbone, but the performance was worse than FP16.

You can use QAT to improve accuracy.

Sorry, maybe I wasn't clear: the performance in terms of runtime was worse. I haven't checked the accuracy yet.

lix19937 commented 1 month ago

If the inference latency is worse, the Q/DQ nodes are probably not placed right.

korkland commented 1 month ago

If the inference latency is worse, the Q/DQ nodes are probably not placed right.

I've attached the config I'm using; that's why I would love to see a production-level example of CNN quantization that produces good performance. I haven't seen any examples relevant to CNN models.

lix19937 commented 1 month ago

For a CNN model, first make sure your trtexec --best build passes; usually IQ (https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq#quantize-an-onnx-model) gets the best latency. If IQ does not give acceptable accuracy, then turn to EQ. For EQ, see the following:

If your deployment is on Ampere GPUs or earlier, we recommend using INT4 AWQ or INT8 SQ. If you use INT8 QAT, you can use the earlier pytorch-quantization method to quantize a ResNet-based model: https://github.com/NVIDIA/TensorRT/blob/release/10.2/tools/pytorch-quantization/examples/torchvision/classification_flow.py

As for ModelOpt, it currently mainly supports LLMs; its usability still needs improvement.

korkland commented 1 month ago

For a CNN model, first make sure your trtexec --best build passes; usually IQ (https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq#quantize-an-onnx-model) gets the best latency. If IQ does not give acceptable accuracy, then turn to EQ. For EQ, see the following:

If your deployment is on Ampere GPUs or earlier, we recommend using INT4 AWQ or INT8 SQ. If you use INT8 QAT, you can use the earlier pytorch-quantization method to quantize a ResNet-based model: https://github.com/NVIDIA/TensorRT/blob/release/10.2/tools/pytorch-quantization/examples/torchvision/classification_flow.py

As for ModelOpt, it currently mainly supports LLMs; its usability still needs improvement.

Thanks @lix19937. As I mentioned, I'm using a basic setup:

trtexec --onnx=quantized.onnx --saveEngine=quantize.trt --best #or --int8 --fp16
VS
trtexec --onnx=orig.onnx --saveEngine=orig.trt --best #or --int8 --fp16

When using the modelopt.torch.quantization config as recommended by the docs, enabling quantization only for the backbone (applied roughly as sketched after the config):

    config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
    # config["quant_cfg"]["backbone.*"] = {"enable": True}
    config["quant_cfg"]["*head_2d*"] = {"enable": False}
    config["quant_cfg"]["*interpret_2d*"] = {"enable": False}
    config["quant_cfg"]["*head_3d*"] = {"enable": False}
    config["quant_cfg"]["*output_quantizer"] = {"enable": False}

With this setup, I'm getting worse latency results.

When I switch to modelopt.onnx.quantization.int8, without excluding any nodes and using --best, I'm getting only ~10% improvement in latency.

This makes me wonder whether I need to switch to pytorch_quantization (by NVIDIA), which allows inserting Q/DQ nodes but loses automation and requires model editing. Alternatively, I could switch to FX quantization (by PyTorch; as far as I know, eager mode is not supported by TensorRT). Is there anything I can do with the ModelOpt APIs to improve the latency?
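
For what it's worth, per-layer timings from trtexec can help show where the Q/DQ placement hurts (flags assumed available in recent TensorRT releases):

trtexec --onnx=quantized.onnx --best --dumpProfile --separateProfileRun --profilingVerbosity=detailed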

lix19937 commented 1 month ago

Inserting Q/DQ nodes yourself requires more expertise but gives more flexibility, and it usually yields the best results.

I could switch to FX quantization (by PyTorch; as far as I know, eager mode is not supported by TensorRT). Is there anything I can do with the ModelOpt APIs to improve the latency?

Yes, it's best not to use torch-FX quantization. Setting a finer granularity in the config may help.
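
For example (a sketch; the module-name patterns are hypothetical and depend on your model):

    import copy
    import modelopt.torch.quantization as mtq

    config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
    config["quant_cfg"]["default"] = {"enable": False}   # disable anything not matched below
    # enable INT8 only for selected backbone stages, per-channel on weights
    config["quant_cfg"]["backbone.stage1*weight_quantizer"] = {"num_bits": 8, "axis": 0}
    config["quant_cfg"]["backbone.stage1*input_quantizer"] = {"num_bits": 8, "axis": None}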

korkland commented 1 month ago

Inserting Q/DQ nodes yourself requires more expertise but gives more flexibility, and it usually yields the best results.

Definitely. I'm still surprised, though, that the API provided by NVIDIA for TensorRT does not do basic fusion and model preparation.

Here, for example, is the exported PyTorch-API model before converting it to TensorRT (image): as can be seen, Q/DQ nodes are inserted between every layer. The ONNX path, on the other hand (image), has BN fused into Conv.

Moreover, when you look at the TensorRT graph for the PyTorch path (image), the module fusion is poor compared with the ONNX path (image). How can I achieve the same insertion heuristics as in the implicit quantization flow, just with more control over the module selection (say I want to quantize only the backbone, but using the IQ heuristics)?
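
One build-time knob on the implicit side is constraining per-layer precision with trtexec (flag names assumed from recent TensorRT releases; whether wildcards are accepted in --layerPrecisions depends on the version, and the layer-name patterns here are illustrative):

trtexec --onnx=orig.onnx --saveEngine=orig.trt --best --precisionConstraints=obey --layerPrecisions="*head_2d*:fp16,*interpret_2d*:fp16,*head_3d*:fp16"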

korkland commented 1 month ago

[shapeMachine.cpp::executeContinuation::905] Error Code 7: Internal Error (/interpret_2d/nms/strategy/Expand_1: ISliceLayer has out of bounds access on axis 0 Out of bounds access for slice. Instruction: CHECK_SLICE 287 0 300 1.)

The slice on axis 0 goes out of bounds (index 300); check your ONNX, or use Polygraphy to optimize it.

Can you please elaborate? This is my NMS implementation; how can we optimize it using Polygraphy? What is the recommended way to implement NMS if my target is TensorRT?

    @staticmethod
    def symbolic(g, boxes, scores, iou_threshold, max_count):
        assert type(iou_threshold) == float, "You have to pass iou_threshold as float type"
        assert type(max_count) == int, "You have to pass max_count as integer type"

        boxes = unsqueeze(g, boxes, 0)
        scores = unsqueeze(g, unsqueeze(g, scores, 0), 0)
        # this value is deducted to filter out zero values from padding
        epsilon_nms = 1e-5
        score_threshold = g.op('Constant', value_t=torch.tensor([0.0 - epsilon_nms], dtype=torch.float))

        iou_threshold = g.op("Constant", value_t=torch.tensor(iou_threshold))
        max_count = g.op('Constant', value_t=torch.tensor(max_count))

        nms_out = g.op('NonMaxSuppression', boxes, scores, max_count, iou_threshold, score_threshold)
        return squeeze(g, select(g, nms_out, 1, g.op('Constant', value_t=torch.tensor([2], dtype=torch.long))), 1)
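
For a TensorRT target, a common alternative in community exporters is to lower the custom NMS op directly to the EfficientNMS_TRT plugin instead of ONNX NonMaxSuppression. A rough sketch (attribute names taken from the EfficientNMS plugin fields and not verified here; note the plugin returns detections rather than indices, so the surrounding post-processing would have to change):

    @staticmethod
    def symbolic(g, boxes, scores, iou_threshold, max_count):
        # Emit TensorRT's EfficientNMS_TRT plugin node; outputs are
        # num_detections, detection_boxes, detection_scores, detection_classes.
        num_dets, det_boxes, det_scores, det_classes = g.op(
            'TRT::EfficientNMS_TRT',
            boxes,                        # [batch, num_boxes, 4]
            scores,                       # [batch, num_boxes, num_classes]
            iou_threshold_f=iou_threshold,
            score_threshold_f=0.0,
            max_output_boxes_i=max_count,
            background_class_i=-1,
            score_activation_i=0,
            box_coding_i=0,
            outputs=4,
        )
        return num_dets, det_boxes, det_scores, det_classes
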
ckolluru commented 1 week ago

I'm interested in learning best practices for quantizing CNN and image segmentation models as well.

In my case, I'm trying PyTorch explicit quantization using ModelOpt, but when I run mtq.quantize() it says "Inserted 0 quantizers". The final ONNX model graph also doesn't show any Q/DQ layers.

I'm also curious why the enable fields in mtq.INT8_DEFAULT_CFG are all False by default. Would we not want quantization enabled by default? I set them all to True, but that didn't change the result.

    {'quant_cfg': {'weight_quantizer': {'num_bits': 8, 'axis': 0},
                   'input_quantizer': {'num_bits': 8, 'axis': None},
                   'lm_head': {'enable': False},
                   'block_sparse_moe.gate': {'enable': False},
                   'router': {'enable': False},
                   'output_layer': {'enable': False},
                   'output.': {'enable': False},
                   'nn.BatchNorm1d': {'': {'enable': False}},
                   'nn.BatchNorm2d': {'': {'enable': False}},
                   'nn.BatchNorm3d': {'': {'enable': False}},
                   'nn.LeakyReLU': {'*': {'enable': False}},
                   'default': {'enable': False}},
     'algorithm': 'max'}
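
A quick way to check whether quantizers are being inserted at all is to quantize a tiny stand-in model and print the summary (a sketch; assumes mtq.print_quant_summary is available in your ModelOpt version):

    import copy
    import torch
    import torch.nn as nn
    import modelopt.torch.quantization as mtq

    # tiny stand-in model, just to verify that quantizer modules get inserted
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
    config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)

    def forward_loop(m):
        for _ in range(8):
            m(torch.randn(1, 3, 64, 64))

    model = mtq.quantize(model, config, forward_loop)
    mtq.print_quant_summary(model)   # should list weight/input quantizers for the Conv layer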