CNN model opt int8 best practice example

korkland commented 1 month ago

Hi, can you share best practices for quantization for CNN models? Are the modelopt quantized PTQ is the way to go with tensorrt for cnn models (resnet retinanet etc)? I was able to quantize retinanet backbone to int8 but the lack of examples and practices makes me wonder if that is the way to go..

Thanks

riyadshairi979 commented 1 month ago

See the example of how to quantize CNN/ViTs using modelopt and deploy/evaluate with TensorRT. This is the recommended practice but note that TensorRT's implicit quantization may provide better performance for certain models. Please create an issue with reproducible instructions (model, command etc.) if thats the case.

korkland commented 1 month ago

I must say that I'm confused by the options that NVIDIA provides for quantization. We are targeting the Orin architecture and have our own CNN model based on RetinaNet. With the previous vendor, it was very clear: they had one tool. You would take your PyTorch model, convert it to ONNX, and use their tool for quantization, provide it config with nodes you want to quantize, calibration data etc..

With NVIDIA, there are too many options and we didn't found an option the satisfied our needs.

There is an implicit quantization ,which btw is deprecated from TRT 10, so i think we shouldn't go with this direction. I've tried it and it doesn't work out of the box - I'm getting this error maybe someone could help:

trtexec --onnx=orig.onnx --saveEngine=orig.trt --best

[shapeMachine.cpp::executeContinuation::905] Error Code 7: Internal Error (/interpret_2d/nms/strategy/Expand_1: ISliceLayer has out of bounds access on axis 0 Out of bounds access for slice. Instruction: CHECK_SLICE 287 0 300 1.)

is there an option to exclude the whole interpret_2d in implicit?

And explicit quantization:

PyTorch quantization, which we tried to use to quantize only the backbone, but the performance was worse than FP16. However, it supports a nice number of operations and nodes, so it a good candidate if the performance was satisfied.

code example of how we used it

import modelopt.torch.quantization as mtq
from torch.utils.data import DataLoader
from tqdm import tqdm
import copy

# The quantization algorithm requires calibration data. Below we show a rough example of how to
# set up a calibration data loader with the desired calib_size
data_loader = DataLoader(dataset, batch_size=1, shuffle=False, collate_fn=lambda x: x[0][0])

# Quantize the model and perform calibration (PTQ)
# CNN networks only supports INT8_DEFAULT_CFG
config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
config["quant_cfg"]["*head_2d*"] = {"enable": False}
config["quant_cfg"]["*interpret_2d*"] = {"enable": False}
config["quant_cfg"]["*head_3d*"] = {"enable": False}
config["quant_cfg"]["*output_quantizer"] = {"enable": False}
quantized_model = mtq.quantize(model, config, lambda model: [model(x) for x in tqdm(data_loader)])

# Print quantization summary after successfully quantizing the model with mtq.quantize
# This will show the quantizers inserted in the model and their configurations
mtq.print_quant_summary(quantized_model)

# Export to ONNX
input_keys, output_keys, const_folding, opset_vers = get_onnx_export_args(quantized_model, inputs_converted,
                                                                          network_name, module_name)

opset_version = 17
empty_kwargs = dict()
args_for_onnx = tuple([input, empty_kwargs])

torch.onnx.export(quantized_model, args_for_onnx, path,
                  do_constant_folding=const_folding,
                  opset_version=opset_version,
                  input_names=input_keys,
                  output_names=output_keys,
                  verbose=args.nsp_profile)
# check onnx model
onnx_model = onnx.load_model(path)
onnx_input_names = get_onnx_model_names(onnx_model.graph.input)
onnx_output_names = get_onnx_model_names(onnx_model.graph.output)
success = check_onnx_model(input_keys, output_keys, onnx_input_names, onnx_output_names)

ONNX API, which is pretty easy to use, but from digging into the code, it is basically 99% onnxruntime implementation, so I'm not sure what the advantages are here. However, the problem is with the small number of supported ops and nodes: ['Add', 'AveragePool', 'BatchNormalization', 'Clip', 'Conv', 'ConvTranspose', 'Gemm', 'GlobalAveragePool', 'MatMul', 'MaxPool', 'Mul']. There is no support for regular expressions, so I didn't figure out how to tell it to quantize only the backbone, for example. At the end, there was an improvement of around 10% in runtime which is disappointing.

Is there an option to manually add quantizers/dequantizers to ONNX quantization?

code example of how we used it

import modelopt.onnx.quantization.int8 as moq8
import modelopt.onnx.quantization.quantize as moq
from torch.utils.data import DataLoader
from onnxruntime.quantization.calibrate import CalibrationDataReader
from onnxruntime.quantization.shape_inference import quant_pre_process

input_keys, output_keys, const_folding, opset_vers = get_onnx_export_args(model, inputs_converted,
                                                                          network_name, module_name)
input = inputs_converted[0] if len(inputs_converted) == 1 else inputs_converted
opset_version = 17
empty_kwargs = dict()
args_for_onnx = tuple([input, empty_kwargs])

onnx_orig_path = path.replace(".onnx", "_orig.onnx")
torch.onnx.export(model, args_for_onnx, onnx_orig_path,
                  do_constant_folding=const_folding,
                  opset_version=opset_version,
                  input_names=input_keys,
                  output_names=output_keys,
                  verbose=args.nsp_profile)

# check onnx model
onnx_model = onnx.load_model(onnx_orig_path)
onnx_input_names = get_onnx_model_names(onnx_model.graph.input)
onnx_output_names = get_onnx_model_names(onnx_model.graph.output)
success = check_onnx_model(input_keys, output_keys, onnx_input_names, onnx_output_names)
if not success:
    print("ONNX checker Failed !.")
    exit(1)

# The quantization algorithm requires calibration data. Below we show a rough example of how to
# set up a calibration data loader with the desired calib_size
class OnnxCalibrationDataReader(CalibrationDataReader):
    def __init__(self, model, args, network_name, batch_size=1, shuffle=False, collate_fn=lambda x: x[0][0]):
        self.data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)
        self.data_iter = iter(self.data_loader)
    def get_next(self) -> dict:
        try:
            data = next(self.data_iter)
        except StopIteration:
            return None  # Indicates the end of the dataset

        # Convert the dictionary of tensors to a dictionary of numpy arrays
        data_numpy = {key: value.cpu().detach().numpy() for key, value in data.items()}
        return data_numpy

calib_reader = OnnxCalibrationDataReader(model, args, network_name)

# model preprocess
onnx_preprocessed_path = path.replace(".onnx", "_preprocessed.onnx")
quant_pre_process(onnx_orig_path, onnx_preprocessed_path, verbose=True, auto_merge=True)

moq8.quantize(onnx_path=onnx_preprocessed_path, output_path=path, calibration_data_reader=calib_reader, verbose=True)

btw, when using quant_pre_process the engine generation failed on the following, incase some could help:

[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:828: While parsing node number 177 [ScatterND -> "/interpret_2d/nms/strategy/ScatterND_output_0"]:
[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:831: --- Begin node ---
input: "/interpret_2d/nms/strategy/Constant_17_output_0"
input: "/interpret_2d/nms/strategy/Constant_19_output_0"
input: "/interpret_2d/nms/strategy/Reshape_3_output_0"
output: "/interpret_2d/nms/strategy/ScatterND_output_0"
name: "/interpret_2d/nms/strategy/ScatterND"
op_type: "ScatterND"
attribute {
  name: "reduction"
  s: "none"
  type: STRING
}

[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:832: --- End node ---
[07/24/2024-09:02:34] [E] [TRT] ModelImporter.cpp:836: ERROR: onnxOpImporters.cpp:5119 In function importScatterND:
[9] Assertion failed: !attrs.count("reduction"): Attribute reduction is not supported.
[07/24/2024-09:02:34] [E] Failed to parse onnx file

Thanks

riyadshairi979 commented 1 month ago

trtexec --onnx=orig.onnx --saveEngine=orig.trt --best

If the original model doesn't compile with trtexec, you will need to fix the ONNX model first before quantizing it. You can file an issue here with reproducible model and commands.

Is this Retinanet similar to yours? It compiles to TensorRT 10 successfully. Quantize the model using modelopt and compile using trtexec: $ python -m modelopt.onnx.quantization --onnx_path=retinanet-9.onnx --quantize_mode=int8 $ trtexec --onnx=retinanet-9.quant.onnx --saveEngine=retinanet-9.quant.engine --best We observe 1.7x latency reduction of retinanet-9.quant.engine compared to fp16 TensorRT engine.

NVIDIA / TensorRT-Model-Optimizer

CNN model opt int8 best practice example #46