NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Tried to convert a mixed precision onnx model to TensorRT FP16 engine #4019

Open jinhonglu opened 3 months ago

jinhonglu commented 3 months ago

Description

I tried to convert a mixed precision ONNX model to a mixed precision TensorRT engine.

In my mixed precision ONNX model, I have kept some ops (ReduceSum, Pow) in fp32, along with some back-to-back Cast ops (for example, ReduceSum(fp32) -> output(fp32) -> Cast(fp32) -> Pow(fp32)).

In my build_engine.py, I set the following config: enable FP16, set OBEY_PRECISION_CONSTRAINTS, and force the corresponding layers to fp32:

```python
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    op_name = network.get_layer(i).name.split('/')[-1]
    if 'Pow' == op_name or 'ReduceSum' == op_name or 'Pow_1' == op_name:
        print(network.get_layer(i).name)
        # input('test')
        network.get_layer(i).precision = trt.DataType.FLOAT
        network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)
    if 'Pow_1_output_cast0' == op_name or 'ReduceSum_input_cast1' == op_name or 'Pow_output_cast0' == op_name \
            or 'Pow_1_input_cast0' == op_name or 'ReduceSum_input_cast0' == op_name or 'Pow_input_cast0' == op_name:
        print(network.get_layer(i).name)
        network.get_layer(i).precision = trt.DataType.FLOAT
```

The result of the TensorRT engine is quite different from the ONNX model.

Any idea how I can solve this?
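
For reference, a minimal sketch of how the built engine's per-layer precisions could be verified (this assumes the IEngineInspector API available in recent TensorRT releases; `dump_layer_precisions` is a hypothetical helper name, and detailed per-layer info only appears if the engine was built with detailed profiling verbosity):

```python
import json
import tensorrt as trt

def dump_layer_precisions(engine_path: str):
    """Print per-layer info from a serialized engine to check which layers stayed fp32 (sketch)."""
    logger = trt.Logger(trt.Logger.INFO)
    runtime = trt.Runtime(logger)
    with open(engine_path, 'rb') as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    inspector = engine.create_engine_inspector()
    info = json.loads(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
    for layer in info.get("Layers", []):
        # Layer entries are dicts only if the engine was built with
        # config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED;
        # otherwise they are plain layer-name strings.
        if isinstance(layer, dict):
            print(layer.get("Name"), layer.get("Outputs"))
        else:
            print(layer)
```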

Environment

TensorRT Version:

NVIDIA GPU: A100

NVIDIA Driver Version: 12.5

CUDA Version:12.5

CUDNN Version: 12.5

Operating System: Linux

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

lix19937 commented 3 months ago

What is the diff when you use all fp32?

jinhonglu commented 3 months ago

@lix19937

> What is the diff when you use all fp32?

It is quite strange that the result of the model (ONNX fp16 -> TensorRT fp32) is also totally different from the ONNX fp16 output.

What is wrong with my engine-building code? I have commented out all of the layer-precision settings during building.

```python
import os
import sys
import tensorrt as trt

# BUILD, MODEL_NAME, gpu_name and ONNX_MODEL are defined elsewhere in the script.

def build_engine():
    TRT_LOGGER = trt.Logger(trt.Logger.INFO)
    TRT_BUILDER = trt.Builder(TRT_LOGGER)

    for precision in BUILD:
        engine_filename = '_'.join([MODEL_NAME, gpu_name, precision]) + '.engine'
        if os.path.exists(engine_filename):
            print(f'Engine file {engine_filename} exists. Skip building...')
            continue

        print(f'Building {precision} engine of {MODEL_NAME} model on {gpu_name} GPU...')

        ## parse ONNX model
        network_creation_flag = 0
        if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
            network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        network = TRT_BUILDER.create_network(network_creation_flag)
        onnx_parser = trt.OnnxParser(network, TRT_LOGGER)
        parse_success = onnx_parser.parse_from_file(ONNX_MODEL)
        for idx in range(onnx_parser.num_errors):
            print(onnx_parser.get_error(idx))
        if not parse_success:
            sys.exit('ONNX model parsing failed')

        ## build TRT engine (configuration options at: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/BuilderConfig.html#ibuilderconfig)
        config = TRT_BUILDER.create_builder_config()

        # seq_len = network.get_input(0).shape[1]

        # handle dynamic shape (min/opt/max): https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes
        # by default the batch dim is set as 1 for all of min/opt/max. If batching is needed, change the value for opt and max accordingly
        profile = TRT_BUILDER.create_optimization_profile()
        profile.set_shape("input_ids", (1, 2, 1025, 690, 2), (1, 2, 1025, 690, 2), (1, 2, 1025, 690, 2))
        profile.set_shape("output", (1, 1, 2050, 690, 2), (1, 1, 2050, 690, 2), (1, 1, 2050, 690, 2))
        config.add_optimization_profile(profile)
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4096 * (1 << 20))  # 4096 MiB

        # precision
        if precision == 'fp32':
            config.clear_flag(trt.BuilderFlag.TF32)  # TF32 enabled by default, need to clear flag
        elif precision == 'tf32':
            pass
        elif precision == 'fp16':
            config.set_flag(trt.BuilderFlag.FP16)
        config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

        # for i in range(network.num_layers):
        #     op_name = network.get_layer(i).name.split('/')[-1]
        #     if 'Pow' == op_name or 'ReduceSum' == op_name or 'Pow_1' == op_name:
        #         print(network.get_layer(i).name)
        #         # input('test')
        #         network.get_layer(i).precision = trt.DataType.FLOAT
        #         network.get_layer(i).set_output_type(0, trt.DataType.FLOAT)
        #     if 'Pow_1_output_cast0' == op_name or 'ReduceSum_input_cast1' == op_name or 'Pow_output_cast0' == op_name \
        #             or 'Pow_1_input_cast0' == op_name or 'ReduceSum_input_cast0' == op_name or 'Pow_input_cast0' == op_name:
        #         print(network.get_layer(i).name)
        #         network.get_layer(i).precision = trt.DataType.FLOAT

        # build
        serialized_engine = TRT_BUILDER.build_serialized_network(network, config)

        ## save TRT engine
        with open(engine_filename, 'wb') as f:
            f.write(serialized_engine)
        print(f'Engine is saved to {engine_filename}')
```
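
As a quick sanity check after building, a minimal sketch of deserializing the saved engine and listing its I/O tensors (assuming a TensorRT version with the `num_io_tensors` / `get_tensor_name` APIs, 8.5 or newer; `inspect_engine_io` is a hypothetical helper name):

```python
import tensorrt as trt

def inspect_engine_io(engine_filename: str):
    """Deserialize a saved engine and print its input/output tensor specs (sketch)."""
    logger = trt.Logger(trt.Logger.INFO)
    runtime = trt.Runtime(logger)
    with open(engine_filename, 'rb') as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        mode = engine.get_tensor_mode(name)    # INPUT or OUTPUT
        dtype = engine.get_tensor_dtype(name)  # e.g. DataType.HALF vs DataType.FLOAT
        shape = engine.get_tensor_shape(name)
        print(f'{mode.name:6s} {name}: dtype={dtype}, shape={shape}')
```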

lix19937 commented 3 months ago

You can use follow to compare the diff between trt and ort.

polygraphy run your_onnx_name.onnx --trt --onnxrt   

BTW, if you use trtexec, you can upload the full log with follow cmd

trtexec --verbose --onnx=your_onnx_name.onnx    2>&1    |tee  build.log     

trtexec --verbose --onnx=your_onnx_name.onnx --fp16  2>&1 |tee  build_fp16.log    
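
For a scripted version of the same comparison, a minimal sketch using Polygraphy's Python API (the model path is a placeholder and the tolerances are arbitrary example values):

```python
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, TrtRunner)
from polygraphy.comparator import Comparator, CompareFunc

MODEL = "your_onnx_name.onnx"  # placeholder path

# Lazily build a TensorRT FP16 engine and an ONNX Runtime session for the same model.
build_engine = EngineFromNetwork(NetworkFromOnnxPath(MODEL), config=CreateConfig(fp16=True))
runners = [
    TrtRunner(build_engine),
    OnnxrtRunner(SessionFromOnnx(MODEL)),
]

# Run the same generated inputs through both backends and compare the outputs.
results = Comparator.run(runners)
success = bool(Comparator.compare_accuracy(
    results, compare_func=CompareFunc.simple(rtol=1e-3, atol=1e-3)))
print("Outputs match within tolerance:", success)
```
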
jinhonglu commented 3 months ago

@lix19937

> You can use the following to compare the diff between TRT and ORT.

I have run both fp32 and fp16: `polygraphy run fp16.onnx --trt --onnxrt (--fp16) --execution-providers=cuda`

The fp16 comparison is here:

[screenshot of polygraphy fp16 comparison]

The fp32 comparison is here:

[screenshot of polygraphy fp32 comparison]