NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Inquiry about Layer Performance of FP16 #3876

Open minhhotboy9x opened 1 month ago

minhhotboy9x commented 1 month ago

Description

Hi, I'm new to TensorRT and I'm trying to understand layer performance. I read the doc Optimizing for Tensor Cores, which says that with FP16 precision, tensor dimensions should be multiples of 8 or 16. So I converted an ONNX model to a TensorRT engine and then printed the layer information. Here is a part of it:

...
{
  "Name": "/model.2/cv1/conv/Conv + /model.2/cv1/act/Relu",
  "LayerType": "CaskConvolution",
  "Inputs": [
  {
    "Name": "/model.1/act/Relu_output_0",
    "Location": "Device",
    "Dimensions": [1,50,160,160],
    "Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
  }],
  "Outputs": [
  {
    "Name": "Reformatted Output Tensor 0 to /model.2/cv1/conv/Conv + /model.2/cv1/act/Relu",
    "Location": "Device",
    "Dimensions": [1,25,160,160],
    "Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
  }],
  "ParameterType": "Convolution",
  "Kernel": [1,1],
  "PaddingMode": "kEXPLICIT_ROUND_DOWN",
  "PrePadding": [0,0],
  "PostPadding": [0,0],
  "Stride": [1,1],
  "Dilation": [1,1],
  "OutMaps": 25,
  "Groups": 1,
  "Weights": {"Type": "Half", "Count": 1250},
  "Bias": {"Type": "Half", "Count": 25},
  "HasSparseWeights": 0,
  "HasDynamicFilter": 0,
  "HasDynamicBias": 0,
  "HasResidual": 0,
  "ConvXAsActInputIdx": -1,
  "BiasAsActInputIdx": -1,
  "ResAsActInputIdx": -1,
  "Activation": "RELU",
  "HasBias": 1,
  "HasReLU": 1,
  "TacticName": "sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x32_stage4_warpsize4x1x1_g1_tensor16x8x16_t1r1s1",
  "TacticValue": "0xb4ed47991b2d81ae",
  "StreamId": 0,
  "Metadata": "[ONNX Layer: /model.2/cv1/conv/Conv]\u001e[ONNX Layer: /model.2/cv1/act/Relu]"
},{
  "Name": "Reformatting CopyNode for Output Tensor 0 to /model.2/cv1/conv/Conv + /model.2/cv1/act/Relu",
  "LayerType": "Reformat",
  "Inputs": [
  {
    "Name": "Reformatted Output Tensor 0 to /model.2/cv1/conv/Conv + /model.2/cv1/act/Relu",
    "Location": "Device",
    "Dimensions": [1,25,160,160],
    "Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
  }],
  "Outputs": [
  {
    "Name": "/model.2/cv1/act/Relu_output_0",
    "Location": "Device",
    "Dimensions": [1,25,160,160],
    "Format/Datatype": "Channel major FP16 format where channel % 2 == 0"
  }],
  "ParameterType": "Reformat",
  "Origin": "REFORMAT",
  "TacticValue": "0x00000000000003ea",
  "StreamId": 0,
  "Metadata": ""
},{
  "Name": "/model.2/m.0/cv1/conv/Conv + /model.2/m.0/cv1/act/Relu",
  "LayerType": "CaskConvolution",
  "Inputs": [
  {
    "Name": "/model.2/cv1/act/Relu_output_0",
    "Location": "Device",
    "Dimensions": [1,25,160,160],
    "Format/Datatype": "Channel major FP16 format where channel % 2 == 0"
  }],
  "Outputs": [
  {
    "Name": "/model.2/m.0/cv1/act/Relu_output_0",
    "Location": "Device",
    "Dimensions": [1,25,160,160],
    "Format/Datatype": "Channel major FP16 format where channel % 2 == 0"
  }],
  "ParameterType": "Convolution",
  "Kernel": [3,3],
  "PaddingMode": "kEXPLICIT_ROUND_DOWN",
  "PrePadding": [1,1],
  "PostPadding": [1,1],
  "Stride": [1,1],
  "Dilation": [1,1],
  "OutMaps": 25,
  "Groups": 1,
  "Weights": {"Type": "Half", "Count": 5625},
  "Bias": {"Type": "Half", "Count": 25},
  "HasSparseWeights": 0,
  "HasDynamicFilter": 0,
  "HasDynamicBias": 0,
  "HasResidual": 0,
  "ConvXAsActInputIdx": -1,
  "BiasAsActInputIdx": -1,
  "ResAsActInputIdx": -1,
  "Activation": "RELU",
  "HasBias": 1,
  "HasReLU": 1,
  "TacticName": "sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x32x64_stage1_warpsize4x1x1_g1_tensor16x8x16_aligna4_alignc4",
  "TacticValue": "0xa1c540a5038e4190",
  "StreamId": 0,
  "Metadata": "[ONNX Layer: /model.2/m.0/cv1/conv/Conv]\u001e[ONNX Layer: /model.2/m.0/cv1/act/Relu]"
}
...

I see the descriptions "Format/Datatype": "Channel major FP16 format where channel % 8 == 0" and "Format/Datatype": "Channel major FP16 format where channel % 2 == 0". I don't know what these mean, because my channel count is not divisible by 8 ("Dimensions": [1,25,160,160]). Is my model optimized?

Sorry for my bad English.
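For reference, per-layer JSON like the excerpt above can be dumped with TensorRT's engine inspector. A minimal sketch, assuming the engine was built with ProfilingVerbosity.DETAILED and that "model.engine" is a placeholder path:

import tensorrt as trt

logger = trt.Logger(trt.Logger.ERROR)
runtime = trt.Runtime(logger)

# Deserialize a previously built engine ("model.engine" is a placeholder).
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Per-layer details are only reported when the engine was built with
# ProfilingVerbosity.DETAILED (the conversion script later in this thread sets it).
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))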

Environment

TensorRT Version:

NVIDIA GPU:

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

lix19937 commented 1 month ago

Could you upload the ONNX?

zerollzeng commented 1 month ago

> I see the descriptions "Format/Datatype": "Channel major FP16 format where channel % 8 == 0" and "Format/Datatype": "Channel major FP16 format where channel % 2 == 0". I don't know what these mean, because my channel count is not divisible by 8 ("Dimensions": [1,25,160,160]). Is my model optimized?

It's a vectorized format, and TensorRT will pad the tensor to the target format. You can refer to https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#data-format-desc
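Concretely, the padding arithmetic behind those format strings can be illustrated with a small sketch (plain Python, not TensorRT API):

import math

def padded_channels(channels: int, vector_width: int) -> int:
    # Round the channel count up to the next multiple of the vector width.
    return math.ceil(channels / vector_width) * vector_width

print(padded_channels(25, 8))  # 32: "channel % 8 == 0" pads 25 channels to 32
print(padded_channels(25, 2))  # 26: "channel % 2 == 0" pads 25 channels to 26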

minhhotboy9x commented 1 month ago

@zerollzeng Oh, I see. However, the data format of each layer is chosen automatically for the best performance, right? When I build the engine on my Jetson Nano, the layers are converted to the datatype "Two wide channel vectorized row major FP16 format" (CHW2).
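Note that the builder times candidate kernels and picks per-layer formats automatically; only network I/O tensor formats can be constrained explicitly. A minimal sketch of such a constraint, assuming a network built as in the conversion script below (force_input_chw2 is a hypothetical helper):

import tensorrt as trt

def force_input_chw2(network: trt.INetworkDefinition) -> None:
    # Only network I/O tensors accept format constraints; internal layer
    # formats are still chosen by the builder's tactic selection.
    inp = network.get_input(0)
    inp.dtype = trt.float16                                 # I/O tensor type
    inp.allowed_formats = 1 << int(trt.TensorFormat.CHW2)   # bitmask of allowed formats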

minhhotboy9x commented 1 month ago

@lix19937 Here is my ONNX: v8s_pruned. This ONNX was exported from Ultralytics, so it has metadata embedded. I use the Python script below to convert it:

import argparse
import os
import json
import tensorrt as trt
from datetime import datetime
import onnx
import calibration

TRT_LOGGER = trt.Logger()

def parse_args():
    parser = argparse.ArgumentParser(description='Convert ONNX models to TensorRT')

    # Sample image
    parser.add_argument('--batch_size', type=int, help='data batch size',
        default=1)
    parser.add_argument('--img_size', help='input size',
        default=[3, 640, 640])

    # Model path
    parser.add_argument('--onnx_model_path',  help='onnx model path',
        default='./onnx_model.onnx')
    parser.add_argument('--tensorrt_engine_path',  help='tensorrt engine path',
        default='./yolov5s_640_384_pfg_dynamic_max_batchsize_8_FP16.engine')

    # TensorRT engine params
    parser.add_argument('--dynamic_axes', help='dynamic batch input or output',
        default='True')
    parser.add_argument('--engine_precision', help='precision of TensorRT engine', choices=['FP32', 'FP16', 'INT8'], 
        default='FP16')
    parser.add_argument('--min_engine_batch_size', type=int, help='set the min input data size of model for inference', 
        default=1)
    parser.add_argument('--opt_engine_batch_size', type=int, help='set the most used input data size of model for inference', 
        default=1)
    parser.add_argument('--max_engine_batch_size', type=int, help='set the max input data size of model for inference', 
        default=1)
    parser.add_argument('--engine_workspace', type=int, help='workspace of engine', 
        default=4)
    # Optional argument for INT8 precision
    parser.add_argument('--data_calib', type=str, help='img data directory for int8 calibration', default='datasets/VOC/images/val2007')

    args = string_to_bool(parser.parse_args())

    if args.engine_precision == 'INT8' and args.data_calib is None:
        parser.error("--data_calib is required when --engine_precision is set to INT8")

    return args

def extract_metadata(onnx_model_path):
    # Load ONNX model
    model_onnx = onnx.load(onnx_model_path)

    # Extract metadata
    metadata = {}
    for prop in model_onnx.metadata_props:
        metadata[prop.key] = prop.value
    return metadata

def string_to_bool(args):
    # Parse the --dynamic_axes string flag into a real boolean.
    args.dynamic_axes = args.dynamic_axes.lower() == 'true'
    return args

def build_engine(onnx_model_path, tensorrt_engine_path, engine_precision, dynamic_axes, \
    img_size, batch_size, min_engine_batch_size, opt_engine_batch_size, max_engine_batch_size,\
        engine_workspace, data_calib):
    metadata = extract_metadata(onnx_model_path)
    print(metadata)
    # Builder
    logger = trt.Logger(trt.Logger.ERROR)
    builder = trt.Builder(logger)
    network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

    if engine_precision == "INT8":
        print('PTQ enabled!')
        # NOTE: EXPLICIT_PRECISION is deprecated in recent TensorRT releases.
        network_flags = network_flags | (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_PRECISION))

    network = builder.create_network(network_flags)

    profile = builder.create_optimization_profile()

    config = builder.create_builder_config()

    # Cap the builder workspace pool (assuming --engine_workspace is in GiB).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, engine_workspace << 30)

    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

    # Set FP16 
    if engine_precision == 'FP16':
        config.set_flag(trt.BuilderFlag.FP16)
    elif engine_precision == 'INT8':
        config.set_flag(trt.BuilderFlag.INT8)
        config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
        calib_loader = calibration.DataLoader(batch_size, 128, data_calib, 640, 640)
        config.int8_calibrator = calibration.Calibrator(calib_loader, data_calib + '.cache')

    # Onnx parser
    parser = trt.OnnxParser(network, logger)

    if not os.path.exists(onnx_model_path):
        print("Failed finding ONNX file!")
        exit()
    print("Succeeded finding ONNX file!")
    with open(onnx_model_path, "rb") as model:
        if not parser.parse(model.read()):
            print("Failed parsing .onnx file!")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            exit()
        print("Succeeded parsing .onnx file!")

    # Input
    inputTensor = network.get_input(0) 
    # Dynamic batch (min, opt, max)
    print('inputTensor.name:', inputTensor.name)
    if dynamic_axes:
        profile.set_shape(inputTensor.name, (min_engine_batch_size, img_size[0], img_size[1], img_size[2]), \
            (opt_engine_batch_size, img_size[0], img_size[1], img_size[2]), \
            (max_engine_batch_size, img_size[0], img_size[1], img_size[2]))
        print('Set dynamic')
    else:
        profile.set_shape(inputTensor.name, (batch_size, img_size[0], img_size[1], img_size[2]), \
            (batch_size, img_size[0], img_size[1], img_size[2]), \
            (batch_size, img_size[0], img_size[1], img_size[2]))
    config.add_optimization_profile(profile)
    #network.unmark_output(network.get_output(0))

    # Write engine
    engineString = builder.build_serialized_network(network, config)
    if engineString is None:
        print("Failed building engine!")
        exit()
    print("Succeeded building engine!")

    # Convert the metadata dictionary to JSON and encode it
    metaString = json.dumps(metadata).encode('utf-8')

    # Save the engine together with its metadata to a file
    with open(tensorrt_engine_path, "wb") as f:
        # Write the length of the metadata
        f.write(len(metaString).to_bytes(4, byteorder='little'))
        # Write the metadata
        f.write(metaString)
        # Write the engine
        f.write(engineString)

def main():
    args = parse_args()    
    # Build TensorRT engine
    build_engine(args.onnx_model_path, args.tensorrt_engine_path, args.engine_precision, args.dynamic_axes, \
        args.img_size, args.batch_size, args.min_engine_batch_size, args.opt_engine_batch_size, \
        args.max_engine_batch_size, args.engine_workspace, args.data_calib)

if __name__ == '__main__': 
    main()
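A hypothetical invocation of the script above (the file name onnx_to_trt.py is assumed), plus a matching loader for the length-prefixed metadata + engine format it writes:

# python onnx_to_trt.py --onnx_model_path v8s_pruned.onnx \
#     --tensorrt_engine_path v8s_pruned_fp16.engine --engine_precision FP16

import json
import tensorrt as trt

def load_engine_with_metadata(path):
    # Mirrors the writer above: a 4-byte little-endian metadata length,
    # then the JSON metadata, then the serialized engine.
    with open(path, "rb") as f:
        meta_len = int.from_bytes(f.read(4), byteorder='little')
        metadata = json.loads(f.read(meta_len))
        engine_bytes = f.read()
    runtime = trt.Runtime(trt.Logger(trt.Logger.ERROR))
    return metadata, runtime.deserialize_cuda_engine(engine_bytes)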