NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

After Myelin optimization: 1 layer in TensorRT 8.6.13 when running my model on an NVIDIA Drive Orin X GPU #4157

Open XuDeshengCat opened 6 days ago

XuDeshengCat commented 6 days ago

Description

When I convert my ONNX model to a TensorRT engine, the entire network is collapsed into one huge layer. The original model is very large and complex, with 11,329 layers in the ONNX file, yet after Myelin optimization only a single layer remains. This prevents me from analyzing the performance bottleneck. I am sure there are structures that could be optimized further, such as the transformer blocks and the many LayerNorm layers.

I wonder if I can still speed up the model by writing plugins in this case?

Environment

TensorRT Version: 8.6.13

NVIDIA GPU: Drive Orin X

NVIDIA Driver Version: 12.3

CUDA Version: 12.3

CUDNN Version: 9

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

COMMAND="trtexec --onnx=model.onnx \
    --saveEngine=model.trt \
    --memPoolSize=workspace:32768 \
    --fp16 \
    --noTF32 \
    --disableMHA \
    --exportLayerInfo=model.json \
    --exportProfile=model.json \
    --useCudaGraph \
    --profilingVerbosity=detailed \
    --separateProfileRun \
    --dumpLayerInfo \
    --tacticSources=-CUBLAS,-CUBLAS_LT,-CUDNN,-JIT_CONVOLUTIONS,-EDGE_MASK_CONVOLUTIONS \
    --verbose"

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

[02/13/2022-13:26:43] [V] [TRT] Original: 11329 layers
[02/13/2022-13:26:43] [V] [TRT] After dead-layer removal: 11329 layers
[02/13/2022-13:26:44] [V] [TRT] Graph construction completed in 0.893051 seconds.
[02/13/2022-13:26:44] [V] [TRT] ---------- Layers Running on DLA ----------
[02/13/2022-13:26:44] [V] [TRT] No layer is running on DLA
[02/13/2022-13:26:44] [V] [TRT] ---------- Layers Running on GPU ----------
[02/13/2022-13:26:44] [V] [TRT] [GpuLayer] CONSTANT: /pnp_encoder/a2a_cross_attention_layers.1/self_attn/Slice_2_output_0[Constant]
[02/13/2022-13:26:44] [V] [TRT] [GpuLayer] CONSTANT: /Reshape_3_output_0[Constant]
.......
[02/13/2022-13:26:45] [V] [TRT] After Myelin optimization: 1 layers
[02/13/2022-13:26:45] [V] [TRT] Applying ScaleNodes fusions.
[02/13/2022-13:26:45] [V] [TRT] After scale fusion: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After dupe layer removal: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After final dead-layer removal: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After tensor merging: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After vertical fusions: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After dupe layer removal: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After final dead-layer removal: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After tensor merging: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After slice removal: 1 layers
[02/13/2022-13:26:45] [V] [TRT] After concat removal: 1 layers
[02/13/2022-13:26:45] [V] [TRT] Trying to split Reshape and strided tensor
[02/13/2022-13:26:45] [I] [TRT] Graph optimization time: 1.47146 seconds.
[02/13/2022-13:26:45] [V] [TRT] Building graph using backend strategy 2
[02/13/2022-13:26:45] [I] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[02/13/2022-13:26:45] [V] [TRT] Constructing optimization profile number 0 [1/1].
[02/13/2022-13:26:46] [V] [TRT] Applying generic optimizations to the graph for inference.
[02/13/2022-13:26:46] [V] [TRT] Reserving memory for host IO tensors. Host: 0 bytes
[02/13/2022-13:26:46] [V] [TRT] =============== Computing costs for {ForeignNode[/sample_path_head/sample_anchor_feat_extractor/Constant_16_output_0.../trajectory_head/Squeeze_2]}
[02/13/2022-13:26:46] [V] [TRT] Autotuning format combination: Float(51200,256,1), Float(4587520,28672,256,1), Float(40,2,1), Float(25600,256,1), Float(1200,3,1), Float(9,9,1), Float(3213,153,3,1), Float(100,1), Float(200,1), Float(1536,256,1), Float(2200,11,1) -> Int32(126,21,1), Half(6,1), Half(3,1), Int32(1), Half(120,2,1), Half(200,1,1), Half(24000,120,120,2,1), Half(1536,256,1), Half(120,2,1)
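
For reference, a minimal sketch for ranking the per-layer timings written by the --exportProfile flag in the command above. It assumes the usual trtexec profile layout (a JSON array with a leading count/summary record followed by one record per layer carrying "name" and "averageMs" fields); with the whole network fused into a single Myelin ForeignNode, it mainly shows how much of the total time that one entry accounts for:

import json

# Hedged sketch: rank layers from a trtexec --exportProfile dump by average latency.
# Assumes the usual layout of a leading summary record followed by per-layer
# records with "name" and "averageMs" keys; records without "averageMs" are skipped.
def top_layers(profile_path, n=20):
    with open(profile_path) as f:
        records = json.load(f)
    layers = [r for r in records if isinstance(r, dict) and "averageMs" in r]
    layers.sort(key=lambda r: r["averageMs"], reverse=True)
    for r in layers[:n]:
        print(f'{r["averageMs"]:8.3f} ms  {r["name"]}')

if __name__ == "__main__":
    top_layers("model.json")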

lix19937 commented 6 days ago

First, you should understand what Myelin is and why your network ends up in a Myelin region.

Then you need to analyze the per-layer/kernel latency inside the Myelin region. You can use nsys; the following bash script may help:


#!/bin/bash
# Usage:
#   ./profile.sh <input.engine> [tag]
# Captures an nsys trace of trtexec replaying the engine so the individual
# kernels inside the Myelin region are visible on the timeline.

# Strip the extension from the engine file name, e.g. model.engine -> model.
IFS=. read -r -a file <<< "${1}"
PREFIX=${file[0]}

TAG=""
MODE=infer
ROOT=log

DIR=${ROOT}/${PREFIX}/${MODE}

# Optional tag to distinguish multiple captures of the same engine.
if [[ ${2} != "" ]]
then
        TAG=${2}_
fi

mkdir -p ${DIR}

nsys profile \
        --output=${DIR}/${TAG}${PREFIX} \
        --force-overwrite true    \
        trtexec --loadEngine=${PREFIX}.engine \
                --warmUp=200    \
                --iterations=50

@XuDeshengCat

XuDeshengCat commented 6 days ago

I understand Myelin to be an optimization mechanism that TensorRT applies during the engine-build phase, and I have already looked at the nsys analysis of the optimized engine.

My question is: can I still speed up the model with custom (self-written) plugins when Myelin has already fused my entire network into a single layer?

lix19937 commented 6 days ago

Yes, you can.
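
As a rough illustration of what that can look like in practice (a sketch, not the thread author's workflow): one common route is to rewrite the ONNX graph so that each subgraph you want to hand-optimize, such as a LayerNorm block, becomes a single custom op whose name matches a TensorRT plugin you register; TensorRT then dispatches that node to the plugin instead of folding it into the Myelin ForeignNode. The snippet below uses onnx-graphsurgeon; the tensor names ln_input_0 / ln_output_0, the op name MyLayerNormPlugin, and the epsilon attribute are all hypothetical placeholders.

import onnx
import onnx_graphsurgeon as gs

# Hedged sketch: splice one custom node in place of a LayerNorm subgraph so a
# registered TensorRT plugin handles it. All names below are placeholders; a real
# script would pattern-match every LayerNorm occurrence and repeat this step.
graph = gs.import_onnx(onnx.load("model.onnx"))
tensors = graph.tensors()

inp = tensors["ln_input_0"]   # tensor entering the LayerNorm subgraph (hypothetical name)
out = tensors["ln_output_0"]  # tensor leaving the LayerNorm subgraph (hypothetical name)

# Disconnect the boundary tensors from the old subgraph (assumes ln_input_0 feeds only
# the subgraph being replaced), then wire a single custom node between them. The op
# name must match the name reported by the plugin creator so TensorRT can find it.
inp.outputs.clear()
out.inputs.clear()
graph.layer(op="MyLayerNormPlugin", name="ln_plugin_0",
            inputs=[inp], outputs=[out], attrs={"epsilon": 1e-5})

# Remove the now-unreachable original LayerNorm nodes and re-sort the graph.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_with_plugin.onnx")

The plugin itself (for example an IPluginV2DynamicExt implementation) still has to be written, built into a shared library, and loaded before the engine build, e.g. via trtexec's plugin-loading option on TensorRT 8.x (--plugins=libmy_layernorm.so). Whether this actually beats the Myelin-generated kernels has to be measured case by case.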