Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64

python export.py --mode end2end --weights yolov8n.pt --batch 1 --img-size 640 640 --opset 11 --onnx yolov8n.onnx --cfg yolov8n.yaml --workspace 10 --fp16 --conf-thres 0.2 --nms-thres 0.45 --topk 300 --keep 100
Fusing layers...
YOLOv8n summary: 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
Ultralytics YOLOv8.0.3 🚀 Python-3.9.12 torch-1.12.1+cu113 CPU
half=True only compatible with GPU or CoreML export, i.e. use device=0 or format=coreml
Fusing layers...
YOLOv8n summary: 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs

PyTorch: starting from yolov8n.pt with output shape (1, 84, 8400) (6.2 MB)

ONNX: starting export with onnx 1.15.0...
ONNX: export success ✅ 2.3s, saved as ./yolo_export/yolov8n/yolov8n.onnx (12.2 MB)

Export complete (2.5s)
Results saved to /project/trt/yolov8_trt
Predict:         yolo task=detect mode=predict model=./yolo_export/yolov8n/yolov8n.onnx -WARNING ⚠️ not yet supported for YOLOv8 exported models
Validate:        yolo task=detect mode=val model=./yolo_export/yolov8n/yolov8n.onnx -WARNING ⚠️ not yet supported for YOLOv8 exported models
Visualize:       https://netron.app
2023-12-07 10:38:19.932 | INFO     | utils.export_utils:info:40 - origin input size: [1, 3, 640, 640]
2023-12-07 10:38:19.932 | INFO     | utils.export_utils:info:41 - origin output size: [1, 84, 8400]
2023-12-07 10:38:19.933 | INFO     | utils.export_utils:add_postprocess:57 - add transpose layer: [1, 84, 8400] ---→ [1, 8400, 84]
2023-12-07 10:38:19.933 | INFO     | utils.export_utils:add_without_obj_conf:68 - add layers without obj conf
2023-12-07 10:38:19.937 | INFO     | utils.export_utils:add_nms:241 - start add nms layers
2023-12-07 10:38:19.965 | INFO     | utils.export_utils:add_nms:265 - onnx file saved to yolo_export/yolov8n/yolov8n_end2end.onnx
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # trtexec --onnx=yolo_export/yolov8n/yolov8n_end2end.onnx --saveEngine=yolo_export/yolov8n/yolov8n_end2end.engine --workspace=10240 --fp16
[12/07/2023-10:38:20] [I] === Model Options ===
[12/07/2023-10:38:20] [I] Format: ONNX
[12/07/2023-10:38:20] [I] Model: yolo_export/yolov8n/yolov8n_end2end.onnx
[12/07/2023-10:38:20] [I] Output:
[12/07/2023-10:38:20] [I] === Build Options ===
[12/07/2023-10:38:20] [I] Max batch: explicit batch
[12/07/2023-10:38:20] [I] Workspace: 10240 MiB
[12/07/2023-10:38:20] [I] minTiming: 1
[12/07/2023-10:38:20] [I] avgTiming: 8
[12/07/2023-10:38:20] [I] Precision: FP32+FP16
[12/07/2023-10:38:20] [I] Calibration:
[12/07/2023-10:38:20] [I] Refit: Disabled
[12/07/2023-10:38:20] [I] Sparsity: Disabled
[12/07/2023-10:38:20] [I] Safe mode: Disabled
[12/07/2023-10:38:20] [I] DirectIO mode: Disabled
[12/07/2023-10:38:20] [I] Restricted mode: Disabled
[12/07/2023-10:38:20] [I] Save engine: yolo_export/yolov8n/yolov8n_end2end.engine
[12/07/2023-10:38:20] [I] Load engine:
[12/07/2023-10:38:20] [I] Profiling verbosity: 0
[12/07/2023-10:38:20] [I] Tactic sources: Using default tactic sources
[12/07/2023-10:38:20] [I] timingCacheMode: local
[12/07/2023-10:38:20] [I] timingCacheFile:
[12/07/2023-10:38:20] [I] Input(s)s format: fp32:CHW
[12/07/2023-10:38:20] [I] Output(s)s format: fp32:CHW
[12/07/2023-10:38:20] [I] Input build shapes: model
[12/07/2023-10:38:20] [I] Input calibration shapes: model
[12/07/2023-10:38:20] [I] === System Options ===
[12/07/2023-10:38:20] [I] Device: 0
[12/07/2023-10:38:20] [I] DLACore:
[12/07/2023-10:38:20] [I] Plugins:
[12/07/2023-10:38:20] [I] === Inference Options ===
[12/07/2023-10:38:20] [I] Batch: Explicit
[12/07/2023-10:38:20] [I] Input inference shapes: model
[12/07/2023-10:38:20] [I] Iterations: 10
[12/07/2023-10:38:20] [I] Duration: 3s (+ 200ms warm up)
[12/07/2023-10:38:20] [I] Sleep time: 0ms
[12/07/2023-10:38:20] [I] Idle time: 0ms
[12/07/2023-10:38:20] [I] Streams: 1
[12/07/2023-10:38:20] [I] ExposeDMA: Disabled
[12/07/2023-10:38:20] [I] Data transfers: Enabled
[12/07/2023-10:38:20] [I] Spin-wait: Disabled
[12/07/2023-10:38:20] [I] Multithreading: Disabled
[12/07/2023-10:38:20] [I] CUDA Graph: Disabled
[12/07/2023-10:38:20] [I] Separate profiling: Disabled
[12/07/2023-10:38:20] [I] Time Deserialize: Disabled
[12/07/2023-10:38:20] [I] Time Refit: Disabled
[12/07/2023-10:38:20] [I] Skip inference: Disabled
[12/07/2023-10:38:20] [I] Inputs:
[12/07/2023-10:38:20] [I] === Reporting Options ===
[12/07/2023-10:38:20] [I] Verbose: Disabled
[12/07/2023-10:38:20] [I] Averages: 10 inferences
[12/07/2023-10:38:20] [I] Percentile: 99
[12/07/2023-10:38:20] [I] Dump refittable layers:Disabled
[12/07/2023-10:38:20] [I] Dump output: Disabled
[12/07/2023-10:38:20] [I] Profile: Disabled
[12/07/2023-10:38:20] [I] Export timing to JSON file:
[12/07/2023-10:38:20] [I] Export output to JSON file:
[12/07/2023-10:38:20] [I] Export profile to JSON file:
[12/07/2023-10:38:20] [I]
[12/07/2023-10:38:20] [I] === Device Information ===
[12/07/2023-10:38:20] [I] Selected Device: Tesla T4
[12/07/2023-10:38:20] [I] Compute Capability: 7.5
[12/07/2023-10:38:20] [I] SMs: 40
[12/07/2023-10:38:20] [I] Compute Clock Rate: 1.59 GHz
[12/07/2023-10:38:20] [I] Device Global Memory: 14971 MiB
[12/07/2023-10:38:20] [I] Shared Memory per SM: 64 KiB
[12/07/2023-10:38:20] [I] Memory Bus Width: 256 bits (ECC enabled)
[12/07/2023-10:38:20] [I] Memory Clock Rate: 5.001 GHz
[12/07/2023-10:38:20] [I]
[12/07/2023-10:38:20] [I] TensorRT version: 8.2.4
[12/07/2023-10:38:20] [I] [TRT] [MemUsageChange] Init CUDA: CPU +321, GPU +0, now: CPU 333, GPU 1244 (MiB)
[12/07/2023-10:38:21] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 333 MiB, GPU 1244 MiB
[12/07/2023-10:38:21] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 468 MiB, GPU 1278 MiB
[12/07/2023-10:38:21] [I] Start parsing network model
[12/07/2023-10:38:21] [I] [TRT] ----------------------------------------------------------------
[12/07/2023-10:38:21] [I] [TRT] Input filename: yolo_export/yolov8n/yolov8n_end2end.onnx
[12/07/2023-10:38:21] [I] [TRT] ONNX IR version: 0.0.9
[12/07/2023-10:38:21] [I] [TRT] Opset version: 11
[12/07/2023-10:38:21] [I] [TRT] Producer name: pytorch
[12/07/2023-10:38:21] [I] [TRT] Producer version: 1.12.1
[12/07/2023-10:38:21] [I] [TRT] Domain:
[12/07/2023-10:38:21] [I] [TRT] Model version: 0
[12/07/2023-10:38:21] [I] [TRT] Doc string:
[12/07/2023-10:38:21] [I] [TRT] ----------------------------------------------------------------
[12/07/2023-10:38:21] [W] [TRT] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/07/2023-10:38:21] [E] [TRT] ModelImporter.cpp:773: While parsing node number 124 [Resize -> "onnx::Concat_259"]:
[12/07/2023-10:38:21] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[12/07/2023-10:38:21] [E] [TRT] ModelImporter.cpp:775: input: "onnx::Resize_254"
input: "onnx::Resize_258"
input: "onnx::Resize_420"
output: "onnx::Concat_259"
name: "Resize_120"
op_type: "Resize"
attribute {
  name: "coordinate_transformation_mode"
  s: "asymmetric"
  type: STRING
}
attribute {
  name: "cubic_coeff_a"
  f: -0.75
  type: FLOAT
}
attribute {
  name: "mode"
  s: "nearest"
  type: STRING
}
attribute {
  name: "nearest_mode"
  s: "floor"
  type: STRING
}

[12/07/2023-10:38:21] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[12/07/2023-10:38:21] [E] [TRT] ModelImporter.cpp:779: ERROR: builtin_op_importers.cpp:3608 In function importResize:
[8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"
[12/07/2023-10:38:21] [E] Failed to parse onnx file
[12/07/2023-10:38:21] [I] Finish parsing network model
[12/07/2023-10:38:21] [E] Parsing model failed
[12/07/2023-10:38:21] [E] Failed to create engine from model.
[12/07/2023-10:38:21] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8204] # trtexec --onnx=yolo_export/yolov8n/yolov8n_end2end.onnx --saveEngine=yolo_export/yolov8n/yolov8n_end2end.engine --workspace=10240 --fp16
2023-12-07 10:38:21.586 | ERROR    | __main__:end2end:211 - Convert to engine file failed.
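
The assertion in this log says the parser wants the scales input of node Resize_120 to be an initializer. A minimal sketch for listing such Resize inputs in an exported model follows. It assumes the third-party onnx package is installed; the function names are hypothetical and not part of export.py. Input 0 (the data tensor) is skipped, and the remaining inputs are treated as roi, scales and sizes per the opset 11 layout.

```python
def find_non_initializer_resize_inputs(graph):
    """Return (resize_node_name, input_name, producer_op) for every Resize
    input after the data tensor that is not stored as a graph initializer."""
    initializer_names = {init.name for init in graph.initializer}
    # Map each tensor name to the op type of the node that produces it.
    producer = {out: node.op_type for node in graph.node for out in node.output}
    findings = []
    for node in graph.node:
        if node.op_type != "Resize":
            continue
        # Input 0 is the data tensor; empty names mark omitted optional inputs.
        for name in list(node.input)[1:]:
            if name and name not in initializer_names:
                findings.append((node.name, name, producer.get(name, "graph input")))
    return findings


def report_resize_inputs(onnx_path):
    """Load a model from disk and print one line per finding."""
    import onnx  # third-party dependency, imported lazily (assumed installed)

    model = onnx.load(onnx_path)
    for resize, tensor, op in find_non_initializer_resize_inputs(model.graph):
        print(f"{resize}: input {tensor!r} comes from {op}, not an initializer")
```

Calling report_resize_inputs("yolo_export/yolov8n/yolov8n_end2end.onnx") would print one line for each Resize input that is computed at runtime rather than stored in the model.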

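This is usually caused by the TensorRT version (or possibly the PyTorch version). Note that the INT64 message is only a warning and also shows up in my successful build; what actually stops your build is the "Resize scales must be an initializer!" assertion. My terminal output is below: it uses TensorRT 8.6.1, torch 1.10.1+cu111 and onnx 1.14.0, while your log shows TensorRT 8.2.4, torch 1.12.1+cu113 and onnx 1.15.0. Try matching your versions to mine.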
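
To compare environments, the Python-side package versions can be printed with the standard library. This is only a sketch and not part of export.py: the distribution names below are assumptions (a TensorRT install from a tarball or .deb may not be registered under the name tensorrt), and the trtexec version is reported separately in its own "TensorRT version" log line.

```python
from importlib import metadata


def installed_versions(packages=("torch", "onnx", "tensorrt", "ultralytics")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print a small table that can be pasted into the issue.
for name, version in installed_versions().items():
    print(f"{name:12s} {version or 'not installed'}")
```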

By the way, I found and fixed a small bug in export.py, so please update your copy of export.py.

lsh@MI:~/code/python/yolov8_trt$ python export.py --mode end2end --weights yolov8n.pt --batch 1 --img-size 640 640 --onnx yolov8n.onnx --cfg yolov8n.yaml --workspace 1 --fp16 --conf-thres 0.2 --nms-thres 0.45 --topk 300 --keep 100
Fusing layers... 
YOLOv8n summary: 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
Ultralytics YOLOv8.0.3 🚀 Python-3.8.10 torch-1.10.1+cu111 CPU
half=True only compatible with GPU or CoreML export, i.e. use device=0 or format=coreml
Fusing layers... 
YOLOv8n summary: 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs

PyTorch: starting from yolov8n.pt with output shape (1, 84, 8400) (6.2 MB)

ONNX: starting export with onnx 1.14.0...
ONNX: export success ✅ 1.5s, saved as ./yolo_export/yolov8n/yolov8n.onnx (12.2 MB)

Export complete (2.5s)
Results saved to /home/lsh/code/python/yolov8_trt
Predict:         yolo task=detect mode=predict model=./yolo_export/yolov8n/yolov8n.onnx -WARNING ⚠️ not yet supported for YOLOv8 exported models
Validate:        yolo task=detect mode=val model=./yolo_export/yolov8n/yolov8n.onnx -WARNING ⚠️ not yet supported for YOLOv8 exported models
Visualize:       https://netron.app
2023-12-07 11:13:30.130 | INFO     | utils.export_utils:info:40 - origin input size: [1, 3, 640, 640]
2023-12-07 11:13:30.130 | INFO     | utils.export_utils:info:41 - origin output size: [1, 84, 8400]
2023-12-07 11:13:30.131 | INFO     | utils.export_utils:add_postprocess:57 - add transpose layer: [1, 84, 8400] ---→ [1, 8400, 84]
2023-12-07 11:13:30.131 | INFO     | utils.export_utils:add_without_obj_conf:68 - add layers without obj conf
2023-12-07 11:13:30.133 | INFO     | utils.export_utils:add_nms:241 - start add nms layers
2023-12-07 11:13:30.161 | INFO     | utils.export_utils:add_nms:265 - onnx file saved to yolo_export/yolov8n/yolov8n_end2end.onnx
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=yolo_export/yolov8n/yolov8n_end2end.onnx --saveEngine=yolo_export/yolov8n/yolov8n_end2end.engine --workspace=1024 --fp16
[12/07/2023-11:13:30] [W] --workspace flag has been deprecated by --memPoolSize flag.
[12/07/2023-11:13:30] [I] === Model Options ===
[12/07/2023-11:13:30] [I] Format: ONNX
[12/07/2023-11:13:30] [I] Model: yolo_export/yolov8n/yolov8n_end2end.onnx
[12/07/2023-11:13:30] [I] Output:
[12/07/2023-11:13:30] [I] === Build Options ===
[12/07/2023-11:13:30] [I] Max batch: explicit batch
[12/07/2023-11:13:30] [I] Memory Pools: workspace: 1024 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[12/07/2023-11:13:30] [I] minTiming: 1
[12/07/2023-11:13:30] [I] avgTiming: 8
[12/07/2023-11:13:30] [I] Precision: FP32+FP16
[12/07/2023-11:13:30] [I] LayerPrecisions: 
[12/07/2023-11:13:30] [I] Layer Device Types: 
[12/07/2023-11:13:30] [I] Calibration: 
[12/07/2023-11:13:30] [I] Refit: Disabled
[12/07/2023-11:13:30] [I] Version Compatible: Disabled
[12/07/2023-11:13:30] [I] TensorRT runtime: full
[12/07/2023-11:13:30] [I] Lean DLL Path: 
[12/07/2023-11:13:30] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[12/07/2023-11:13:30] [I] Exclude Lean Runtime: Disabled
[12/07/2023-11:13:30] [I] Sparsity: Disabled
[12/07/2023-11:13:30] [I] Safe mode: Disabled
[12/07/2023-11:13:30] [I] Build DLA standalone loadable: Disabled
[12/07/2023-11:13:30] [I] Allow GPU fallback for DLA: Disabled
[12/07/2023-11:13:30] [I] DirectIO mode: Disabled
[12/07/2023-11:13:30] [I] Restricted mode: Disabled
[12/07/2023-11:13:30] [I] Skip inference: Disabled
[12/07/2023-11:13:30] [I] Save engine: yolo_export/yolov8n/yolov8n_end2end.engine
[12/07/2023-11:13:30] [I] Load engine: 
[12/07/2023-11:13:30] [I] Profiling verbosity: 0
[12/07/2023-11:13:30] [I] Tactic sources: Using default tactic sources
[12/07/2023-11:13:30] [I] timingCacheMode: local
[12/07/2023-11:13:30] [I] timingCacheFile: 
[12/07/2023-11:13:30] [I] Heuristic: Disabled
[12/07/2023-11:13:30] [I] Preview Features: Use default preview flags.
[12/07/2023-11:13:30] [I] MaxAuxStreams: -1
[12/07/2023-11:13:30] [I] BuilderOptimizationLevel: -1
[12/07/2023-11:13:30] [I] Input(s)s format: fp32:CHW
[12/07/2023-11:13:30] [I] Output(s)s format: fp32:CHW
[12/07/2023-11:13:30] [I] Input build shapes: model
[12/07/2023-11:13:30] [I] Input calibration shapes: model
[12/07/2023-11:13:30] [I] === System Options ===
[12/07/2023-11:13:30] [I] Device: 0
[12/07/2023-11:13:30] [I] DLACore: 
[12/07/2023-11:13:30] [I] Plugins:
[12/07/2023-11:13:30] [I] setPluginsToSerialize:
[12/07/2023-11:13:30] [I] dynamicPlugins:
[12/07/2023-11:13:30] [I] ignoreParsedPluginLibs: 0
[12/07/2023-11:13:30] [I] 
[12/07/2023-11:13:30] [I] === Inference Options ===
[12/07/2023-11:13:30] [I] Batch: Explicit
[12/07/2023-11:13:30] [I] Input inference shapes: model
[12/07/2023-11:13:30] [I] Iterations: 10
[12/07/2023-11:13:30] [I] Duration: 3s (+ 200ms warm up)
[12/07/2023-11:13:30] [I] Sleep time: 0ms
[12/07/2023-11:13:30] [I] Idle time: 0ms
[12/07/2023-11:13:30] [I] Inference Streams: 1
[12/07/2023-11:13:30] [I] ExposeDMA: Disabled
[12/07/2023-11:13:30] [I] Data transfers: Enabled
[12/07/2023-11:13:30] [I] Spin-wait: Disabled
[12/07/2023-11:13:30] [I] Multithreading: Disabled
[12/07/2023-11:13:30] [I] CUDA Graph: Disabled
[12/07/2023-11:13:30] [I] Separate profiling: Disabled
[12/07/2023-11:13:30] [I] Time Deserialize: Disabled
[12/07/2023-11:13:30] [I] Time Refit: Disabled
[12/07/2023-11:13:30] [I] NVTX verbosity: 0
[12/07/2023-11:13:30] [I] Persistent Cache Ratio: 0
[12/07/2023-11:13:30] [I] Inputs:
[12/07/2023-11:13:30] [I] === Reporting Options ===
[12/07/2023-11:13:30] [I] Verbose: Disabled
[12/07/2023-11:13:30] [I] Averages: 10 inferences
[12/07/2023-11:13:30] [I] Percentiles: 90,95,99
[12/07/2023-11:13:30] [I] Dump refittable layers:Disabled
[12/07/2023-11:13:30] [I] Dump output: Disabled
[12/07/2023-11:13:30] [I] Profile: Disabled
[12/07/2023-11:13:30] [I] Export timing to JSON file: 
[12/07/2023-11:13:30] [I] Export output to JSON file: 
[12/07/2023-11:13:30] [I] Export profile to JSON file: 
[12/07/2023-11:13:30] [I] 
[12/07/2023-11:13:30] [I] === Device Information ===
[12/07/2023-11:13:30] [I] Selected Device: NVIDIA GeForce MX450
[12/07/2023-11:13:30] [I] Compute Capability: 7.5
[12/07/2023-11:13:30] [I] SMs: 14
[12/07/2023-11:13:30] [I] Device Global Memory: 1870 MiB
[12/07/2023-11:13:30] [I] Shared Memory per SM: 64 KiB
[12/07/2023-11:13:30] [I] Memory Bus Width: 64 bits (ECC disabled)
[12/07/2023-11:13:30] [I] Application Compute Clock Rate: 1.575 GHz
[12/07/2023-11:13:30] [I] Application Memory Clock Rate: 3.501 GHz
[12/07/2023-11:13:30] [I] 
[12/07/2023-11:13:30] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[12/07/2023-11:13:30] [I] 
[12/07/2023-11:13:30] [I] TensorRT version: 8.6.1
[12/07/2023-11:13:30] [I] Loading standard plugins
[12/07/2023-11:13:30] [I] [TRT] [MemUsageChange] Init CUDA: CPU +209, GPU +0, now: CPU 214, GPU 132 (MiB)
[12/07/2023-11:13:36] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +740, GPU +174, now: CPU 1030, GPU 306 (MiB)
[12/07/2023-11:13:36] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[12/07/2023-11:13:36] [I] Start parsing network model.
[12/07/2023-11:13:36] [I] [TRT] ----------------------------------------------------------------
[12/07/2023-11:13:36] [I] [TRT] Input filename:   yolo_export/yolov8n/yolov8n_end2end.onnx
[12/07/2023-11:13:36] [I] [TRT] ONNX IR version:  0.0.7
[12/07/2023-11:13:36] [I] [TRT] Opset version:    11
[12/07/2023-11:13:36] [I] [TRT] Producer name:    
[12/07/2023-11:13:36] [I] [TRT] Producer version: 
[12/07/2023-11:13:36] [I] [TRT] Domain:           
[12/07/2023-11:13:36] [I] [TRT] Model version:    0
[12/07/2023-11:13:36] [I] [TRT] Doc string:       
[12/07/2023-11:13:36] [I] [TRT] ----------------------------------------------------------------
[12/07/2023-11:13:36] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/07/2023-11:13:36] [I] [TRT] No importer registered for op: BatchedNMS_TRT. Attempting to import as plugin.
[12/07/2023-11:13:36] [I] [TRT] Searching for plugin: BatchedNMS_TRT, plugin_version: 1, plugin_namespace: 
[12/07/2023-11:13:36] [W] [TRT] builtin_op_importers.cpp:5221: Attribute caffeSemantics not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build.
[12/07/2023-11:13:36] [I] [TRT] Successfully created plugin: BatchedNMS_TRT
[12/07/2023-11:13:36] [I] Finished parsing network model. Parse time: 0.109849
[12/07/2023-11:13:36] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[12/07/2023-11:13:36] [I] [TRT] Graph optimization time: 0.0413027 seconds.
[12/07/2023-11:13:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +520, GPU +248, now: CPU 1566, GPU 554 (MiB)
[12/07/2023-11:13:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +114, GPU +76, now: CPU 1680, GPU 630 (MiB)
[12/07/2023-11:13:39] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[12/07/2023-11:13:39] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[12/07/2023-11:19:57] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[12/07/2023-11:19:57] [I] [TRT] Total Host Persistent Memory: 264160
[12/07/2023-11:19:57] [I] [TRT] Total Device Persistent Memory: 2225664
[12/07/2023-11:19:57] [I] [TRT] Total Scratch Memory: 16455168
[12/07/2023-11:19:57] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 8 MiB, GPU 261 MiB
[12/07/2023-11:19:57] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 108 steps to complete.
[12/07/2023-11:19:57] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 3.78352ms to assign 6 blocks to 108 nodes requiring 23264768 bytes.
[12/07/2023-11:19:57] [I] [TRT] Total Activation Memory: 23264768
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +32, now: CPU 1736, GPU 700 (MiB)
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +34, now: CPU 1736, GPU 734 (MiB)
[12/07/2023-11:19:57] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[12/07/2023-11:19:57] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[12/07/2023-11:19:57] [W] [TRT] Check verbose logs for the list of affected weights.
[12/07/2023-11:19:57] [W] [TRT] - 59 weights are affected by this issue: Detected subnormal FP16 values.
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +6, GPU +9, now: CPU 6, GPU 9 (MiB)
[12/07/2023-11:19:57] [I] Engine built in 387.52 sec.
[12/07/2023-11:19:57] [I] [TRT] Loaded engine size: 7 MiB
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +34, now: CPU 967, GPU 456 (MiB)
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +32, now: CPU 967, GPU 488 (MiB)
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8, now: CPU 0, GPU 8 (MiB)
[12/07/2023-11:19:57] [I] Engine deserialized in 0.0255395 sec.
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +32, now: CPU 967, GPU 458 (MiB)
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +32, now: CPU 967, GPU 490 (MiB)
[12/07/2023-11:19:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +24, now: CPU 0, GPU 32 (MiB)
[12/07/2023-11:19:57] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[12/07/2023-11:19:57] [I] Setting persistentCacheLimit to 0 bytes.
[12/07/2023-11:19:57] [I] Using random values for input input_0
[12/07/2023-11:19:57] [I] Input binding for input_0 with dimensions 1x3x640x640 is created.
[12/07/2023-11:19:57] [I] Output binding for num_detections with dimensions 1 is created.
[12/07/2023-11:19:57] [I] Output binding for nmsed_boxes with dimensions 1x100x4 is created.
[12/07/2023-11:19:57] [I] Output binding for nmsed_scores with dimensions 1x100 is created.
[12/07/2023-11:19:57] [I] Output binding for nmsed_classes with dimensions 1x100 is created.
[12/07/2023-11:19:57] [I] Starting inference
[12/07/2023-11:20:01] [I] Warmup completed 25 queries over 200 ms
[12/07/2023-11:20:01] [I] Timing trace has 375 queries over 3.02251 s
[12/07/2023-11:20:01] [I] 
[12/07/2023-11:20:01] [I] === Trace details ===
[12/07/2023-11:20:01] [I] Trace averages of 10 runs:
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.04018 ms - Host latency: 9.65986 ms (enqueue 0.391347 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.06021 ms - Host latency: 9.68235 ms (enqueue 0.38476 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.06596 ms - Host latency: 9.68654 ms (enqueue 0.382184 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02823 ms - Host latency: 9.64688 ms (enqueue 0.38764 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.04249 ms - Host latency: 9.65966 ms (enqueue 0.397418 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.05179 ms - Host latency: 9.67533 ms (enqueue 0.394867 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03177 ms - Host latency: 9.6528 ms (enqueue 0.441193 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.05048 ms - Host latency: 9.67842 ms (enqueue 0.485565 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03894 ms - Host latency: 9.65827 ms (enqueue 0.413574 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.04352 ms - Host latency: 9.66728 ms (enqueue 0.410986 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02842 ms - Host latency: 9.65074 ms (enqueue 0.415857 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03427 ms - Host latency: 9.65969 ms (enqueue 0.448169 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03347 ms - Host latency: 9.65881 ms (enqueue 0.503564 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02837 ms - Host latency: 9.6521 ms (enqueue 0.459741 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03608 ms - Host latency: 9.65798 ms (enqueue 0.472925 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.04578 ms - Host latency: 9.67347 ms (enqueue 0.42821 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.0418 ms - Host latency: 9.66664 ms (enqueue 0.411084 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.0325 ms - Host latency: 9.65337 ms (enqueue 0.432385 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.04359 ms - Host latency: 9.67275 ms (enqueue 0.436475 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03551 ms - Host latency: 9.66028 ms (enqueue 0.373267 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.04448 ms - Host latency: 9.67483 ms (enqueue 0.427051 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.0207 ms - Host latency: 9.64279 ms (enqueue 0.394702 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.01709 ms - Host latency: 9.636 ms (enqueue 0.372339 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02876 ms - Host latency: 9.6515 ms (enqueue 0.361438 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03325 ms - Host latency: 9.65352 ms (enqueue 0.396729 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02439 ms - Host latency: 9.64522 ms (enqueue 0.366895 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.04578 ms - Host latency: 9.66819 ms (enqueue 0.362061 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02949 ms - Host latency: 9.64702 ms (enqueue 0.364844 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02734 ms - Host latency: 9.65269 ms (enqueue 0.356421 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02673 ms - Host latency: 9.64539 ms (enqueue 0.38606 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02336 ms - Host latency: 9.64465 ms (enqueue 0.36145 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02446 ms - Host latency: 9.65596 ms (enqueue 0.358228 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03501 ms - Host latency: 9.6562 ms (enqueue 0.406079 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03921 ms - Host latency: 9.66177 ms (enqueue 0.394385 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.02356 ms - Host latency: 9.64351 ms (enqueue 0.366382 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03716 ms - Host latency: 9.66223 ms (enqueue 0.369971 ms)
[12/07/2023-11:20:01] [I] Average on 10 runs - GPU latency: 8.03315 ms - Host latency: 9.65959 ms (enqueue 0.407007 ms)
[12/07/2023-11:20:01] [I] 
[12/07/2023-11:20:01] [I] === Performance summary ===
[12/07/2023-11:20:01] [I] Throughput: 124.069 qps
[12/07/2023-11:20:01] [I] Latency: min = 9.50659 ms, max = 9.77417 ms, mean = 9.65802 ms, median = 9.65796 ms, percentile(90%) = 9.6972 ms, percentile(95%) = 9.7081 ms, percentile(99%) = 9.7489 ms
[12/07/2023-11:20:01] [I] Enqueue Time: min = 0.335449 ms, max = 0.94751 ms, mean = 0.403062 ms, median = 0.392822 ms, percentile(90%) = 0.45575 ms, percentile(95%) = 0.490234 ms, percentile(99%) = 0.693237 ms
[12/07/2023-11:20:01] [I] H2D Latency: min = 1.5907 ms, max = 1.65784 ms, mean = 1.60357 ms, median = 1.60181 ms, percentile(90%) = 1.61365 ms, percentile(95%) = 1.61865 ms, percentile(99%) = 1.62708 ms
[12/07/2023-11:20:01] [I] GPU Compute Time: min = 7.88892 ms, max = 8.16174 ms, mean = 8.03518 ms, median = 8.03516 ms, percentile(90%) = 8.06912 ms, percentile(95%) = 8.08142 ms, percentile(99%) = 8.11987 ms
[12/07/2023-11:20:01] [I] D2H Latency: min = 0.0136719 ms, max = 0.03479 ms, mean = 0.0192687 ms, median = 0.0185547 ms, percentile(90%) = 0.0229492 ms, percentile(95%) = 0.0241699 ms, percentile(99%) = 0.032959 ms
[12/07/2023-11:20:01] [I] Total Host Walltime: 3.02251 s
[12/07/2023-11:20:01] [I] Total GPU Compute Time: 3.01319 s
[12/07/2023-11:20:01] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/07/2023-11:20:01] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=yolo_export/yolov8n/yolov8n_end2end.onnx --saveEngine=yolo_export/yolov8n/yolov8n_end2end.engine --workspace=1024 --fp16
2023-12-07 11:20:01.702 | INFO     | __main__:end2end:217 - yolo_export/yolov8n/yolov8n_end2end.engine is saved for c++ inference
2023-12-07 11:20:01.805 | INFO     | __main__:end2end:232 - yolo_export/yolov8n/yolov8n_end2end.pt is saved for python inference
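
When comparing two runs like the ones in this thread, it can help to reduce each trtexec log to its version, warnings, errors, final status and throughput. The helper below is a hypothetical sketch that assumes one log entry per line in the format shown above; it is not part of trtexec or export.py.

```python
import re

# Matches entries such as "[12/07/2023-11:13:30] [I] TensorRT version: 8.6.1".
ENTRY = re.compile(
    r"^\[\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2}\] \[(?P<level>[IWE])\] (?P<msg>.*)$"
)
STATUS = re.compile(r"^&&&& (PASSED|FAILED) TensorRT\.trtexec")


def summarize_trtexec_log(text):
    """Collect version, status, throughput, warnings and errors from a log."""
    summary = {
        "version": None,
        "status": None,
        "throughput_qps": None,
        "warnings": [],
        "errors": [],
    }
    for line in text.splitlines():
        status = STATUS.match(line)
        if status:
            summary["status"] = status.group(1)
            continue
        entry = ENTRY.match(line)
        if not entry:
            continue  # continuation lines (e.g. the node dump) are ignored
        level, msg = entry.group("level"), entry.group("msg")
        if level == "W":
            summary["warnings"].append(msg)
        elif level == "E":
            summary["errors"].append(msg)
        elif msg.startswith("TensorRT version: "):
            summary["version"] = msg.split(": ", 1)[1]
        elif msg.startswith("Throughput: "):
            summary["throughput_qps"] = float(msg.split()[1])
    return summary
```

Run on both logs, this makes it easy to see that the INT64 message lands in "warnings" for both runs, while only the failing run has entries in "errors".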

LSH9832 / yolov8_trt

Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64 #7