cyrusbehr / YOLOv8-TensorRT-CPP

YOLOv8 TensorRT C++ Implementation

Unable to generate seg engine trained with Ultralytics version 8.0.183 #27

Open YumainOB opened 1 year ago

YumainOB commented 1 year ago

Hello and thank you for your good job in bringing Yolov8 to the TensorRT C++ side.

I would like to help if possible, but for now I'm facing an issue with engine creation for a segmentation model. It seems that something is missing for "ConvTranspose_178 (CaskDeconvolution)", if I don't misunderstand the logs.

I run the code on a TX2 board (with the feat/jetson-tx2 branch, obviously). Here is the Jetson environment:

$ jetson_release
Software part of jetson-stats 4.2.3 - (c) 2023, Raffaello Bonghi
Model: quill - Jetpack 4.6.4 [L4T 32.7.4]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:

Here is the command I use:

./benchmark --model yolov8n_seg.onnx --input ~/workspace/ppanto_yolo/test_ressources --precision FP16 --class-names class1 class2

Here is the relevant part of the logs.

--------------- Timing Runner: ConvTranspose_178 (CudnnDeconvolution)
CudnnDeconvolution has no valid tactics for this config, skipping
--------------- Timing Runner: ConvTranspose_178 (GemmDeconvolution)
Tactic: 0 skipped. Scratch requested: 8192000, available: 0
Fastest Tactic: -3360065831133338131 Time: inf
--------------- Timing Runner: ConvTranspose_178 (CaskDeconvolution)
CaskDeconvolution has no valid tactics for this config, skipping
*************** Autotuning format combination: Float(409600,1,5120,64) -> Float(1638400,1,10240,64) ***************
--------------- Timing Runner: ConvTranspose_178 (CudnnDeconvolution)
CudnnDeconvolution has no valid tactics for this config, skipping
--------------- Timing Runner: ConvTranspose_178 (GemmDeconvolution)
GemmDeconvolution has no valid tactics for this config, skipping
--------------- Timing Runner: ConvTranspose_178 (CaskDeconvolution)
CaskDeconvolution has no valid tactics for this config, skipping
*************** Autotuning format combination: Half(409600,6400,80,1) -> Half(1638400,25600,160,1) ***************
--------------- Timing Runner: ConvTranspose_178 (CudnnDeconvolution)
CudnnDeconvolution has no valid tactics for this config, skipping
--------------- Timing Runner: ConvTranspose_178 (GemmDeconvolution)
Tactic: 0 skipped. Scratch requested: 4096000, available: 0
Fastest Tactic: -3360065831133338131 Time: inf
--------------- Timing Runner: ConvTranspose_178 (CaskDeconvolution)
CaskDeconvolution has no valid tactics for this config, skipping
*************** Autotuning format combination: Half(204800,6400:2,80,1) -> Half(819200,25600:2,160,1) ***************
--------------- Timing Runner: ConvTranspose_178 (CudnnDeconvolution)
CudnnDeconvolution has no valid tactics for this config, skipping
--------------- Timing Runner: ConvTranspose_178 (GemmDeconvolution)
Tactic: 0 skipped. Scratch requested: 4096000, available: 0
Fastest Tactic: -3360065831133338131 Time: inf
--------------- Timing Runner: ConvTranspose_178 (CaskDeconvolution)
CaskDeconvolution has no valid tactics for this config, skipping
Deleting timing cache: 1496 entries, 2612 hits
10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node ConvTranspose_178.)
2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: Unable to build the TensorRT engine. Try increasing TensorRT log severity to kVERBOSE (in /libs/tensorrt-cpp-api/engine.cpp).
Aborted (core dumped)
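The timing-runner lines above come from verbose logging (the error message suggests raising the severity to kVERBOSE). For reference, a minimal sketch of what such a verbose logger can look like with the standard nvinfer1::ILogger interface; the names are illustrative, not the exact code in /libs/tensorrt-cpp-api/engine.cpp:

#include <NvInfer.h>
#include <iostream>

// Illustrative verbose logger using the standard nvinfer1::ILogger interface;
// the real logger lives in /libs/tensorrt-cpp-api/engine.cpp and may differ.
class VerboseLogger : public nvinfer1::ILogger {
public:
    void log(Severity severity, const char* msg) noexcept override {
        // kVERBOSE is the lowest severity, so this prints everything,
        // including why individual tactics are skipped during autotuning.
        if (severity <= Severity::kVERBOSE) {
            std::cout << msg << std::endl;
        }
    }
};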

Do you have an idea of what I can do to get the model working correctly? What I don't understand is that I can export to an engine using the Ultralytics exporter and trtexec. Do you have a clue?

Best regards

HXB-1997 commented 1 year ago

I also hit the same issue:

nvidia@ubuntu:~/Desktop/HXB/11-4/YOLOv8-TensorRT-CPP/build$ ./detect_object_image --model /home/nvidia/Desktop/HXB/11-4/yolov8n-seg_sim.onnx --input ./bus2.jpg

Searching for engine file with name: yolov8n-seg_sim.engine.NVIDIATegraX2.fp16.1.1
Engine not found, generating. This could take a while...
onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Model only supports fixed batch size of 1

10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node ConvTranspose_177.)
2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
terminate called after throwing an instance of 'std::runtime_error'

  what():  Error: Unable to build the TensorRT engine. Try increasing TensorRT log severity to kVERBOSE (in /libs/tensorrt-cpp-api/engine.cpp).
Aborted (core dumped)

Did you solve it? @YumainOB @cyrusbehr

YumainOB commented 1 year ago

Sorry I still have no clue on this issue.

@cyrusbehr do you have some idea?

HXB-1997 commented 1 year ago

Sorry I still have no clue on this issue.

@cyrusbehr do you have some idea? I think ConvTranspose is supported by TensorRT 8.4, but the current JetPack TensorRT version is 8.2. How can I upgrade TensorRT to 8.4 without upgrading JetPack? @YumainOB

YumainOB commented 1 year ago

As far as I know, it is not possible to upgrade TensorRT without upgrading JetPack.

On the other hand, exporting with the Ultralytics repo (specifically "yolo export ...") works with this JetPack/TensorRT version without any issue, so I doubt that upgrading them is the only way to solve this.

Best regards 

4e4o commented 1 year ago

I've got the same issue. Here are extra logs from TensorRT: 11.txt

YumainOB commented 11 months ago

I found a way to get the engine completely generated, thanks to this post: https://forums.developer.nvidia.com/t/convtranspose-onnx-to-tensorrt-conversion-fail/181720/2. To apply this idea I added the following line in engine.cpp, right after the IBuilderConfig creation and check: config->setMaxWorkspaceSize(30);

That's a nice point
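For context, here is a rough sketch of where that call goes, assuming the TensorRT 8.x builder API shipped with JetPack 4.6; only the setMaxWorkspaceSize() call reflects the change described above, the surrounding code is illustrative. Note that the argument is in bytes, so NVIDIA samples typically pass something like 1U << 30 (1 GiB):

#include <NvInfer.h>
#include <memory>
#include <stdexcept>

// Illustrative sketch, not the exact engine.cpp code (TensorRT 8.x API on JetPack 4.6).
std::unique_ptr<nvinfer1::IBuilderConfig> makeBuilderConfig(nvinfer1::IBuilder& builder) {
    std::unique_ptr<nvinfer1::IBuilderConfig> config(builder.createBuilderConfig());
    if (!config) {
        throw std::runtime_error("Failed to create IBuilderConfig");
    }
    // Give the builder scratch memory so deconvolution tactics are not skipped
    // ("Tactic: 0 skipped. Scratch requested: ..., available: 0" in the verbose log).
    // setMaxWorkspaceSize() takes bytes; 1 GiB is shown here as an example value.
    config->setMaxWorkspaceSize(1U << 30);
    return config;
}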

But I'm facing another issue later with a runtime failure... Here are the logs:

CUDA_LAUNCH_BLOCKING=1 ./detect_object_image --model yolov8n_seg.onnx --input image.jpg
Searching for engine file with name: yolov8n_seg.engine.NVIDIATegraX2.fp16.1.1
Engine found, not regenerating...
[MemUsageChange] Init CUDA: CPU +266, GPU +0, now: CPU 301, GPU 7174 (MiB)
Loaded engine size: 13 MiB
Using cublas as a tactic source
[MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +169, now: CPU 475, GPU 7350 (MiB)
Using cuDNN as a tactic source
[MemUsageChange] Init cuDNN: CPU +250, GPU +252, now: CPU 725, GPU 7602 (MiB)
Deserialization required 2294905 microseconds.
[MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12, now: CPU 0, GPU 12 (MiB)
Using cublas as a tactic source
[MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 725, GPU 7602 (MiB)
Using cuDNN as a tactic source
[MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 725, GPU 7602 (MiB)
Total per-runner device persistent memory is 12509184
Total per-runner host persistent memory is 137424
Allocated activation device memory of size 14695424
[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +26, now: CPU 0, GPU 38 (MiB)
1: [reformat.cu::NCHHW2ToNCHW::1049] Error Code 1: Cuda Runtime (unspecified launch failure)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: Unable to run inference.
Aborted (core dumped)

A similar message happens with precision set to FP32:

CUDA_LAUNCH_BLOCKING=1 ./detect_object_image --model ~/workspace/ppanto_yolo/yolov8n_seg.onnx --input image.jpg --precision FP32
Searching for engine file with name: yolov8n_seg.engine.NVIDIATegraX2.fp32.1.1
Engine found, not regenerating...
[MemUsageChange] Init CUDA: CPU +266, GPU +0, now: CPU 315, GPU 6966 (MiB)
Loaded engine size: 27 MiB
Using cublas as a tactic source
[MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +170, now: CPU 489, GPU 7143 (MiB)
Using cuDNN as a tactic source
[MemUsageChange] Init cuDNN: CPU +250, GPU +251, now: CPU 739, GPU 7394 (MiB)
Deserialization required 2305892 microseconds.
[MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +26, now: CPU 0, GPU 26 (MiB)
Using cublas as a tactic source
[MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 739, GPU 7394 (MiB)
Using cuDNN as a tactic source
[MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 739, GPU 7394 (MiB)
Total per-runner device persistent memory is 27359232
Total per-runner host persistent memory is 129312
Allocated activation device memory of size 22171136
[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +47, now: CPU 0, GPU 73 (MiB)
1: [pointWiseV2Helpers.h::launchPwgenKernel::546] Error Code 1: Cuda Driver (unspecified launch failure)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: Unable to run inference.
Aborted (core dumped)
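In case it helps localize the failing kernel, here is a small debugging sketch (my own helper, not the actual engine.cpp code) that surfaces the CUDA error state right after the inference call; with CUDA_LAUNCH_BLOCKING=1 set, the first failing launch should then be reported at this point:

#include <cuda_runtime.h>
#include <iostream>

// Debugging sketch only, not part of the project: report the CUDA error state after inference.
bool checkCudaAfterInference(cudaStream_t stream) {
    // Wait for all queued work so an asynchronous launch failure is reported here.
    cudaError_t syncErr = cudaStreamSynchronize(stream);
    // Fetch (and clear) any sticky error left by an earlier kernel launch.
    cudaError_t lastErr = cudaGetLastError();
    if (syncErr != cudaSuccess || lastErr != cudaSuccess) {
        std::cerr << "CUDA error after inference: "
                  << cudaGetErrorString(syncErr != cudaSuccess ? syncErr : lastErr) << std::endl;
        return false;
    }
    return true;
}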

@cyrusbehr Do you have any clue?