Cuda Runtime (out of memory) failure of TensorRT 10.3.0 when running trtexec on GPU RTX4060/jetson/etc

Description

I'm trying to convert a YOLOv8-seg model to a TensorRT engine, using DeepStream-Yolo-Seg to export the model to ONNX. When I run trtexec on the exported ONNX file, I get these errors:

[11/22/2024-13:40:08] [I] Finished parsing network model. Parse time: 0.0704936
[11/22/2024-13:40:08] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/22/2024-13:40:47] [E] Error[1]: [defaultAllocator.cpp::allocate::31] Error Code 1: Cuda Runtime (out of memory)
[11/22/2024-13:40:47] [W] [TRT] Requested amount of GPU memory (15485030400 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/22/2024-13:40:47] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [tunable_graph.cpp:create:117] autotuning: User allocator error allocating 15485030400-byte buffer
[11/22/2024-13:40:47] [E] Error[10]: IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/1/Constant_36_output_0.../1/Slice_4]}.)
[11/22/2024-13:40:47] [E] Engine could not be created from network
[11/22/2024-13:40:47] [E] Building engine failed
[11/22/2024-13:40:47] [E] Failed to create engine from model or file.
[11/22/2024-13:40:47] [E] Engine set up failed

With TensorRT 10.0.0.6-1+cuda11.8 the engine can be created, but with any newer version the build fails.
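
For scale, the size in the warning above can be converted to GiB and compared with the card's memory. This is a minimal sketch: the regular expression is written only against the one log line shown here, and the 8 GiB figure for the RTX4060 is an assumption that should be checked with nvidia-smi on the actual machine.

```python
import re

# Warning line copied from the trtexec log above.
LOG_LINE = (
    "[11/22/2024-13:40:47] [W] [TRT] Requested amount of GPU memory "
    "(15485030400 bytes) could not be allocated. There may not be enough "
    "free memory for allocation to succeed."
)

GIB = 1024 ** 3
ASSUMED_VRAM_GIB = 8  # assumption: total memory of the RTX4060, verify with nvidia-smi


def requested_bytes(line):
    """Return the byte count from a '(N bytes)' fragment, or None if absent."""
    match = re.search(r"\((\d+) bytes\)", line)
    return int(match.group(1)) if match else None


req = requested_bytes(LOG_LINE)
req_gib = req / GIB

print(f"requested: {req_gib:.2f} GiB")
print(f"exceeds assumed VRAM: {req_gib > ASSUMED_VRAM_GIB}")
```

Under that assumption, the single buffer requested during autotuning (about 14.42 GiB) is larger than the whole card, so the allocation would fail regardless of what else is running on the GPU.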

Environment

TensorRT Version: 10.3.0
NVIDIA GPU: RTX4060
NVIDIA Driver Version: 565.57.01
CUDA Version: 12.6
CUDNN Version: 9.5.1.17-1

Operating System: Ubuntu 22.04.5 LTS
Python Version (if applicable): 3.10.12-1~22.04.7
PyTorch Version (if applicable): 2.5.1

Steps To Reproduce

python export_yoloV8_seg.py --weights yolov8s-seg.pt
/usr/src/tensorrt/bin/trtexec --onnx=yolov8s-seg.onnx
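
One variation that might help narrow this down is to repeat the second step with the builder workspace capped and verbose logging enabled. The sketch below only assembles that command line. `--memPoolSize=workspace:<MiB>` and `--verbose` are existing trtexec options, but the 4096 MiB value and the helper name `build_trtexec_cmd` are illustrative assumptions, and I have not confirmed that this changes the result for this model.

```python
import shlex


def build_trtexec_cmd(onnx_path, workspace_mib,
                      trtexec="/usr/src/tensorrt/bin/trtexec"):
    """Return the trtexec argument list with a capped workspace pool.

    Hypothetical helper for illustration; it does not run trtexec.
    """
    return [
        trtexec,
        f"--onnx={onnx_path}",
        f"--memPoolSize=workspace:{workspace_mib}",  # pool size in MiB
        "--verbose",  # log tactic selection in more detail
    ]


# 4096 MiB is an arbitrary value chosen to stay below the card's memory.
cmd = build_trtexec_cmd("yolov8s-seg.onnx", workspace_mib=4096)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd)` on a machine where trtexec is installed.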
