I have the same error when trying to build a TRT file from an ONNX file, on the same OS version, with driver version 560.35.03 and TensorRT 10.4.
Also, when I try to load a TRT file that was built last week (before upgrading to GCC 13), I get this stack trace:
#0 0x00007ffff7ea516e in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1 0x00007ffff7ea5d5a in _Unwind_Find_FDE () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2 0x00007ffff7ea160a in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#3 0x00007ffff7ea307d in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#4 0x00007ffff7cb705b in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fffbec3c0e5 in ?? () from /usr/lib/x86_64-linux-gnu//libnvinfer.so.10
#6 0x00007fffbecf24c6 in ?? () from /usr/lib/x86_64-linux-gnu//libnvinfer.so.10
#7 0x00007fffbecf41ac in ?? () from /usr/lib/x86_64-linux-gnu//libnvinfer.so.10
#8 0x00007fffbecf53ca in ?? () from /usr/lib/x86_64-linux-gnu//libnvinfer.so.10
#9 0x00007fffd32d6e7a in std::default_delete<nvinfer1::IRuntime>::operator() (this=0x7fffad7fdc58, __ptr=0x7fff740040b0) at /usr/include/c++/13/bits/unique_ptr.h:99
#10 0x00007fffd32d5cae in std::unique_ptr<nvinfer1::IRuntime, std::default_delete<nvinfer1::IRuntime> >::~unique_ptr (this=0x7fffad7fdc58, __in_chrg=<optimized out>)
at /usr/include/c++/13/bits/unique_ptr.h:404
#11 0x00007fffd32d2919 in loadCudaEngine (trtPath="my_file.trt", logger=warning: RTTI symbol not found for class 'TensorRTLogger'
...)
at loading_trt.cpp:170
This happens when deallocating an IRuntime after deserializeCudaEngine was called; otherwise it can be deleted. (I don't know if anything changes inside IRuntime, but before upgrading to GCC 13 this was working fine. I also tried keeping the IRuntime alive a bit longer, but then it just crashes further along in the process.)
When using memcheck, with a stack trace matching the one above, there's this message:
Use of uninitialised value of size 8
I was able to fix my bug above by keeping the IRuntime alive longer than the engine. Then I had a secondary logic bug (which is why it didn't work the first time I tried that). But this wasn't needed before, or at least it kept going.
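For reference, here is a minimal sketch of the ownership order that avoids the crash for me. The loadEngine function and its signature below are illustrative only (my actual loadCudaEngine is different); the only point is that the IRuntime must outlive the ICudaEngine it deserialized.

#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// Sketch only: the IRuntime is owned by the caller so it outlives the engine.
std::unique_ptr<nvinfer1::ICudaEngine> loadEngine(const std::string& trtPath,
                                                  nvinfer1::ILogger& logger,
                                                  std::unique_ptr<nvinfer1::IRuntime>& runtime)
{
    std::ifstream file(trtPath, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    // The runtime is stored in the caller-provided pointer, not a local,
    // so the caller can destroy it only after destroying the returned engine.
    runtime.reset(nvinfer1::createInferRuntime(logger));
    return std::unique_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
}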
As for creating the TRT file from an .onnx file, the crash above only happens for me if config->setBuilderOptimizationLevel(5); is called. Leaving it out, or using any optimization level below 5, prevents the crash.
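For context, the relevant call sits in a build path roughly like this. It is only a sketch of a standard ONNX-parser workflow (the surrounding setup and error handling here are assumptions, not my actual code):

#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <memory>

// Sketch: build a serialized engine from an ONNX file. "logger" is assumed to be
// an existing nvinfer1::ILogger implementation; error checking is omitted.
std::unique_ptr<nvinfer1::IHostMemory> buildFromOnnx(const char* onnxPath,
                                                     nvinfer1::ILogger& logger)
{
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
    auto parser  = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));
    parser->parseFromFile(onnxPath, static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    // Level 5 is what triggers the crash for me; 4 or lower (or omitting the call) does not.
    config->setBuilderOptimizationLevel(4);

    return std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
}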
Hope this helps with diagnosing the problem.
Please see here for the supported GCC versions on each platform https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#software-version-platform
Does setting -D_GLIBCXX_USE_CXX11_ABI=1 help at all to resolve this issue?
@zeroepoch I built TRT with GLIBCXX_USE_CXX11_ABI=1, but I'm still experiencing segmentation faults with nvinfer, both when converting a model from ONNX using trtexec and during inference in holoinfer (the Holoscan operator using TensorRT).
I'm a bit puzzled about how setting GLIBCXX_USE_CXX11_ABI=1 could help with this. I thought that GCC 5 and later versions use CXX11_ABI=1 by default, and since TRT is built with GCC 8, it shouldn't be an issue?
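As a quick sanity check (nothing TensorRT-specific), one can print the libstdc++ macro to see which ABI a translation unit is actually compiled with:

#include <iostream>

int main()
{
#ifdef _GLIBCXX_USE_CXX11_ABI
    // Defaults to 1 since GCC 5, unless overridden with -D_GLIBCXX_USE_CXX11_ABI=0
    // or by how libstdc++ itself was configured.
    std::cout << "_GLIBCXX_USE_CXX11_ABI = " << _GLIBCXX_USE_CXX11_ABI << '\n';
#else
    std::cout << "not compiled against libstdc++\n";
#endif
    return 0;
}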
@zeroepoch I thought that GCC 5 and later versions use CXX11_ABI=1 by default, and since TRT is built with GCC 8, it shouldn't be an issue?
We force the older C++ ABI to increase compatibility with RHEL 7. That will be changing in a future release.
Could you try the latest release, TRT 10.6? We've officially supported Ubuntu 24.04 with GCC 13 for the last few TRT releases.
Hi @zeroepoch! Many thanks for your support!
I created a repo with instructions on how to replicate the issue: https://github.com/jokla/trt_gcc13.
I added a vanilla YOLOv8n ONNX model from Ultralytics. It was generated with:
!pip install ultralytics
from ultralytics import YOLO
model = YOLO("yolov8n.pt")
success = model.export(format="onnx")
I get a segmentation fault when I try to parse the model with FP16, reaching CaskConvolution[0x80000009].
[11/10/2024-17:27:44] [V] [TRT] /models.0/backbone/backbone/dark2/dark2.1/m/m.0/conv2/conv/Conv + PWN(PWN(PWN(/models.0/backbone/backbone/dark2/dark2.1/m/m.0/conv2/act/Sigmoid), PWN(/models.0/backbone/backbone/dark2/dark2.1/m/m.0/conv2/act/Mul)), PWN(/models.0/backbone/backbone/dark2/dark2.1/m/m.0/Add)) (CaskConvolution[0x80000009]) profiling completed in 0.370127 seconds. Fastest Tactic: 0x0866ddee325d07a6 Time: 0.0348142
However, I discovered that trtexec also crashes when I run it with an incorrect parameter like trtexec --test.
Tested the following:
I don't think we can easily move to Ubuntu 24.04 since we are using NVIDIA Holoscan, so I have tried to avoid installing GCC 13 from apt (the Ubuntu 22.04 version is GCC 13.1). Instead, I built GCC 13.2 from source and used it in the tensorrt:24.8 image with TensorRT updated to 10.6. It looks like it is working for now, but having to build GCC from scratch only because of TRT is not ideal.
I am not sure why TRT is not happy with the GCC 13.1 installed by apt; I haven't found a reason yet. Maybe something got fixed in GCC 13.2? This is the list: https://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&resolution=FIXED&target_milestone=13.2
Hi @jokla,
I was able to reproduce your problem. Thank you for the very detailed repro steps! I'm not exactly sure where the problem is being introduced, but I can speculate that it's due to libgcc or libstdc++ being upgraded as part of the GCC 13.1 install. Best I can tell, there is some breakage between the previously compiled trtexec and these new libraries. I didn't try installing the libstdc++ binary from GCC 13.2 to confirm, so it's still speculation.
I was able to find a workaround by rebuilding trtexec. Both the invalid-argument case and the original model you're trying to convert work without crashing. I extended your existing Docker container with the following Dockerfile.
FROM trt_10_6_24_10_gcc13
ENV DEBIAN_FRONTEND=noninteractive
RUN make -C /workspace/tensorrt/samples clean
RUN make -C /workspace/tensorrt/samples samples=trtexec
RUN cp -f /workspace/tensorrt/bin/trtexec /opt/tensorrt/bin/trtexec
Within this new container the following commands work:
docker run --gpus all -it --rm -v ./data:/data trt_10_6_24_10_gcc13_rebuild /usr/bin/bash -c "trtexec --onnx=/data/yolov8n.onnx --fp16 --verbose"
docker run --gpus all -it --rm trt_10_6_24_10_gcc13_rebuild /usr/bin/bash -c "trtexec --test && pwd"
I also want to mention that 24.11, which will be released in a week or so, will be based on Ubuntu 24.04, so it will have GCC 13.2 as you mentioned. Maybe this will help with your Holoscan situation?
Hi @zeroepoch ! Thanks for the update.
I tried to rebuild trtexec as you suggested; trtexec --test works, but the conversion is still crashing for me:
#2 __GI___pthread_kill (threadid=135588370841600, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007b5120619476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007b51205ff7f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007b5120af96fd in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#6 0x00007b5120b0e857 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7 0x00007b5120b1007d in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8 0x00007b5120cb805b in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007b50e878f0be in ?? () from /lib/x86_64-linux-gnu/libnvinfer.so.10
Could you confirm that it is actually working for you?
docker run --gpus all -it --rm -v ./data:/data trt_10_6_24_10_gcc13_rebuild /usr/bin/bash -c "trtexec --onnx=/data/yolov8n.onnx --fp16 --verbose"
With this command you don't see a segmentation fault message because the container exits before the shell can print it. If you add any command afterward, the segmentation fault shows up (at least for me):
docker run --gpus all -it --rm -v ./data:/data trt_10_6_24_10_gcc13_rebuild /usr/bin/bash -c "trtexec --onnx=/data/yolov8n.onnx --fp16 --verbose && ls"
Many thanks for your support!
Hi @jokla,
When running this command:
docker run --gpus all -it --rm -v ./data:/data trt_10_6_24_10_gcc13_rebuild /usr/bin/bash -c "trtexec --onnx=/data/yolov8n.onnx --fp16 --verbose"
It ends with:
[11/23/2024-07:29:13] [V] [TRT] /model.2/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.2/m.0/cv2/act/Sigmoid), PWN(/model.2/m.0/cv2/act/Mul)), PWN(/model.2/m.0/Add)) (CaskConvolution[0x80000009]) profiling completed in 0.6333 seconds. Fastest Tactic: 0xa5a46bfbd719d757 Time: 0.00910743
When running this command:
docker run --gpus all -it --rm -v ./data:/data trt_10_6_24_10_gcc13_rebuild /usr/bin/bash -c "trtexec --onnx=/data/yolov8n.onnx --fp16 --verbose && ls"
It ends with:
[11/23/2024-07:31:07] [V] [TRT] /model.2/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.2/m.0/cv2/act/Sigmoid), PWN(/model.2/m.0/cv2/act/Mul)), PWN(/model.2/m.0/Add)) (CaskConvolution[0x80000009]) profiling completed in 0.659246 seconds. Fastest Tactic: 0x69501656100171de Time: 0.00914057
/usr/bin/bash: line 1: 120 Aborted (core dumped) trtexec --onnx=/data/yolov8n.onnx --fp16 --verbose
As you mentioned, it segfaults at the end. I wasn't seeing it before, probably because the container exits before the error gets printed. I'll need to investigate further. Based on the backtrace it looks like a similar issue to before, when trtexec was precompiled, which means recompiling trtexec eventually runs into the same problem.
Since trtexec works with both the default compiler from the 24.10 release and in an Ubuntu 24.04 container with its default compiler, I would have to agree with your observation that there is some compiler issue. I don't think TensorRT can resolve a compatibility issue with a particular version of GCC. Say we compiled TensorRT with GCC 13.1; it might no longer work with the default compiler in Ubuntu 22.04 or 24.04. I haven't tried this, but if this hypothesis is correct, then the solution here is to update the compiler from GCC 13.1 to one without a cross-version compatibility issue.
Description
TensorRT segfaults when parsing an ONNX model (YOLOv8 QAT) with GCC 13 installed on Ubuntu 22.04.
Environment
TensorRT Version: 10.3 or older
NVIDIA GPU: NVIDIA RTX A6000
NVIDIA Driver Version: 560.28.03
CUDA Version: 12.6
CUDNN Version: 8.9.6.50-1+cuda12.2
Operating System:
Container: ubuntu-22.04.Dockerfile + GCC 13 installed
Same issue with nvcr.io/nvidia/tensorrt:24.08-py3 with GCC 13 installed on top of it.
Relevant Files
Steps To Reproduce
Commands or scripts:
It seems that the issue is coming from libnvinfer.so.10 and GCC 13. The TRT open-source version uses a prebuilt nvinfer (from https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.3.0/tars/TensorRT-10.3.0.26.Linux.x86_64-gnu.cuda-12.5.tar.gz), possibly compiled with an older GCC (GCC 8, looking at this table). The conversion works on an Orin with JetPack 6 (probably because TRT is built with a newer GCC version there).
How can I make TRT (and libnvinfer) compatible with GCC 13? Also, is there a specific reason why it's only built with an old version of GCC?
Many thanks!
Have you tried the latest release?: Yes, same issue
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes