My code calls TRT inference for an ONNX model with a dynamic input shape. The shape of the input to the model varies at almost every inference call. I can see the memory usage increasing in jtop, and when I leave the inference call running in an infinite loop, the process crashes after a few hours due to out of memory.
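For context, the inference loop is essentially equivalent to the minimal sketch below (TensorRT 7 API as shipped with JetPack 4.4; the engine file name, binding indices, buffer sizes, and shape schedule are placeholders, not my exact values):

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Placeholder sizes: large enough for the biggest shape in the profile.
    constexpr size_t kMaxInputBytes  = 1UL * 3 * 448 * 448 * sizeof(float);
    constexpr size_t kMaxOutputBytes = 1UL * 1000 * sizeof(float);

    // Deserialize an engine built from the ONNX model with one
    // optimization profile covering the full range of input shapes.
    std::ifstream file("model.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // Device buffers are allocated once, outside the loop.
    void* bindings[2]{};
    cudaMalloc(&bindings[0], kMaxInputBytes);   // binding 0: input (assumed)
    cudaMalloc(&bindings[1], kMaxOutputBytes);  // binding 1: output (assumed)

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The input shape changes on almost every call, as in my application.
    for (int i = 0;; ++i)
    {
        int hw = 320 + (i % 16) * 8;  // placeholder shape schedule
        context->setBindingDimensions(0, nvinfer1::Dims4{1, 3, hw, hw});
        context->enqueueV2(bindings, stream, nullptr);
        cudaStreamSynchronize(stream);
        // Host memory use reported by jtop keeps growing across iterations.
    }
}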
I tried cuda-memcheck, but it reports 0 leaks and 0 errors.
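For reference, these are the invocations I used (the binary name my_trt_app is a placeholder):

cuda-memcheck --leak-check full ./my_trt_app
valgrind --leak-check=full ./my_trt_app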
With valgrind, the definite leaks related to TRT are:
==4904== 56 bytes in 1 blocks are definitely lost in loss record 744 of 2,378
==4904== at 0x48461D0: malloc (vg_replace_malloc.c:381)
==4904== by 0x434E6827: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434DE1A3: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x430BD637: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434DA8CF: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434D8C07: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434D8D3F: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434DFA1B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434E015B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434D4F77: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x43058A63: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x4304FBEF: __cuda_CallJitEntryPoint (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== 5,909 (72 direct, 5,837 indirect) bytes in 1 blocks are definitely lost in loss record 2,089 of 2,378
==4904== at 0x48461D0: malloc (vg_replace_malloc.c:381)
==4904== by 0x434E6827: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x434DE1A3: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x430BD6EB: nvPTXCompilerCreate (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.32.4.3)
==4904== by 0x2C123817: fatBinaryCtl_Compile_WithJITDir (in /usr/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.32.4.3)
==4904== by 0x2B30B61F: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==4904== by 0x2B30C91F: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==4904== by 0x2B2B446B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==4904== by 0x63CCF27: ??? (in /usr/local/cuda-10.2/targets/aarch64-linux/lib/libcudart.so.10.2.89)
Here is my TX1 info: