NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

Xid 31 error in TensorRT when running two CudaGraph captured ExecutionContexts concurrently on RTX 4070 or RTX A4500 #4061

Open soooch opened 1 month ago

soooch commented 1 month ago


I have two TensorRT plans compiled from ONNX using the standard TensorRT builder and ONNX parser.

I can successfully capture the ExecutionContexts derived from these plans to CudaGraphs and launch these on Streams (with outputs as expected).

However, when these operations are launched repeatedly in a loop and certain conditions are met, we eventually encounter an Xid 31 error after an arbitrary, large number of loop iterations. The error surfaces in the program as CUDA error 700 (illegal memory access) when synchronizing the first stream.
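One way to confirm that a given failure corresponds to the Xid 31 event is to filter the kernel log for it. The sketch below is a hypothetical helper (`xid31_lines` is not part of the reproduction repo). It assumes the driver reports the event in the usual `NVRM: Xid (<device>): 31, ...` form; the fields after the code vary between driver versions.

```shell
# Hypothetical helper, not part of the reproduction repo.
# Reads kernel-log text on stdin and prints only the lines that report Xid 31.
# Exit status follows grep: 0 if at least one line matched, 1 if none did.
# Assumption: the driver logs the event as "NVRM: Xid (<device>): 31, ...".
xid31_lines() {
    grep -E 'NVRM: Xid \([^)]*\): 31,'
}
```

On a live system this would be fed from the kernel log, for example `dmesg | xid31_lines` (reading `dmesg` may require root).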

The following conditions must all be true to trigger the error:

compute-sanitizer (all tools) and cuda-memcheck (all tools) report no problems. The issue doesn't seem to occur when running under cuda-gdb. When CUDA_LAUNCH_BLOCKING=1 is set, the error is still reported at synchronization.


TensorRT Version:
GPU Type: tested with RTX 4070 and RTX A4500
Nvidia Driver Version: 550.78 (RTX 4070) or 525.60.13 (RTX A4500)
CUDA Version: tested with 11.8 and 12.3.2
CUDNN Version: 8.9.7
Operating System + Version: tested with linux 6.6 and linux 6.1
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): tested on nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 and nvcr.io/nvidia/tensorrt:24.01-py3

Relevant Files


Steps To Reproduce

git clone git@github.com:soooch/weird-trt-thing.git
cd weird-trt-thing
docker run --gpus all -it --rm -v .:/workspace nvcr.io/nvidia/tensorrt:24.01-py3

Once inside the container:

apt update
apt-get install -y parallel


# need at least 2, but will fail faster if more (hence 16)
parallel -j0 --delay 0.3 ./fuzzer ::: {1..16}
# wait up to ~ 10 minutes. usually much faster
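
If GNU parallel is not available, the concurrent launch can be approximated in plain shell. This is a minimal sketch, not the author's method: `run_copies` is a hypothetical helper, and it assumes `./fuzzer` takes its instance index as its only argument (as the `parallel` invocation above implies) and exits non-zero when it hits the CUDA error, which this thread does not confirm.

```shell
# Hypothetical helper, not part of the reproduction repo.
# Usage: run_copies N DELAY CMD [ARGS...]
# Starts N background copies of CMD, appending the instance index (1..N)
# as the last argument and sleeping DELAY seconds after each start.
# Returns 1 if any copy exits non-zero, otherwise 0.
# Written for POSIX sh, so its working variables are global (rc_ prefix).
run_copies() {
    rc_n=$1
    rc_delay=$2
    shift 2
    rc_pids=""
    rc_failed=0
    rc_i=1
    while [ "$rc_i" -le "$rc_n" ]; do
        "$@" "$rc_i" &
        rc_pids="$rc_pids $!"
        rc_i=$((rc_i + 1))
        sleep "$rc_delay"
    done
    # Wait for every copy and remember whether any of them failed.
    for rc_pid in $rc_pids; do
        wait "$rc_pid" || rc_failed=1
    done
    return "$rc_failed"
}

# Assumed usage, mirroring the parallel invocation above:
#   run_copies 16 0.3 ./fuzzer
```

The helper waits for every copy to finish before reporting, so it only returns once all instances have exited. A fractional delay such as `0.3` relies on a `sleep` that accepts it (GNU coreutils does).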

soooch commented 1 month ago

This issue has also been posted to the Nvidia Developer Forums: https://forums.developer.nvidia.com/t/xid-31-error-when-two-cudagraph-captured-executioncontexts-are-executed-concurrently/302553/1

soooch commented 1 month ago

https://github.com/NVIDIA/TensorRT/issues/3633 sounds very similar.

@zerollzeng @oxana-nvidia any chance we could get confirmation on this being the same issue? And if so, is there any news on a fix?

oxana-nvidia commented 1 month ago

Yes, I think it is the same issue. I still don't have information on which CUDA version is planned to include the fix.