Open soooch opened 1 month ago
This issue has also been posted to the Nvidia Developer Forums: https://forums.developer.nvidia.com/t/xid-31-error-when-two-cudagraph-captured-executioncontexts-are-executed-concurrently/302553/1
https://github.com/NVIDIA/TensorRT/issues/3633 sounds very similar.
@zerollzeng @oxana-nvidia any chance we could get confirmation on this being the same issue? And if so, is there any news on a fix?
Yes, I think it is the same issue. I still don't have information on which CUDA version is planned to include the fix.
Description
I have two TensorRT plans compiled from ONNX using the standard TensorRT builder and ONNX parser.
I can successfully capture the `ExecutionContext`s derived from these plans to `CudaGraph`s and launch these on `Stream`s (with outputs as expected).
However, when launching these operations repeatedly in a loop, and if certain conditions are met, we will eventually encounter a Xid 31 error after an arbitrary, large number of loop iterations. This error manifests itself in the program as a cuda error 700 (illegal memory access) when synchronizing the first stream.
The following conditions must all be true to trigger the error:
- `ExecutionContext`s must be captured to graphs.
- `ExecutionContext`s must be executing in parallel (on two `Stream`s).

compute-sanitizer (all tools) and cuda-memcheck (all tools) report no problems. The issue doesn't seem to pop up when running with cuda-gdb. When CUDA_LAUNCH_BLOCKING=1 is used, the error is still received when synchronizing.
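For readers unfamiliar with the pattern, here is a minimal sketch of the capture-and-launch loop described above, written against the plain CUDA runtime API. It is an illustration under assumptions, not the actual reproduction (that lives in the repository linked under Relevant Files): `placeholder_kernel`, `capture`, and `launch_concurrently` are hypothetical names, and the kernel stands in for the work a TensorRT `ExecutionContext` would enqueue during capture. This sketch is not expected to trigger the Xid 31 error on its own.

```cuda
#include <cuda_runtime.h>

// Propagate the first CUDA error to the caller (cleanup on error is omitted for brevity).
#define CHECK(call)                                   \
    do {                                              \
        cudaError_t err_ = (call);                    \
        if (err_ != cudaSuccess) return err_;         \
    } while (0)

// Hypothetical stand-in for the inference work of one ExecutionContext.
__global__ void placeholder_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Capture one placeholder launch issued on `stream` into an executable graph.
static cudaError_t capture(cudaStream_t stream, float* buf, int n,
                           cudaGraphExec_t* exec) {
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    placeholder_kernel<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    CHECK(cudaStreamEndCapture(stream, &graph));
    CHECK(cudaGraphInstantiateWithFlags(exec, graph, 0));
    CHECK(cudaGraphDestroy(graph));  // the executable graph remains valid
    return cudaSuccess;
}

// Launch two captured graphs on two streams so they can overlap, then sync.
cudaError_t launch_concurrently(int iterations) {
    const int n = 1 << 20;
    cudaStream_t s0, s1;
    float *b0 = nullptr, *b1 = nullptr;
    cudaGraphExec_t g0, g1;

    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaMalloc(&b0, n * sizeof(float)));
    CHECK(cudaMalloc(&b1, n * sizeof(float)));
    CHECK(cudaMemset(b0, 0, n * sizeof(float)));
    CHECK(cudaMemset(b1, 0, n * sizeof(float)));

    CHECK(capture(s0, b0, n, &g0));
    CHECK(capture(s1, b1, n, &g1));

    for (int i = 0; i < iterations; ++i) {
        CHECK(cudaGraphLaunch(g0, s0));
        CHECK(cudaGraphLaunch(g1, s1));
        // In the report, error 700 surfaces when synchronizing the first stream.
        CHECK(cudaStreamSynchronize(s0));
        CHECK(cudaStreamSynchronize(s1));
    }

    CHECK(cudaGraphExecDestroy(g0));
    CHECK(cudaGraphExecDestroy(g1));
    CHECK(cudaFree(b0));
    CHECK(cudaFree(b1));
    CHECK(cudaStreamDestroy(s0));
    CHECK(cudaStreamDestroy(s1));
    return cudaSuccess;
}
```

Calling `launch_concurrently(N)` from `main` and comparing the result to `cudaSuccess` gives the shape of the loop. In the real case each capture would wrap the TensorRT enqueue call for one plan instead of the placeholder kernel.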
Environment
TensorRT Version: 8.6.1.6
GPU Type: tested with RTX 4070 and RTX A4500
Nvidia Driver Version: 550.78 (RTX 4070) or 525.60.13 (RTX A4500)
CUDA Version: tested with 11.8 and 12.3.2
CUDNN Version: 8.9.7
Operating System + Version: tested with linux 6.6 and linux 6.1
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): tested on nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 and nvcr.io/nvidia/tensorrt:24.01-py3
Relevant Files
https://github.com/soooch/weird-trt-thing
Steps To Reproduce
once inside container: