> `[E] 1: [softMaxV2Runner.cpp::execute::226] Error Code 1: Cask (shader run failed)`
Looks like an env issue (related to CUDA, I think) here. Could you please try our official docker image?
Quick try: https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/convert/01_int8_calibration_in_tensorrt in our official docker image.
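For reference, that example drives calibration from a small Python data loader; a minimal sketch of the idea (the input name `"x"` and the shape are illustrative placeholders in the style of the example, not values from this model):

```python
# data_loader.py -- minimal sketch in the style of the linked Polygraphy
# example; the input name ("x") and shape are illustrative placeholders.
import numpy as np

def load_data():
    # Polygraphy calls this to obtain calibration batches: each yielded
    # dict maps input names to numpy arrays.
    for _ in range(5):
        yield {"x": np.ones((1, 1, 2, 2), dtype=np.float32)}
```

which is then consumed by something like `polygraphy convert identity.onnx --int8 --data-loader-script ./data_loader.py -o identity.engine`.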
The int8 calibration example works correctly both in and out of the docker image (ignore the timing values as I was compiling code at the same time).
Using the official docker image with pytorch and torchvision built from the latest commits (217b37c023d and 7ba3d7e202 respectively), I still get the `Error Code 1: Cask (shader run failed)` error when performing int8 calibration, and the number of detected outputs is incorrect for int8 calibration.
Looks like a CUDA conflict between PyTorch and TensorRT. We have a PyTorch docker image on NGC; could you please try that? Thanks!
That doesn't appear to make any difference.
I've put together a repro script (int8_calibration_bug.py) that causes the `Cask (shader run failed)` error in the PyTorch docker image from NGC.
Unlike the full model, though, the correct number of outputs is detected in this case, so that might be a separate issue.
> The int8 calibration example works correctly both in and out of the docker image (ignore the timing values as I was compiling code at the same time).
If Polygraphy works well, then this looks like a bug in your code. Actually, you can use Polygraphy to do the calibration and build the engine. Could you please try this?
> If Polygraphy works well, then this looks like a bug in your code. Actually, you can use Polygraphy to do the calibration and build the engine. Could you please try this?
I will try this soon; some more pressing matters came up in the meantime.
Did you try running the repro script I posted above that causes the `Cask (shader run failed)` error? It doesn't trigger the incorrect-number-of-outputs issue, but it does cause that error, so it's something you should be able to test locally.
> If Polygraphy works well, then this looks like a bug in your code. Actually, you can use Polygraphy to do the calibration and build the engine. Could you please try this?
Based on the filename where the error occurs (softMaxV2Runner.cpp), I expect that this error originates from the ONNX equivalent of the `torch.nn.functional.softmax` call in the region proposal section of the network (in `fwd_rpn` in the repro script).
An even simpler repro script: int8_calibration_bug_simpler_repro.py, run with `python -- int8_calibration_bug_simpler_repro.py -q`.
Here the model (which uses nonsensical data, but represents part of the region proposal section of a full RPN-style network) is simply:
```python
import torch
import torchvision

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.input_size = [3, 64, 64]

    def forward(self, x):
        N = x.shape[0]
        scores = x.reshape(N, -1, 2, 2)[:, :, 0]
        boxes = x.reshape(N, -1, 4)
        prediction_mask = [torch.argmax(score, dim=1) for score in scores.unbind(dim=0)]
        # Select the foreground boxes as proposals.
        proposals = [box[index] for box, index in zip(boxes.unbind(dim=0), prediction_mask)]
        # Get their normalised scores.
        scores = [score[mask, 0] for score, mask in zip(torch.nn.functional.softmax(scores, dim=-1).unbind(dim=0), prediction_mask)]
        # Perform non-maximum suppression on each item in the batch, keeping only the top 20 from each.
        keeps = [torchvision.ops.nms(boxes, scores, iou_threshold=0.7)[:20] for boxes, scores in zip(proposals, scores)]
        # Commenting out this line prevents the "[softMaxV2Runner.cpp::execute::226] Error Code 1: Cask (shader run failed)" error.
        boxes = [proposal[keep] for proposal, keep in zip(proposals, keeps)]
        return x, scores, boxes
```
The `[softMaxV2Runner.cpp::execute::226] Error Code 1: Cask (shader run failed)` error occurs due to the final `boxes = ...` line of the `forward` function.
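For completeness, a minimal sketch of how such a model would be exported to ONNX before engine building (the batch size, file path, and opset here are my assumptions for illustration, not values taken from the repro script):

```python
# Hypothetical export step for MyModel above; batch size, path, and opset
# are assumptions for illustration.
import torch

model = MyModel().eval()
dummy = torch.randn(2, 3, 64, 64)  # matches self.input_size, with batch size 2
torch.onnx.export(
    model, dummy, "/tmp/int8_calibration_bug_simpler_repro.onnx",
    input_names=["input"], opset_version=17,
)
```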
I've seen a similar issue recently; what's your GPU driver version?
I can reproduce the issue with int8_calibration_bug_simpler_repro.py. I'll file an internal bug to track this later.
> I've seen a similar issue recently; what's your GPU driver version?
The driver version has been updated since the original post, which was `535.104.05`. It's currently `535.113.01`:
```
$ sudo inxi --graphics
Graphics:
  Device-1: NVIDIA AD102 [GeForce RTX 4090] driver: nvidia v: 535.113.01
  Device-2: Logitech HD Pro Webcam C920 driver: snd-usb-audio,uvcvideo
    type: USB
  Display: server: X.Org v: 1.21.1.7 with: Xwayland v: 23.2.0 driver: X:
    loaded: nvidia unloaded: fbdev,modesetting,nouveau,vesa
    gpu: nvidia,nvidia-nvswitch resolution: 1: 3840x2160~60Hz
    2: 3840x2160~60Hz 3: 3840x2160~60Hz 4: 3840x2160~60Hz
  API: OpenGL v: 4.6.0 NVIDIA 535.113.01 renderer: NVIDIA GeForce RTX
    4090/PCIe/SSE2
```
Thanks, I've filed internal bug 4340507 to track this. Sorry about the delayed response; I've been quite busy with other things these days :-)
Issue is fixed in TRT 10. Closed.
> `[E] 1: [softMaxV2Runner.cpp::execute::226] Error Code 1: Cask (shader run failed)`
This error usually means the binary was not compiled with the right SM arch.
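If that's the suspicion, a quick way to compare the GPU's compute capability against the arch list a PyTorch binary was built for (a sketch; an RTX 4090 reports sm_89):

```python
import torch

# Compute capability of the first visible GPU, e.g. (8, 9) -> sm_89 on an RTX 4090.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0: sm_{major}{minor}")

# Arch list this PyTorch binary was compiled for, e.g. ['sm_80', 'sm_86', ...].
print("PyTorch arch list:", torch.cuda.get_arch_list())
```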
Description
When trying to convert a PyTorch model to a TensorRT engine, int8 calibration fails with `[E] 1: [softMaxV2Runner.cpp::execute::226] Error Code 1: Cask (shader run failed)`, which I suspect is a result of the wrong number of output tensors being detected for the network.
Inspecting the ONNX model that the engine is being built from with Polygraphy correctly shows 1 input and 4 outputs:
```
brett@brett-home:~/Work/Autosensor/NN$ polygraphy inspect model /tmp/model-with-shapes.onnx
[I] Loading model: /tmp/model-with-shapes.onnx
[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 17

    ---- 1 Graph Input(s) ----
    {input [dtype=float32, shape=(1, 3, 512, 896)]}

    ---- 4 Graph Output(s) ----
    {scores [dtype=float32, shape=('Min(200, NonMaxSuppression_940_o0__d0)', 2)],
     boxes [dtype=float32, shape=('Min(200, NonMaxSuppression_940_o0__d0)', 4)],
     roi [dtype=float32, shape=('Min(200, NonMaxSuppression_940_o0__d0)', 5)],
     count [dtype=int64, shape=()]}

    ---- 163 Initializer(s) ----

    ---- 999 Node(s) ----
```

The model is proprietary, so I can't share it, and I don't currently have a minimal repro model; however, I can share the builder script (trt_builder.py), which is based on https://github.com/NVIDIA-AI-IOT/jetson_dla_tutorial#step-7. I've also checked that calibration with the script works if a toy model (with just a single convolution layer) is used, to verify that the `DatasetCalibrator` class works.
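For context, a `DatasetCalibrator` like the one in trt_builder.py typically follows TensorRT's Python calibrator pattern; a rough sketch of that pattern (my reconstruction under assumptions, not the actual class from the script) looks like:

```python
import tensorrt as trt
import torch

class DatasetCalibrator(trt.IInt8EntropyCalibrator2):
    """Sketch of an int8 calibrator fed from an iterable of input batches.
    Hypothetical reconstruction; the real class lives in trt_builder.py."""

    def __init__(self, example_input, dataset):
        super().__init__()
        self.batches = iter(dataset)
        # Device-side staging buffer that TensorRT reads calibration data from.
        self.buffer = torch.empty_like(example_input, device="cuda")

    def get_batch_size(self):
        return self.buffer.shape[0]

    def get_batch(self, names):
        try:
            self.buffer.copy_(next(self.batches))
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted
        return [int(self.buffer.data_ptr())]

    def read_calibration_cache(self):
        return None  # no cache: always calibrate from scratch

    def write_calibration_cache(self, cache):
        pass  # skip persisting the calibration cache in this sketch
```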
When building the TensorRT engine without quantization (i.e., without int8 calibration), the correct number of network outputs is detected and the engine is built successfully:

```
brett@brett-home:~/Work/Autosensor/NN$ python -- trt_builder.py "saved/RPN_ThunderNet2-activation:BN-ReLU-classes:2-input:3x512x896-complexity:0-statnett-0.5-2023-08-01"
/home/brett/Work/Autosensor/.direnv/python-venv-3.11.5/lib/python3.11/site-packages/torch/__init__.py:1418: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert condition, message
/home/brett/Work/Autosensor/.direnv/python-venv-3.11.5/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /tmp/build-via-sdist-9wtz2njt/torch-2.2.0a0+gitfaf3de3/aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[09/20/2023-16:59:23] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 537, GPU 4117 (MiB)
[09/20/2023-16:59:26] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1445, GPU +266, now: CPU 2058, GPU 4383 (MiB)
[09/20/2023-16:59:26] [TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/20/2023-16:59:26] [TRT] [W] Tensor DataType is determined at build time for tensors not marked as input or output.
[09/20/2023-16:59:26] [TRT] [W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[09/20/2023-16:59:26] [TRT] [I] Graph optimization time: 0.0266299 seconds.
[09/20/2023-16:59:26] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2080, GPU 4391 (MiB)
[09/20/2023-16:59:26] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2080, GPU 4401 (MiB)
[09/20/2023-16:59:26] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[09/20/2023-17:00:37] [TRT] [I] Detected 1 inputs and 4 output network tensors.
[09/20/2023-17:00:37] [TRT] [I] Total Host Persistent Memory: 453328
[09/20/2023-17:00:37] [TRT] [I] Total Device Persistent Memory: 18432
[09/20/2023-17:00:37] [TRT] [I] Total Scratch Memory: 1205248
[09/20/2023-17:00:37] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 7 MiB, GPU 105 MiB
[09/20/2023-17:00:37] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 178 steps to complete.
[09/20/2023-17:00:37] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 4.81692ms to assign 10 blocks to 178 nodes requiring 27747328 bytes.
[09/20/2023-17:00:37] [TRT] [I] Total Activation Memory: 27747328
[09/20/2023-17:00:37] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2330, GPU 4427 (MiB)
[09/20/2023-17:00:37] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2331, GPU 4437 (MiB)
[09/20/2023-17:00:37] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[09/20/2023-17:00:37] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[09/20/2023-17:00:37] [TRT] [W] Check verbose logs for the list of affected weights.
[09/20/2023-17:00:37] [TRT] [W] - 61 weights are affected by this issue: Detected subnormal FP16 values.
[09/20/2023-17:00:37] [TRT] [W] - 3 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[09/20/2023-17:00:37] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +6, GPU +10, now: CPU 6, GPU 10 (MiB)
```

and Polygraphy gives sensible output for the resulting engine:

```
brett@brett-home:~/Work/Autosensor/NN$ polygraphy inspect model /tmp/model.engine
[I] Loading bytes from /tmp/model.engine
[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine

    ---- 1 Engine Input(s) ----
    {input [dtype=float32, shape=(1, 3, 512, 896)]}

    ---- 4 Engine Output(s) ----
    {roi [dtype=float32, shape=(-1, 5)],
     scores [dtype=float32, shape=(-1, 2)],
     boxes [dtype=float32, shape=(-1, 4)],
     count [dtype=int32, shape=()]}

    ---- Memory ----
    Device Memory: 27747328 bytes

    ---- 1 Profile(s) (5 Tensor(s) Each) ----
    - Profile: 0
        Tensor: input (Input), Index: 0 | Shapes: min=(1, 3, 512, 896), opt=(1, 3, 512, 896), max=(1, 3, 512, 896)
        Tensor: roi (Output), Index: 1 | Shape: (-1, 5)
        Tensor: scores (Output), Index: 2 | Shape: (-1, 2)
        Tensor: boxes (Output), Index: 3 | Shape: (-1, 4)
        Tensor: count (Output), Index: 4 | Shape: ()

    ---- 215 Layer(s) ----
```
Environment
Modified `collect_env.py` script from PyTorch to include `tensorrt` in the pip packages:

```
brett@brett-home:~/Work/Autosensor/NN$ python collect_env.py
Collecting environment information...
PyTorch version: 2.2.0a0
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu Mantic Minotaur (development branch) (x86_64)
GCC version: (Ubuntu 13.2.0-4ubuntu1) 13.2.0
Clang version: 16.0.6 (15)
CMake version: version 3.27.4
Libc version: glibc-2.38

Python version: 3.11.5 (main, Aug 29 2023, 15:31:31) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.5.1-060501-generic-x86_64-with-glibc2.38
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 83%
CPU max MHz: 5800.0000
CPU min MHz: 800.0000
BogoMIPS: 5990.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 896 KiB (24 instances)
L1i cache: 1.3 MiB (24 instances)
L2 cache: 32 MiB (12 instances)
L3 cache: 36 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] numpy-quaternion==2022.4.3
[pip3] tensorrt==8.6.1.post1
[pip3] tensorrt-bindings==8.6.1
[pip3] tensorrt-libs==8.6.1
[pip3] torch==2.2.0a0+a683bc5
[pip3] torchaudio==2.0.2
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.17.0a0+4cb3d80
[pip3] triton==2.0.0
[conda] Could not collect
```