Misaligned address failure of TensorRT 10.5 when building engine with `trtexec` on RTX 2060 and RTX 2070 SUPER

kokostek commented 2 weeks ago

Probably dup of #3956. If that is the case - sorry for spamming, but anyway:

Description

We encountered a misaligned address error while trying to build an engine from an ONNX model.

By trial and error we narrowed the problem down to a single Conv2D layer with very specific parameters: more than one group and bias enabled. The `--fp16` option for `trtexec` also seems to be needed to trigger the error, i.e. the FP32 engine builds just fine.

The error message is as follows:

[10/02/2024-16:32:50] [E] Error[1]: [builderUtils.cpp::nvinfer1::builder::CommonRunnerProfiler::executeAndTimeIters::<lambda_d8116b4e82a61eaeb8dfd1b9ed18449d>::operator ()::928] Error Code 1: Cuda Runtime (misaligned address)

We were able to reproduce this on RTX 2060 and RTX 2070 SUPER. At the same time, RTX 3070 successfully produces an engine.
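For reference, the three cards mentioned above can be grouped by CUDA compute capability. This is only a sketch of the observation, not a root-cause claim: the value 7.5 for the RTX 2060 comes from the trtexec log in this report, while the values for the RTX 2070 SUPER and RTX 3070 are assumptions taken from NVIDIA's published compute capability tables. The helper name `outcomes_by_capability` is made up for illustration.

```python
from collections import defaultdict

# Compute capability per card. 7.5 for the RTX 2060 is printed in the trtexec
# log of this report; the other two values are assumed from NVIDIA's tables.
COMPUTE_CAPABILITY = {
    "RTX 2060": (7, 5),
    "RTX 2070 SUPER": (7, 5),
    "RTX 3070": (8, 6),
}

# Build outcome reported in this issue for the bias=True, groups=16, FP16 case.
OBSERVED = {
    "RTX 2060": "fails",
    "RTX 2070 SUPER": "fails",
    "RTX 3070": "builds",
}


def outcomes_by_capability(capabilities, observed):
    """Group the observed build outcomes by compute capability."""
    grouped = defaultdict(set)
    for gpu, outcome in observed.items():
        grouped[capabilities[gpu]].add(outcome)
    return dict(grouped)


if __name__ == "__main__":
    for capability, outcomes in sorted(outcomes_by_capability(COMPUTE_CAPABILITY, OBSERVED).items()):
        major, minor = capability
        print(f"compute capability {major}.{minor}: {', '.join(sorted(outcomes))}")
```

With only three data points this merely shows that the failures seen so far coincide with compute capability 7.5 (Turing); it says nothing about other architectures or TensorRT versions.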

Environment

TensorRT Version: 10.5.0

NVIDIA GPU: RTX 2060

NVIDIA Driver Version: 555.85

CUDA Version: 12.5.1

CUDNN Version: -

Operating System: Windows 10

Python Version (if applicable): 3.10.10

Tensorflow Version (if applicable): -

PyTorch Version (if applicable): 2.0.1

Baremetal or Container (if so, version): bare

Relevant Files

I've shared model files along with sources and logs in this repo: https://github.com/kokostek/TensorRT_Misaligned_Address

Steps To Reproduce

* Generate the ONNX files with this Python script:


import torch
import torch.onnx
from torch import nn


class SampleModel(nn.Module):
    """A single Conv2d layer; bias and groups are the parameters under test."""

    def __init__(self, *, bias: bool, groups: int):
        super().__init__()
        self.conv = nn.Conv2d(
            in_channels=256, out_channels=16,
            kernel_size=3, stride=1, padding=1,
            bias=bias, groups=groups)

    def forward(self, x):
        return self.conv(x)


def main():
    # Every combination of bias on/off and grouped/ungrouped convolution.
    configurations = [
        {'bias': False, 'groups': 1},
        {'bias': False, 'groups': 16},
        {'bias': True, 'groups': 1},
        {'bias': True, 'groups': 16},
    ]

    for args in configurations:
        model = SampleModel(**args)
        dummy_input = torch.randn(1, 256, 4, 4)
        onnx_file = f'bias={args["bias"]}_groups={args["groups"]}.onnx'

        # Export with a dynamic batch dimension on both input and output.
        torch.onnx.export(
            model, dummy_input, onnx_file,
            export_params=True,
            opset_version=11,
            do_constant_folding=True,
            input_names=['input'],
            output_names=['output'],
            dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
        )


if __name__ == '__main__':
    main()
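To make the four configurations concrete, here is a small sketch that computes the parameter shapes each one produces, using the standard PyTorch `Conv2d` layout of `(out_channels, in_channels // groups, kH, kW)` for the weight. The helper `conv2d_param_shapes` is a hypothetical name introduced only for this illustration; it does not touch TensorRT and makes no claim about which shape property triggers the failure.

```python
def conv2d_param_shapes(in_channels, out_channels, kernel_size, groups, bias):
    """Return (weight_shape, bias_shape) for a square-kernel Conv2d layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    weight_shape = (out_channels, in_channels // groups, kernel_size, kernel_size)
    bias_shape = (out_channels,) if bias else None
    return weight_shape, bias_shape


if __name__ == "__main__":
    # Same four configurations as in the export script.
    for use_bias, groups in [(False, 1), (False, 16), (True, 1), (True, 16)]:
        weight_shape, bias_shape = conv2d_param_shapes(256, 16, 3, groups, use_bias)
        print(f"bias={use_bias}, groups={groups}: weight={weight_shape}, bias={bias_shape}")
```

In the failing configuration (`bias=True`, `groups=16`) each group has 16 input channels and a single output channel.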


* For each model file try to build an engine:

trtexec --onnx=bias=False_groups=1.onnx --saveEngine=model.engine --minShapes=input:16x256x4x4 --optShapes=input:16x256x4x4 --maxShapes=input:16x256x4x4 --fp16
trtexec --onnx=bias=False_groups=16.onnx --saveEngine=model.engine --minShapes=input:16x256x4x4 --optShapes=input:16x256x4x4 --maxShapes=input:16x256x4x4 --fp16
trtexec --onnx=bias=True_groups=1.onnx --saveEngine=model.engine --minShapes=input:16x256x4x4 --optShapes=input:16x256x4x4 --maxShapes=input:16x256x4x4 --fp16
trtexec --onnx=bias=True_groups=16.onnx --saveEngine=model.engine --minShapes=input:16x256x4x4 --optShapes=input:16x256x4x4 --maxShapes=input:16x256x4x4 --fp16
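The four builds can also be driven from a short script. The sketch below only reuses the flags from the commands above. The helper names `build_command` and `run_all` are made up for illustration, and it assumes that `trtexec` is on `PATH` and exits with a nonzero status when the build fails.

```python
import shutil
import subprocess

SHAPE = "input:16x256x4x4"
# (bias, groups) pairs matching the exported ONNX file names.
CONFIGURATIONS = [(False, 1), (False, 16), (True, 1), (True, 16)]


def build_command(bias, groups, fp16=True):
    """Assemble the trtexec argument list for one exported model."""
    onnx_file = f"bias={bias}_groups={groups}.onnx"
    cmd = [
        "trtexec",
        f"--onnx={onnx_file}",
        "--saveEngine=model.engine",
        f"--minShapes={SHAPE}",
        f"--optShapes={SHAPE}",
        f"--maxShapes={SHAPE}",
    ]
    if fp16:
        cmd.append("--fp16")
    return cmd


def run_all():
    """Run every build and return {(bias, groups): exit code}."""
    results = {}
    for bias, groups in CONFIGURATIONS:
        completed = subprocess.run(build_command(bias, groups), capture_output=True, text=True)
        results[(bias, groups)] = completed.returncode
    return results


if __name__ == "__main__":
    if shutil.which("trtexec") is None:
        # Without trtexec installed, just show what would be executed.
        for bias, groups in CONFIGURATIONS:
            print(" ".join(build_command(bias, groups)))
    else:
        for config, code in run_all().items():
            print(config, "OK" if code == 0 else f"FAILED (exit code {code})")
```

Passing `fp16=False` to `build_command` drops the `--fp16` flag, which is a quick way to compare against the FP32 build that reportedly succeeds.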


* On RTX 2060 and 2070 SUPER the last build attempt (the one with `bias=True_groups=16.onnx`) should fail with this log:

&&&& RUNNING TensorRT.trtexec [TensorRT v100500] [b18] # trtexec.exe --onnx=bias=True_groups=16.onnx --saveEngine=model.engine --minShapes=input:16x256x4x4 --optShapes=input:16x256x4x4 --maxShapes=input:16x256x4x4 --fp16
[10/02/2024-16:32:48] [I] === Model Options ===
[10/02/2024-16:32:48] [I] Format: ONNX
[10/02/2024-16:32:48] [I] Model: bias=True_groups=16.onnx
[10/02/2024-16:32:48] [I] Output:
[10/02/2024-16:32:48] [I] === Build Options ===
[10/02/2024-16:32:48] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[10/02/2024-16:32:48] [I] avgTiming: 8
[10/02/2024-16:32:48] [I] Precision: FP32+FP16
[10/02/2024-16:32:48] [I] LayerPrecisions:
[10/02/2024-16:32:48] [I] Layer Device Types:
[10/02/2024-16:32:48] [I] Calibration:
[10/02/2024-16:32:48] [I] Refit: Disabled
[10/02/2024-16:32:48] [I] Strip weights: Disabled
[10/02/2024-16:32:48] [I] Version Compatible: Disabled
[10/02/2024-16:32:48] [I] ONNX Plugin InstanceNorm: Disabled
[10/02/2024-16:32:48] [I] TensorRT runtime: full
[10/02/2024-16:32:48] [I] Lean DLL Path:
[10/02/2024-16:32:48] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[10/02/2024-16:32:48] [I] Exclude Lean Runtime: Disabled
[10/02/2024-16:32:48] [I] Sparsity: Disabled
[10/02/2024-16:32:48] [I] Safe mode: Disabled
[10/02/2024-16:32:48] [I] Build DLA standalone loadable: Disabled
[10/02/2024-16:32:48] [I] Allow GPU fallback for DLA: Disabled
[10/02/2024-16:32:48] [I] DirectIO mode: Disabled
[10/02/2024-16:32:48] [I] Restricted mode: Disabled
[10/02/2024-16:32:48] [I] Skip inference: Disabled
[10/02/2024-16:32:48] [I] Save engine: model.engine
[10/02/2024-16:32:48] [I] Load engine:
[10/02/2024-16:32:48] [I] Profiling verbosity: 0
[10/02/2024-16:32:48] [I] Tactic sources: Using default tactic sources
[10/02/2024-16:32:48] [I] timingCacheMode: local
[10/02/2024-16:32:48] [I] timingCacheFile:
[10/02/2024-16:32:48] [I] Enable Compilation Cache: Enabled
[10/02/2024-16:32:48] [I] errorOnTimingCacheMiss: Disabled
[10/02/2024-16:32:48] [I] Preview Features: Use default preview flags.
[10/02/2024-16:32:48] [I] MaxAuxStreams: -1
[10/02/2024-16:32:48] [I] BuilderOptimizationLevel: -1
[10/02/2024-16:32:48] [I] MaxTactics: -1
[10/02/2024-16:32:48] [I] Calibration Profile Index: 0
[10/02/2024-16:32:48] [I] Weight Streaming: Disabled
[10/02/2024-16:32:48] [I] Runtime Platform: Same As Build
[10/02/2024-16:32:48] [I] Debug Tensors:
[10/02/2024-16:32:48] [I] Input(s)s format: fp32:CHW
[10/02/2024-16:32:48] [I] Output(s)s format: fp32:CHW
[10/02/2024-16:32:48] [I] Input build shape (profile 0): input=16x256x4x4+16x256x4x4+16x256x4x4
[10/02/2024-16:32:48] [I] Input calibration shapes: model
[10/02/2024-16:32:48] [I] === System Options ===
[10/02/2024-16:32:48] [I] Device: 0
[10/02/2024-16:32:48] [I] DLACore:
[10/02/2024-16:32:48] [I] Plugins:
[10/02/2024-16:32:48] [I] setPluginsToSerialize:
[10/02/2024-16:32:48] [I] dynamicPlugins:
[10/02/2024-16:32:48] [I] ignoreParsedPluginLibs: 0
[10/02/2024-16:32:48] [I]
[10/02/2024-16:32:48] [I] === Inference Options ===
[10/02/2024-16:32:48] [I] Batch: Explicit
[10/02/2024-16:32:48] [I] Input inference shape : input=16x256x4x4
[10/02/2024-16:32:48] [I] Iterations: 10
[10/02/2024-16:32:48] [I] Duration: 3s (+ 200ms warm up)
[10/02/2024-16:32:48] [I] Sleep time: 0ms
[10/02/2024-16:32:48] [I] Idle time: 0ms
[10/02/2024-16:32:48] [I] Inference Streams: 1
[10/02/2024-16:32:48] [I] ExposeDMA: Disabled
[10/02/2024-16:32:48] [I] Data transfers: Enabled
[10/02/2024-16:32:48] [I] Spin-wait: Disabled
[10/02/2024-16:32:48] [I] Multithreading: Disabled
[10/02/2024-16:32:48] [I] CUDA Graph: Disabled
[10/02/2024-16:32:48] [I] Separate profiling: Disabled
[10/02/2024-16:32:48] [I] Time Deserialize: Disabled
[10/02/2024-16:32:48] [I] Time Refit: Disabled
[10/02/2024-16:32:48] [I] NVTX verbosity: 0
[10/02/2024-16:32:48] [I] Persistent Cache Ratio: 0
[10/02/2024-16:32:48] [I] Optimization Profile Index: 0
[10/02/2024-16:32:48] [I] Weight Streaming Budget: 100.000000%
[10/02/2024-16:32:48] [I] Inputs:
[10/02/2024-16:32:48] [I] Debug Tensor Save Destinations:
[10/02/2024-16:32:48] [I] === Reporting Options ===
[10/02/2024-16:32:48] [I] Verbose: Disabled
[10/02/2024-16:32:48] [I] Averages: 10 inferences
[10/02/2024-16:32:48] [I] Percentiles: 90,95,99
[10/02/2024-16:32:48] [I] Dump refittable layers:Disabled
[10/02/2024-16:32:48] [I] Dump output: Disabled
[10/02/2024-16:32:48] [I] Profile: Disabled
[10/02/2024-16:32:48] [I] Export timing to JSON file:
[10/02/2024-16:32:48] [I] Export output to JSON file:
[10/02/2024-16:32:48] [I] Export profile to JSON file:
[10/02/2024-16:32:48] [I]
[10/02/2024-16:32:48] [I] === Device Information ===
[10/02/2024-16:32:48] [I] Available Devices:
[10/02/2024-16:32:48] [I] Device 0: "NVIDIA GeForce RTX 2060" UUID: GPU-4c0e9779-cf6e-e9e1-efe5-0f2749008685
[10/02/2024-16:32:48] [I] Selected Device: NVIDIA GeForce RTX 2060
[10/02/2024-16:32:48] [I] Selected Device ID: 0
[10/02/2024-16:32:48] [I] Selected Device UUID: GPU-4c0e9779-cf6e-e9e1-efe5-0f2749008685
[10/02/2024-16:32:48] [I] Compute Capability: 7.5
[10/02/2024-16:32:48] [I] SMs: 30
[10/02/2024-16:32:48] [I] Device Global Memory: 6143 MiB
[10/02/2024-16:32:48] [I] Shared Memory per SM: 64 KiB
[10/02/2024-16:32:48] [I] Memory Bus Width: 192 bits (ECC disabled)
[10/02/2024-16:32:48] [I] Application Compute Clock Rate: 1.2 GHz
[10/02/2024-16:32:48] [I] Application Memory Clock Rate: 7.001 GHz
[10/02/2024-16:32:48] [I]
[10/02/2024-16:32:48] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[10/02/2024-16:32:48] [I]
[10/02/2024-16:32:48] [I] TensorRT version: 10.5.0
[10/02/2024-16:32:48] [I] Loading standard plugins
[10/02/2024-16:32:48] [I] [TRT] [MemUsageChange] Init CUDA: CPU +6, GPU +0, now: CPU 3924, GPU 1036 (MiB)
[10/02/2024-16:32:50] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1376, GPU +190, now: CPU 5622, GPU 1226 (MiB)
[10/02/2024-16:32:50] [I] Start parsing network model.
[10/02/2024-16:32:50] [I] [TRT] ----------------------------------------------------------------
[10/02/2024-16:32:50] [I] [TRT] Input filename: bias=True_groups=16.onnx
[10/02/2024-16:32:50] [I] [TRT] ONNX IR version: 0.0.6
[10/02/2024-16:32:50] [I] [TRT] Opset version: 11
[10/02/2024-16:32:50] [I] [TRT] Producer name: pytorch
[10/02/2024-16:32:50] [I] [TRT] Producer version: 2.0.1
[10/02/2024-16:32:50] [I] [TRT] Domain:
[10/02/2024-16:32:50] [I] [TRT] Model version: 0
[10/02/2024-16:32:50] [I] [TRT] Doc string:
[10/02/2024-16:32:50] [I] [TRT] ----------------------------------------------------------------
[10/02/2024-16:32:50] [I] Finished parsing network model. Parse time: 0.0588976
[10/02/2024-16:32:50] [I] Set shape of input tensor input for optimization profile 0 to: MIN=16x256x4x4 OPT=16x256x4x4 MAX=16x256x4x4
[10/02/2024-16:32:50] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[10/02/2024-16:32:50] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[10/02/2024-16:32:50] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.

[10/02/2024-16:32:50] [E] Error[1]: [resizingAllocator.cpp::nvinfer1::internal::ResizingAllocator::deallocate::114] Error Code 1: Cuda Runtime (misaligned address)
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x84062d29cac28548 due to exception cudaEventElapsedTime
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x8cb7f21c884843f4 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x44f0ab120cdb95df due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0xb33dfebb05c33935 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x651002a8d73048a1 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x1679a8ed82d4c75d due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x92dd5701de28e44b due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x682cff76ba5f2886 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0xbe01036568ac5912 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003e8 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003ea due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003e8 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003ea due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003e8 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003ea due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003e8 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003ea due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception misaligned address
[10/02/2024-16:32:50] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[10/02/2024-16:32:50] [E] Error[1]: IBuilder::buildSerializedNetwork: Error Code 1: Cuda Runtime (no further information)
[10/02/2024-16:32:50] [E] Engine could not be created from network
[10/02/2024-16:32:50] [E] Building engine failed
[10/02/2024-16:32:50] [E] Failed to create engine from model or file.
[10/02/2024-16:32:50] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v100500] [b18] # trtexec.exe --onnx=bias=True_groups=16.onnx --saveEngine=model.engine --minShapes=input:16x256x4x4 --optShapes=input:16x256x4x4 --maxShapes=input:16x256x4x4 --fp16
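When comparing runs across GPUs or TensorRT versions, it can help to tally which tactics were skipped and why. The following is a minimal sketch that parses "Skipping tactic ... due to exception ..." lines in the form seen in the log above. It assumes one log entry per line, the name `summarize_skipped_tactics` is made up for illustration, and the sample text is a short excerpt of the log rather than the full output.

```python
import re
from collections import Counter

# Matches lines such as:
#   [..] [E] Error[9]: Error Code: 9: Skipping tactic 0x8cb7... due to exception misaligned address
SKIP_RE = re.compile(r"Skipping tactic (0x[0-9a-fA-F]+) due to exception (.+?)\s*$")


def summarize_skipped_tactics(log_text):
    """Return (Counter of reason -> count, list of (tactic, reason)) for a trtexec log."""
    skipped = []
    for line in log_text.splitlines():
        match = SKIP_RE.search(line)
        if match:
            skipped.append((match.group(1), match.group(2)))
    reasons = Counter(reason for _, reason in skipped)
    return reasons, skipped


# Excerpt of the failing build log from this report.
SAMPLE_LOG = """\
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x84062d29cac28548 due to exception cudaEventElapsedTime
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x8cb7f21c884843f4 due to exception misaligned address
[10/02/2024-16:32:50] [E] Error[9]: Error Code: 9: Skipping tactic 0x00000000000003e8 due to exception misaligned address
[10/02/2024-16:32:50] [I] [TRT] Detected 1 inputs and 1 output network tensors.
"""

if __name__ == "__main__":
    reasons, skipped = summarize_skipped_tactics(SAMPLE_LOG)
    for reason, count in reasons.most_common():
        print(f"{count} tactic(s) skipped: {reason}")
```

To analyse a real run, read the saved trtexec output from a file and pass its contents to `summarize_skipped_tactics` in place of `SAMPLE_LOG`.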

lix19937 commented 1 week ago

I tested with TensorRT 8.5 and 8.6, and both passed.

[10/05/2024-14:44:33] [I] === Performance summary ===
[10/05/2024-14:44:33] [I] Throughput: 9162.87 qps
[10/05/2024-14:44:33] [I] Latency: min = 0.0471191 ms, max = 0.431885 ms, mean = 0.0622692 ms, median = 0.0487061 ms, percentile(90%) = 0.0931396 ms, percentile(95%) = 0.109863 ms, percentile(99%) = 0.13501 ms
[10/05/2024-14:44:33] [I] Enqueue Time: min = 0.0124512 ms, max = 0.4375 ms, mean = 0.0198779 ms, median = 0.0175171 ms, percentile(90%) = 0.0211182 ms, percentile(95%) = 0.0256348 ms, percentile(99%) = 0.0980225 ms
[10/05/2024-14:44:33] [I] H2D Latency: min = 0.0231323 ms, max = 0.20166 ms, mean = 0.0298494 ms, median = 0.0236816 ms, percentile(90%) = 0.0453644 ms, percentile(95%) = 0.0466309 ms, percentile(99%) = 0.0515137 ms
[10/05/2024-14:44:33] [I] GPU Compute Time: min = 0.0202637 ms, max = 0.285645 ms, mean = 0.0252489 ms, median = 0.0212402 ms, percentile(90%) = 0.0408936 ms, percentile(95%) = 0.0439453 ms, percentile(99%) = 0.068634 ms
[10/05/2024-14:44:33] [I] D2H Latency: min = 0.00341797 ms, max = 0.31311 ms, mean = 0.00717106 ms, median = 0.00378418 ms, percentile(90%) = 0.0234375 ms, percentile(95%) = 0.0256348 ms, percentile(99%) = 0.0421143 ms
[10/05/2024-14:44:33] [I] Total Host Walltime: 3.00015 s
[10/05/2024-14:44:33] [I] Total GPU Compute Time: 0.694092 s
[10/05/2024-14:44:33] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[10/05/2024-14:44:33] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[10/05/2024-14:44:33] [W] * Throughput may be bound by host-to-device transfers for the inputs rather than GPU Compute and the GPU may be under-utilized.
[10/05/2024-14:44:33] [W]   Add --noDataTransfers flag to disable data transfers.
[10/05/2024-14:44:33] [W] * GPU compute time is unstable, with coefficient of variance = 42.7117%.
[10/05/2024-14:44:33] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/05/2024-14:44:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/05/2024-14:44:33] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8510] # trtexec --onnx=bias=True_groups=16.onnx --saveEngine=model.engine --minShapes=input:16x256x4x4 --optShapes=input:16x256x4x4 --maxShapes=input:16x256x4x4 --fp16

My machine/GPU info:

[10/05/2024-14:46:04] [I] === Device Information ===
[10/05/2024-14:46:04] [I] Selected Device: NVIDIA RTX 2000 Ada Generation Laptop GPU
[10/05/2024-14:46:04] [I] Compute Capability: 8.9
[10/05/2024-14:46:04] [I] SMs: 24
[10/05/2024-14:46:04] [I] Compute Clock Rate: 2.115 GHz
[10/05/2024-14:46:04] [I] Device Global Memory: 8187 MiB
[10/05/2024-14:46:04] [I] Shared Memory per SM: 100 KiB
[10/05/2024-14:46:04] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/05/2024-14:46:04] [I] Memory Clock Rate: 8.001 GHz
[10/05/2024-14:46:04] [I]
[10/05/2024-14:46:04] [I] TensorRT version: 8.5.10

Can you try another version of TensorRT?

A Cuda Runtime (misaligned address) error can have many causes, such as a bug in libcuda.so or in nvinfer.so.

kokostek commented 1 week ago

Originally, we encountered this failure on 10.2, but we switched to 10.5 to make this report. I guess we need to test different versions of CUDA then...

Also, I want to point out that we only have Windows machines at our disposal right now, so I don't know whether this reproduces on Linux.

UPD @lix19937 I've just noticed that your GPU is from a different generation than ours. As I mentioned in the original message, this bug only reproduces on the 2060 and 2070, and not on the 3070. So I guess this explains why your tests passed.

lix19937 commented 1 week ago

Can you try it on Linux? I don't have an RTX 2060/2070 GPU environment. @kokostek

kokostek commented 1 week ago

Yes, we could, but it will take a while.
