Open haijieg opened 1 week ago
u/int8_t and int16_t arrays are usually not aligned at 4-byte boundaries.
You can try forcing the appropriate alignment, for example by changing 'np.float16' to 'np.float32' in your affine_group_norm() and test_group_norm().
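One way to check the alignment suggestion before changing dtypes is to look at where the host buffers actually start. The snippet below is only a sketch: `misaligned_by` and `host_address` are hypothetical helper names, and it inspects host-side NumPy memory only. Whether host alignment has any bearing on the device-side error is not established here.

```python
import numpy as np


def misaligned_by(address, boundary):
    """Return how many bytes `address` sits past the previous `boundary`-byte mark."""
    return address % boundary


def host_address(arr):
    """Return the host memory address of the first element of a NumPy array."""
    return arr.ctypes.data


# Same constant shape as in the repro ([1, num_groups, 1, 1]); assumption: num_groups = 10.
shape = [1, 10, 1, 1]
for dtype in (np.float16, np.float32):
    arr = np.ones(shape, dtype=dtype)
    addr = host_address(arr)
    print(dtype.__name__,
          "offset from 4-byte boundary:", misaligned_by(addr, 4),
          "offset from 16-byte boundary:", misaligned_by(addr, 16))
```

An offset of 0 means the buffer starts on that boundary. The result depends on the allocator, so it is worth running on the machine that shows the failure.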
@lix19937 Here's a repro showing that even with np.float32 set for dummy_w, dummy_b, scale, and bias, it still fails with Error Code 1: Cuda Runtime (misaligned address).
import operator
from functools import reduce

import numpy as np
import tensorrt as trt


def affine_group_norm(network, x, num_groups, scale, bias, epsilon):
    ranks = len(x.shape)
    # Per-group constants of shape [1, num_groups, 1, ...] for the normalization layer.
    _shape = [1] * ranks
    _shape[1] = num_groups
    dummy_w = network.add_constant(_shape, np.ones(_shape, dtype=np.float32)).get_output(0)
    dummy_b = network.add_constant(_shape, np.zeros(_shape, dtype=np.float32)).get_output(0)
    # Normalize over every axis after the channel axis.
    axesMask = reduce(operator.or_, (1 << i for i in range(2, ranks)))
    norm_layer = network.add_normalization(x, dummy_w, dummy_b, axesMask=axesMask)
    norm_layer.num_groups = num_groups
    norm_layer.epsilon = epsilon
    output = norm_layer.get_output(0)
    # Apply the per-channel affine transform with a separate scale layer.
    power = np.ones_like(scale)
    scale_layer = network.add_scale(output,
                                    trt.ScaleMode.CHANNEL,
                                    shift=trt.Weights(bias),
                                    scale=trt.Weights(scale),
                                    power=trt.Weights(power))
    scale_layer.channel_axis = 1
    output = scale_layer.get_output(0)
    return output


def test_group_norm(num_groups, fp16):
    builder = trt.Builder(trt.Logger())
    config = builder.create_builder_config()
    if fp16:
        config.flags |= 1 << int(trt.BuilderFlag.FP16)
        config.flags |= 1 << int(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    shape = (2, 320, 64, 64)
    network = builder.create_network(0)
    x = network.add_input('x', trt.float16, shape)
    scale = np.random.randn(shape[1]).astype(np.float32)
    bias = np.random.randn(shape[1]).astype(np.float32)
    y = affine_group_norm(network, x, num_groups, scale, bias, 1e-5)
    network.mark_output(y)
    engine = builder.build_serialized_network(network, config)
    assert engine is not None
    print("pass")


test_group_norm(32, fp16=False)
test_group_norm(10, fp16=False)
test_group_norm(32, fp16=True)
test_group_norm(10, fp16=True)  # fails
Can you use the torch API to export an ONNX model of affine_group_norm, then convert it with trtexec? @haijieg
@lix19937 No, I want to directly control how the network is built, using the TRT network/builder API instead of going through ONNX. This is a minimal repro, without torch/onnx, that shows unexpected behavior in the TRT public API surface. I should not need onnx/torch to build a network correctly without it crashing with a CUDA misaligned address error.
Description
I believe this is a regression in 10.1 in the normalization layer. When building a network with group norm in FP16 mode, particular values of num_groups (apparently those that are not a multiple of 8) make the build fail with "Cuda Runtime (misaligned address)".
Environment
TensorRT Version: 10.1
NVIDIA GPU: RTX 3070
NVIDIA Driver Version: 535.129.03
CUDA Version: 12.1
CUDNN Version: NA
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.10
Baremetal or Container (if so, version): Baremetal
Relevant Files
Steps To Reproduce
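The repro only exercises num_groups values of 32 and 10, so the "not a multiple of 8" pattern is still a guess. A sweep over more values could confirm or refute it. The sketch below only generates the candidate values. It assumes num_groups must divide the channel count (320 in the repro) evenly, and `candidate_num_groups` and `split_by_multiple` are hypothetical helper names, not part of TensorRT.

```python
def candidate_num_groups(num_channels):
    """Return every group count that divides num_channels evenly."""
    return [g for g in range(1, num_channels + 1) if num_channels % g == 0]


def split_by_multiple(values, multiple=8):
    """Split values into (multiples of `multiple`, everything else)."""
    hits = [v for v in values if v % multiple == 0]
    misses = [v for v in values if v % multiple != 0]
    return hits, misses


# 320 is the channel count used in the repro's input shape (2, 320, 64, 64).
candidates = candidate_num_groups(320)
multiples_of_8, others = split_by_multiple(candidates, multiple=8)
print("multiples of 8:", multiples_of_8)
print("not multiples of 8:", others)
```

Each value can then be passed to test_group_norm(g, fp16=True) from the repro. A misaligned-address error may leave the CUDA context unusable, so running each value in a separate process is probably the safer choice.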