SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022
MIT License

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` #50

Closed · yxchng closed this issue 2 years ago

yxchng commented 2 years ago
```
 ** On entry to SGEMM  parameter number 10 had an illegal value
Traceback (most recent call last):
  File "check_flops.py", line 34, in <module>
    model(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 620, in forward
    x = self.forward_features(x)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 616, in forward_features
    return self.features(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 434, in forward
    new_features = layer(features)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 392, in forward
    bottleneck_output = self.bottleneck_fn(prev_features)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 349, in bottleneck_fn
    bottleneck_output = self.block(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/models/natnext510_511.py", line 128, in forward
    x = self.attn(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/tiger/convnext/natten/nattencuda.py", line 121, in forward
    qkv = self.qkv(x).reshape(B, H, W, 3, self.num_heads, self.head_dim).permute(3, 0, 4, 1, 2, 5)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
```

My input shape is torch.Size([1, 256, 14, 14]). Why am I getting this error?

yxchng commented 2 years ago

Seems like it expects channels_last format. Will you support channels_first? Frequent permutation is quite a hit on performance.

alihassanijr commented 2 years ago

Thank you for your interest.

That's correct: the module expects channels-last inputs, similar to many other attention implementations (although those are typically flattened across the spatial axes into 3D tensors).

Since the model is not a CNN, and convolutions are the only operations that require a channels-first layout, it makes sense to keep inputs channels-last and avoid frequent permutations.

Some permutations and reshape operations are unavoidable because of the multi-head split. The current structure is optimized for speed: all linear projections expect channels-last, and our NA kernel expects inputs of shape Batch x Heads x Height x Width x Dim, which is again the most efficient layout for computing outputs.
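For illustration, here is a minimal sketch of preparing a channels-first tensor like the one in the report for the module. The import path and the `NeighborhoodAttention` constructor arguments are assumptions based on this repo's natten/nattencuda.py, so adjust them to your checkout:

```python
import torch

# Assumed import; this repo's CUDA extension lives in natten/nattencuda.py.
# from natten.nattencuda import NeighborhoodAttention

B, C, H, W = 1, 256, 14, 14
x_nchw = torch.randn(B, C, H, W)      # channels-first, as in the report above

# The NA module expects channels-last, i.e. Batch x Height x Width x Dim:
x_nhwc = x_nchw.permute(0, 2, 3, 1)   # -> (1, 14, 14, 256)

# attn = NeighborhoodAttention(dim=C, kernel_size=7, num_heads=8)  # hypothetical args
# out = attn(x_nhwc)                  # output stays (1, 14, 14, 256)
# out_nchw = out.permute(0, 3, 1, 2)  # only needed if a convolution follows
```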

To be clear, channels-last is usually the more efficient format, hence the movement towards channels-last integration since torch 1.11. That said, other factors also affect how much faster your operations get when switching to channels-last (to name a few: whether cuDNN kernels are called instead of naive ATen kernels, and architecture-specific kernels from NVIDIA packages).
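As a side note (a sketch for clarity, not part of the repository): PyTorch's channels_last memory format only changes strides, whereas an explicit permute changes the logical shape that the linear projections see. The module needs the latter:

```python
import torch

x = torch.randn(1, 256, 14, 14)

# Memory-format channels_last: logical shape stays NCHW, only the strides change.
x_cl = x.contiguous(memory_format=torch.channels_last)
print(x_cl.shape)     # torch.Size([1, 256, 14, 14])
print(x_cl.stride())  # (50176, 1, 3584, 256)

# Explicit permute: the logical shape becomes NHWC, which is what the
# channels-last linear projections in the attention module operate on.
x_nhwc = x.permute(0, 2, 3, 1)
print(x_nhwc.shape)   # torch.Size([1, 14, 14, 256])
```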

But thank you for bringing this to our attention; shape requirements should be included in our documentation, and we will update this in future versions.

alihassanijr commented 2 years ago

Closing this due to inactivity. If you still have questions feel free to open it back up.