**Describe the Bug**
When the input data is on `cuda:1`, `fused_layer_norm_cuda.rms_forward_affine` intermittently (not on every call) fails with the following error:
RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
The error always occurs when the function is called from the `T5ForConditionalGeneration` model (see https://github.com/huggingface/transformers/issues/26323).
It does not occur when `cuda:0` is used.
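Since the failure depends on which GPU holds the tensors, one diagnostic worth trying is to make the input's device the current CUDA device around the call. The sketch below is only that: a check under the unverified assumption that the kernel may be launched on the current device (`cuda:0` by default) instead of the input's device. The helper names `current_device_guard` and `call_on_input_device` are hypothetical and not part of any library; `torch.cuda.device` is the standard PyTorch context manager.

```python
import contextlib

try:
    import torch
except ImportError:  # keeps the sketch importable on a machine without PyTorch
    torch = None


def current_device_guard(tensor):
    """Return a context manager that makes `tensor`'s GPU the current CUDA device.

    Falls back to a no-op context for non-CUDA inputs or when PyTorch is absent.
    """
    if torch is not None and getattr(tensor, "is_cuda", False):
        return torch.cuda.device(tensor.device)
    return contextlib.nullcontext()


def call_on_input_device(fn, input_, *args):
    """Run fn(input_, *args) while input_'s device is the current device."""
    with current_device_guard(input_):
        return fn(input_, *args)
```

With the reproduction script's variables, the call would be `call_on_input_device(fused_layer_norm_cuda.rms_forward_affine, input_, normalized_shape, weight_, eps)`. If the error stops appearing on `cuda:1` with this guard, that would point toward a device-selection problem in the extension, but it would not confirm it.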
**Minimal Steps/Code to Reproduce the Bug**
The following code intermittently (in roughly 40% of runs) raises a RuntimeError about an illegal memory access:
```Python
import importlib

import torch
from torch import cuda, tensor

# Use the second GPU; the error does not appear on cuda:0.
device = 'cuda:1' if cuda.is_available() else 'cpu'

# Load the compiled extension that provides the fused RMS norm kernel.
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")

# Input of shape (1, 4, 4), normalized over the last dimension.
input_ = tensor([[[-0.0000, -0.0000, -0.6244,  1.2299],
                  [ 0.6517,  0.6030, -0.7394, -0.8782],
                  [-0.2892, -0.5522,  0.3603, -0.4705],
                  [-0.2892, -0.5522,  0.3603, -0.0000]]]).to(device)
normalized_shape = torch.Size([4])
weight_ = tensor([1., 1., 1., 1.]).to(device)
weight_.requires_grad_()
eps = 1e-06

output, invvar = fused_layer_norm_cuda.rms_forward_affine(
    input_, normalized_shape, weight_, eps
)

# Printing forces a device-to-host copy, which is where the error surfaces.
print(output)
```
```python-traceback
Traceback (most recent call last):
  File "/home/qai/Text2SQL/kushdesh/generate sql statements/generate_sql_from_q/train_with_own_head/test2.py", line 22, in <module>
    print(output)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 430, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 669, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 600, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 352, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 137, in __init__
    nonzero_finite_vals = torch.masked_select(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
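Because the error is reported asynchronously, the traceback above points at `print(output)` and not at the kernel that faulted. A minimal sketch of how the failure could be localized, following the hint in the error message: set `CUDA_LAUNCH_BLOCKING=1` before any CUDA work happens and synchronize right after the suspect call. The helper name `run_and_sync` is hypothetical; `torch.cuda.synchronize` and `torch.cuda.device_count` are standard PyTorch calls.

```python
import os

# Set this before any CUDA work happens (here: before importing torch),
# or export it in the shell before starting Python.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:  # keeps the sketch importable on a machine without PyTorch
    torch = None


def run_and_sync(fn, *args):
    """Call fn(*args), then wait for queued kernels on every visible GPU.

    A pending kernel fault is then raised here, next to the call that
    caused it, instead of at a later unrelated API call.
    """
    out = fn(*args)
    if torch is not None and torch.cuda.is_available():
        for idx in range(torch.cuda.device_count()):
            torch.cuda.synchronize(idx)
    return out
```

In the reproduction script this would wrap the failing call as `run_and_sync(fused_layer_norm_cuda.rms_forward_affine, input_, normalized_shape, weight_, eps)`, so that the RuntimeError, if it occurs, is raised at that line.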
**Expected Behavior**
The code should run without error whether the device is `cuda:0` or `cuda:1`.
**Environment**
PyTorch version: 2.1.0a0+29c30b1
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-33-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
Nvidia driver version: 535.113.01
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.1.0a0+29c30b1
[pip3] torch-tensorrt==2.0.0.dev0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.16.0a0
[pip3] triton==2.1.0+440fd1b
[conda] Could not collect