fused_layer_norm_cuda.rms_forward_affine gives runtime error when run on cuda:1

**Describe the Bug**

When the input data is on cuda:1, `fused_layer_norm_cuda.rms_forward_affine` sometimes (not always) fails with the following error:

> RuntimeError: CUDA error: an illegal memory access was encountered
> CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

The error always occurs when the function is called from the `T5ForConditionalGeneration` model (see https://github.com/huggingface/transformers/issues/26323). It does not appear when cuda:0 is used.
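Since the failure only shows up on cuda:1, one possible explanation is that the kernel is launched on the current CUDA device (cuda:0 by default) rather than on the device that holds the input. This is an unverified hypothesis, not a confirmed root cause. The sketch below tests it by making the input's device current with `torch.cuda.device` before the call and synchronizing afterwards so that any asynchronous error surfaces inside the `try`. The function name `check_with_device_guard` and the random input are illustrative choices of mine; the script returns "skipped" when PyTorch, a second GPU, or the apex extension is missing.

```python
import importlib

try:
    import torch
except ImportError:  # lets the check report "skipped" instead of crashing
    torch = None


def check_with_device_guard(device_index=1):
    """Call rms_forward_affine with the input's device set as the current device.

    Returns "ok", "failed", or "skipped" (hypothetical helper, not part of apex).
    """
    if torch is None or not torch.cuda.is_available():
        return "skipped"
    if torch.cuda.device_count() <= device_index:
        return "skipped"
    try:
        ext = importlib.import_module("fused_layer_norm_cuda")
    except ImportError:
        return "skipped"

    device = torch.device("cuda", device_index)
    x = torch.randn(1, 4, 4, device=device)
    weight = torch.ones(4, device=device)

    try:
        # Assumption under test: the kernel runs on the *current* device,
        # so make the input's device current for the duration of the call.
        with torch.cuda.device(device):
            output, _invvar = ext.rms_forward_affine(x, torch.Size([4]), weight, 1e-6)
            # Force any asynchronously reported error to surface here.
            torch.cuda.synchronize(device)
    except RuntimeError:
        return "failed"
    return "ok"


print(check_with_device_guard())
```

If this consistently prints "ok" while the unguarded call keeps failing, that would point toward a missing device guard in the extension. A CUDA illegal memory access leaves the process's CUDA context unusable, so each attempt should be run in a fresh Python process.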

**Minimal Steps/Code to Reproduce the Bug**

The following code raises a RuntimeError about an illegal memory access roughly 40% of the time:

```Python
import importlib
import torch
from torch.nn import functional
from torch import cuda, tensor

device = 'cuda:1' if cuda.is_available() else 'cpu'

# Load the compiled apex extension directly.
global fused_layer_norm_cuda
fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")

input_ = tensor([[[-0.0000, -0.0000, -0.6244,  1.2299],
                  [ 0.6517,  0.6030, -0.7394, -0.8782],
                  [-0.2892, -0.5522,  0.3603, -0.4705],
                  [-0.2892, -0.5522,  0.3603, -0.0000]]]).to(device)
normalized_shape = torch.Size([4])
weight_ = tensor([1., 1., 1., 1.]).to(device)
weight_.requires_grad_()
eps = 1e-06

output, invvar = fused_layer_norm_cuda.rms_forward_affine(
    input_, normalized_shape, weight_, eps
)
# The error surfaces here, when the result is read back for printing.
print(output)
```

```python-traceback
Traceback (most recent call last):
  File "/home/qai/Text2SQL/kushdesh/generate sql statements/generate_sql_from_q/train_with_own_head/test2.py", line 22, in <module>
    print(output)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 430, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 669, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 600, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 352, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 137, in __init__
    nonzero_finite_vals = torch.masked_select(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

**Expected Behavior**

The code should run without error whether the device is cuda:0 or cuda:1.

**Environment**

PyTorch version: 2.1.0a0+29c30b1
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-33-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 535.113.01

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.1.0a0+29c30b1
[pip3] torch-tensorrt==2.0.0.dev0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.16.0a0
[pip3] triton==2.1.0+440fd1b
[conda] Could not collect
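Because the failure is intermittent, it may help to measure how often it happens rather than rely on the rough 40% estimate. The sketch below runs a command in several fresh processes and reports the fraction that exit with a non-zero status. The helper name `failure_rate` and the file name `repro.py` in the usage note are placeholders of mine, not part of apex or PyTorch. A fresh process per trial matters because a CUDA illegal memory access leaves the process's CUDA context unusable.

```python
import os
import subprocess
import sys


def failure_rate(cmd, trials=20, extra_env=None):
    """Run `cmd` in `trials` fresh processes and return the fraction that fail.

    A run counts as a failure when its exit code is non-zero, which is what
    an uncaught Python exception such as the RuntimeError produces.
    """
    # Copy the current environment and layer any overrides on top of it.
    env = dict(os.environ, **(extra_env or {}))
    failures = 0
    for _ in range(trials):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures / trials
```

With the reproduction script saved as `repro.py`, calling `failure_rate([sys.executable, "repro.py"], extra_env={"CUDA_LAUNCH_BLOCKING": "1"})` also applies the `CUDA_LAUNCH_BLOCKING=1` setting suggested in the error message, so a traceback captured from an individual run is more likely to point at the call that actually failed. Synchronous launches may change the timing, so the measured rate can differ from the rate without that variable.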
