alexdremov opened this issue 2 weeks ago
Can you provide a minimal reproducer? The following runs for me:
import torch
import transformer_engine.pytorch as te
# Options
batch_size = 128
hidden_size = 128
dtype = torch.float32
device = torch.device("cuda")
# TE module
layer = te.RMSNorm(hidden_size, params_dtype=dtype, device=device)
# Synthetic data
x = torch.randn([batch_size, hidden_size], dtype=dtype, device=device, requires_grad=True)
# Forward and backward pass
y = layer(x)
y.sum().backward()
Hey!
This appeared when I tried to use Fp8Tensor. I'll try to write a minimal example, but it could be rather hard.
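For reference, a minimal sketch of what such a reproducer might look like, assuming the failure occurs when the layer runs under te.fp8_autocast. The thread doesn't show how the Fp8Tensor input was actually constructed, so the recipe and setup below are guesses:
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format
layer = te.RMSNorm(128, params_dtype=torch.float32, device=torch.device("cuda"))
x = torch.randn([128, 128], dtype=torch.float32, device="cuda", requires_grad=True)
# Hypothetical repro: run the forward pass under FP8 autocasting.
# DelayedScaling/Format are TE's standard recipe knobs; the exact
# configuration that triggers the error here is unknown.
recipe = DelayedScaling(fp8_format=Format.E4M3)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
y.sum().backward()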
While running RMSNorm, I got the following exception:
However, all input tensors are clearly allocated; I verified this with a debugger.
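For what it's worth, one quick way to sanity-check in a debugger that the input is a real, allocated CUDA tensor right before the call (plain PyTorch, nothing TE-specific):
# Run just before layer(x):
assert x.is_cuda, "input not on GPU"
assert x.data_ptr() != 0, "input has no backing storage"
print(x.shape, x.dtype, x.device, x.is_contiguous())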