NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[BUG] Assertion failed: t.data.dptr != nullptr. Input x is not allocated! #935

Open alexdremov opened 2 weeks ago

alexdremov commented 2 weeks ago

While running RMSNorm, I got the following exception:

/workspace/TransformerEngine/transformer_engine/common/transformer_engine.cpp:39 in function CheckInputTensor: Assertion failed: t.data.dptr != nullptr. Input x is not allocated!
  File "/usr/local/lib/python3.11/dist-packages/transformer_engine/pytorch/module/rmsnorm.py", line 50, in forward
    rmsnorm_out, rsigma = tex.rmsnorm_fwd(inputmat, rmsnorm_weight,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

However, all input tensors are clearly allocated; I verified this with a debugger.
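
For what it's worth, this is roughly the kind of check I ran from the Python side before the call (a sketch with synthetic shapes, not the exact failing tensors): the data pointer is non-null and the storage lives on the GPU.

import torch

# Synthetic tensor standing in for the RMSNorm input from my run
x = torch.randn([128, 128], dtype=torch.float32, device="cuda", requires_grad=True)

# A non-zero data pointer and CUDA-resident storage indicate the tensor
# is backed by allocated device memory.
assert x.data_ptr() != 0
assert x.is_cuda
print(x.data_ptr(), x.untyped_storage().nbytes())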

timmoon10 commented 2 weeks ago

Can you provide a minimal reproducer? The following runs for me:

import torch
import transformer_engine.pytorch as te

# Options
batch_size = 128
hidden_size = 128
dtype = torch.float32
device = torch.device("cuda")

# TE module
layer = te.RMSNorm(hidden_size, params_dtype=dtype, device=device)

# Synthetic data
x = torch.randn([batch_size, hidden_size], dtype=dtype, device=device, requires_grad=True)

# Forward and backward pass
y = layer(x)
y.sum().backward()
alexdremov commented 22 hours ago

> Can you provide a minimal reproducer? The following runs for me: […]

Hey! This appeared when I tried to use Fp8Tensor. I'll try to write a minimal example, but that could be rather hard.
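
To give a rough idea in the meantime, the shape of what I was doing looks something like the sketch below. This is not a verified reproducer: Float8Tensor.to_float8 and the DelayedScaling settings are my best recollection of the TE 1.x API I used, so adjust for your version.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format
from transformer_engine.pytorch.float8_tensor import Float8Tensor

batch_size = 128
hidden_size = 128
device = torch.device("cuda")

layer = te.RMSNorm(hidden_size, params_dtype=torch.float32, device=device)

# High-precision input cast to an FP8 tensor wrapper before the call
x_hp = torch.randn([batch_size, hidden_size], dtype=torch.float32, device=device)
x_fp8 = Float8Tensor.to_float8(x_hp)  # assumed constructor; may differ by TE version

recipe = DelayedScaling(fp8_format=Format.E4M3)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x_fp8)  # the "Input x is not allocated!" assertion fired around here for me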