A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
I am currently attempting to port a llama-like model architecture from pure PyTorch to TransformerEngine's PyTorch classes. However, I have been unable to obtain identical results in certain cases.
```python
from transformer_engine.pytorch import (
    Linear as LinearTE,
    RMSNorm as RMSNormTE,
    LayerNormMLP,
    LayerNormLinear,
    TransformerLayer,
)
from torch import nn, Tensor
import torch.nn.functional as F
import torch
```
What works
Linear
Linear layers are precisely accurate:
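A minimal sketch of the kind of check I mean (the dimensions, seed, and dtype are illustrative, and the `params_dtype`/`device` kwargs follow my reading of the TE module API):

```python
# Sketch of the Linear parity check; dimensions/dtype are illustrative.
torch.manual_seed(0)
d_model = 512
x = torch.randn(4, 16, d_model, device="cuda", dtype=torch.bfloat16)

lin_pt = nn.Linear(d_model, d_model, bias=False, device="cuda", dtype=torch.bfloat16)
lin_te = LinearTE(d_model, d_model, bias=False, params_dtype=torch.bfloat16, device="cuda")

with torch.no_grad():
    lin_te.weight.copy_(lin_pt.weight)  # give both layers the same weights

# Outside of fp8 autocast, the two layers agree exactly.
print((lin_pt(x) - lin_te(x)).abs().sum())  # zero difference
```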
RMSNorm
Seems accurate after this PR
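A sketch of the same kind of comparison for RMSNorm; the reference is a llama-style norm computed in float32, `RMSNormRef` is just an illustrative name, and the TE kwargs again follow my reading of the API:

```python
# Reference llama-style RMSNorm (illustrative, not my exact class).
class RMSNormRef(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: Tensor) -> Tensor:
        # Compute the root-mean-square in float32, then cast back and scale.
        var = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight

norm_ref = RMSNormRef(d_model).to("cuda", torch.bfloat16)
norm_te = RMSNormTE(d_model, eps=1e-5, params_dtype=torch.bfloat16, device="cuda")

with torch.no_grad():
    norm_te.weight.copy_(norm_ref.weight)

print((norm_ref(x) - norm_te(x)).abs().sum())  # accurate after the PR mentioned above
```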
What seems different
LayerNormMLP
Consider this simple implementation of an MLP with RMSNorm:
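Something along these lines (a sketch rather than my exact code; class and attribute names are illustrative, and it reuses `RMSNormRef` from the sketch above):

```python
class MLP(nn.Module):
    """Llama-style block: RMSNorm, then SwiGLU (gate/up projections), then down projection."""

    def __init__(self, dim: int, hidden_dim: int, eps: float = 1e-5):
        super().__init__()
        self.norm = RMSNormRef(dim, eps=eps)
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: Tensor) -> Tensor:
        h = self.norm(x)
        return self.down_proj(F.silu(self.gate_proj(h)) * self.up_proj(h))
```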
My understanding of LayerNormMLP's implementation of SwiGLU is that it keeps the gate-proj and up-proj weights fused in `fc1`. So I try to mimic this in MLP2 by copying the weights (see the sketch below). When I do this, the results are not identical.
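Roughly, the comparison I have in mind is the sketch below. Instead of my exact MLP2 it splits LayerNormMLP's fused `fc1` back into the unfused `MLP` above; the `fc1_weight`/`fc2_weight`/`layer_norm_weight` parameter names, the constructor kwargs, and the gate-first split of `fc1` are my assumptions about the TE module, and the sizes are illustrative.

```python
hidden_dim = 4 * d_model  # illustrative FFN size

mlp_te = LayerNormMLP(
    d_model,
    hidden_dim,
    normalization="RMSNorm",
    activation="swiglu",
    bias=False,
    params_dtype=torch.bfloat16,
    device="cuda",
)
mlp_ref = MLP(d_model, hidden_dim).to("cuda", torch.bfloat16)

with torch.no_grad():
    # Assumption: fc1_weight holds [gate_proj; up_proj] stacked along the output dimension.
    gate_w, up_w = mlp_te.fc1_weight.chunk(2, dim=0)
    mlp_ref.gate_proj.weight.copy_(gate_w)
    mlp_ref.up_proj.weight.copy_(up_w)
    mlp_ref.down_proj.weight.copy_(mlp_te.fc2_weight)
    mlp_ref.norm.weight.copy_(mlp_te.layer_norm_weight)

print((mlp_ref(x) - mlp_te(x)).abs().sum())  # nonzero: the outputs are not identical
```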
I tried flipping the order of the gate/up weights, but this made it worse.
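That is, swapping which half of `fc1` is treated as the gate (same assumptions as the sketch above):

```python
with torch.no_grad():
    # Same copy as above, but with the two halves of fc1 swapped.
    up_w, gate_w = mlp_te.fc1_weight.chunk(2, dim=0)
    mlp_ref.gate_proj.weight.copy_(gate_w)
    mlp_ref.up_proj.weight.copy_(up_w)

print((mlp_ref(x) - mlp_te(x)).abs().sum())  # even larger difference
```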
I also tried using `mlp_te.set_activation_dtype(torch.bfloat16)`, but this seemed to have no effect.

Attention
I also experienced a similar total error of ~`tensor(1.3594, device='cuda:0')` versus a normal implementation of self-attention, but I would like to debug the LayerNormMLP difference first (a self-attention implementation would take a lot of space 😦).