A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
I am currently attempting to port a llama-like model architecture from pure PyTorch to TransformerEngine's PyTorch classes. However, I have been unable to obtain identical results in certain cases.
```python
from transformer_engine.pytorch import (
    Linear as LinearTE,
    RMSNorm as RMSNormTE,
    LayerNormMLP,
    LayerNormLinear,
    TransformerLayer,
)
from torch import nn, Tensor
import torch.nn.functional as F
import torch
```
What works
Linear
Linear layers are precisely accurate:
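A minimal sketch of the kind of check I mean (the dimensions, seed, and dtype are illustrative, and the `params_dtype`/`device` kwargs follow my reading of the TE module API):

```python
# Sketch of the Linear parity check; dimensions/dtype are illustrative.
torch.manual_seed(0)
d_model = 512
x = torch.randn(4, 16, d_model, device="cuda", dtype=torch.bfloat16)

lin_pt = nn.Linear(d_model, d_model, bias=False, device="cuda", dtype=torch.bfloat16)
lin_te = LinearTE(d_model, d_model, bias=False, params_dtype=torch.bfloat16, device="cuda")

with torch.no_grad():
    lin_te.weight.copy_(lin_pt.weight)  # give both layers the same weights

# Outside of fp8 autocast, the two layers agree exactly.
print((lin_pt(x) - lin_te(x)).abs().sum())  # zero difference
```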
RMSNorm
Seems accurate after this PR
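A sketch of the same kind of comparison for RMSNorm; the reference is a llama-style norm computed in float32, `RMSNormRef` is just an illustrative name, and the TE kwargs again follow my reading of the API:

```python
# Reference llama-style RMSNorm (illustrative, not my exact class).
class RMSNormRef(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: Tensor) -> Tensor:
        # Compute the root-mean-square in float32, then cast back and scale.
        var = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight

norm_ref = RMSNormRef(d_model).to("cuda", torch.bfloat16)
norm_te = RMSNormTE(d_model, eps=1e-5, params_dtype=torch.bfloat16, device="cuda")

with torch.no_grad():
    norm_te.weight.copy_(norm_ref.weight)

print((norm_ref(x) - norm_te(x)).abs().sum())  # accurate after the PR mentioned above
```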
What seems different
LayerNormMLP
Consider this simple implementation of an MLP with RMSNorm:
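Something along these lines (a sketch rather than my exact code; class and attribute names are illustrative, and it reuses `RMSNormRef` from the sketch above):

```python
class MLP(nn.Module):
    """Llama-style block: RMSNorm, then SwiGLU (gate/up projections), then down projection."""

    def __init__(self, dim: int, hidden_dim: int, eps: float = 1e-5):
        super().__init__()
        self.norm = RMSNormRef(dim, eps=eps)
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: Tensor) -> Tensor:
        h = self.norm(x)
        return self.down_proj(F.silu(self.gate_proj(h)) * self.up_proj(h))
```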
My understanding of LayerNormMLP's implementation of SwiGLU is that it keeps the gate-proj and up-proj weights fused in `fc1`. So I try to mimic this in MLP2 by copying the weights (see the sketch below). When I do this, the results are not identical.
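Roughly, the comparison I have in mind is the sketch below. Instead of my exact MLP2 it splits LayerNormMLP's fused `fc1` back into the unfused `MLP` above; the `fc1_weight`/`fc2_weight`/`layer_norm_weight` parameter names, the constructor kwargs, and the gate-first split of `fc1` are my assumptions about the TE module, and the sizes are illustrative.

```python
hidden_dim = 4 * d_model  # illustrative FFN size

mlp_te = LayerNormMLP(
    d_model,
    hidden_dim,
    normalization="RMSNorm",
    activation="swiglu",
    bias=False,
    params_dtype=torch.bfloat16,
    device="cuda",
)
mlp_ref = MLP(d_model, hidden_dim).to("cuda", torch.bfloat16)

with torch.no_grad():
    # Assumption: fc1_weight holds [gate_proj; up_proj] stacked along the output dimension.
    gate_w, up_w = mlp_te.fc1_weight.chunk(2, dim=0)
    mlp_ref.gate_proj.weight.copy_(gate_w)
    mlp_ref.up_proj.weight.copy_(up_w)
    mlp_ref.down_proj.weight.copy_(mlp_te.fc2_weight)
    mlp_ref.norm.weight.copy_(mlp_te.layer_norm_weight)

print((mlp_ref(x) - mlp_te(x)).abs().sum())  # nonzero: the outputs are not identical
```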
I tried flipping the order of the gate/up weights, but this made it worse.
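That is, swapping which half of `fc1` is treated as the gate (same assumptions as the sketch above):

```python
with torch.no_grad():
    # Same copy as above, but with the two halves of fc1 swapped.
    up_w, gate_w = mlp_te.fc1_weight.chunk(2, dim=0)
    mlp_ref.gate_proj.weight.copy_(gate_w)
    mlp_ref.up_proj.weight.copy_(up_w)

print((mlp_ref(x) - mlp_te(x)).abs().sum())  # even larger difference
```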
I also tried using `mlp_te.set_activation_dtype(torch.bfloat16)`, but this seemed to have no effect.

Attention
I also experienced a similar total error of ~`tensor(1.3594, device='cuda:0')` versus a normal implementation of self-attention, but I would like to debug the LayerNormMLP difference first (a self-attention implementation would take a lot of space 😦).