Open sriniiyer opened 2 months ago
Is there currently a way to use an MLP without applying the LayerNorm? What would be the best way to implement this? Thanks!

---

The simplest solution would be to manually construct an MLP out of multiple `te.Linear`s, but this won't be able to do all of the kernel fusions in `te.LayerNormMLP`.

Long-term, this kind of customization is the purpose of the operation-based API being developed in https://github.com/NVIDIA/TransformerEngine/pull/707:

```python
mlp = te.Sequential(
    te.ops.Linear(...),
    te.ops.GeLU(),
    te.ops.Linear(...),
)
```
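A minimal sketch of the manual-construction workaround mentioned above: a plain two-layer MLP built from `te.Linear` modules, with no LayerNorm applied. This is an illustration, not an official recipe; `transformer_engine.pytorch` requires a supported NVIDIA GPU, so the sketch falls back to `torch.nn.Linear` when Transformer Engine is not installed, and the layer sizes are made up for the example.

```python
import torch
import torch.nn.functional as F

# Prefer Transformer Engine's Linear (FP8-capable); fall back to
# torch.nn.Linear so the sketch also runs on CPU-only machines.
try:
    import transformer_engine.pytorch as te
    Linear = te.Linear
except ImportError:
    Linear = torch.nn.Linear

class MLP(torch.nn.Module):
    """MLP without the fused LayerNorm of te.LayerNormMLP.

    Note: composing modules like this skips the kernel fusions
    (e.g. fused GeLU) that te.LayerNormMLP performs internally.
    """
    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.fc1 = Linear(hidden_size, ffn_hidden_size)
        self.fc2 = Linear(ffn_hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))

# Illustrative sizes only.
mlp = MLP(hidden_size=16, ffn_hidden_size=64)
y = mlp(torch.randn(2, 16))
```

The trade-off is exactly the one noted in the reply: this keeps the module API familiar but gives up the fusions of `te.LayerNormMLP` until the operation-based API lands.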