A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
This PR changes the order of the gated activation call in LayerNormMLP so that it is consistent across all conditional branches. The order matters for activation checkpointing: a mis-ordering can cause drops in training accuracy for LLaMA.
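A minimal sketch of why this matters, assuming a SwiGLU-style gated MLP and PyTorch activation checkpointing; the `swiglu` helper and the `use_fp8` branch flag are hypothetical illustrations, not Transformer Engine's actual code:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def swiglu(x: torch.Tensor) -> torch.Tensor:
    # Gated activation: split the fc1 output into a gate half and a value half.
    gate, value = x.chunk(2, dim=-1)
    return F.silu(gate) * value

def mlp_forward(x, fc1_weight, fc2_weight, use_fp8: bool = False):
    h = F.linear(x, fc1_weight)
    if use_fp8:
        # Hypothetical low-precision branch: the gated activation must be
        # applied at the same point as in the default branch below. If the
        # branches order these ops differently, the forward recomputed during
        # checkpointing diverges from the original forward and gradients are
        # silently wrong.
        h = swiglu(h)
        h = h.to(torch.float16).to(h.dtype)  # stand-in for a low-precision cast
    else:
        h = swiglu(h)
    return F.linear(h, fc2_weight)

# With activation checkpointing the forward pass is re-run during backward,
# so any branch-dependent reordering changes the recomputed activations.
x = torch.randn(4, 32, requires_grad=True)
fc1 = torch.randn(64, 32, requires_grad=True)  # 2x hidden, for the gate/value split
fc2 = torch.randn(32, 32, requires_grad=True)
out = checkpoint(mlp_forward, x, fc1, fc2, use_reentrant=False)
out.sum().backward()
```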
Type of change
[ ] Documentation change (change only to the documentation, either a fix or new content)
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
Checklist: