This PR adds unit tests for the distributed `fused_layernorm_fp8_mlp` and `LayerNormMLP`. As a correctness check, the outputs of a run on multiple GPUs are compared against those of a single-GPU run.
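Below is a minimal sketch of that comparison strategy. It is not the PR's actual test code: `layernorm_mlp` is a plain-JAX stand-in for `LayerNormMLP` (not Transformer Engine's API), and all shapes and tolerances are made up for illustration.

```python
# Hypothetical sketch of the multi-GPU-vs-single-GPU correctness check.
# `layernorm_mlp` is a stand-in, not Transformer Engine's API.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

def layernorm_mlp(x, gamma, beta, w1, w2):
    """LayerNorm over the last axis followed by a two-layer GELU MLP."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    h = (x - mu) / jnp.sqrt(var + 1e-6) * gamma + beta
    return jax.nn.gelu(h @ w1) @ w2

kx, k1, k2 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(kx, (8, 128))            # (batch, hidden)
gamma, beta = jnp.ones(128), jnp.zeros(128)
w1 = 0.02 * jax.random.normal(k1, (128, 512))
w2 = 0.02 * jax.random.normal(k2, (512, 128))

# Single-device reference output.
ref = layernorm_mlp(x, gamma, beta, w1, w2)

# Data-parallel run: shard the batch across every visible device
# (assumes the batch size is divisible by the device count).
mesh = Mesh(np.asarray(jax.devices()), axis_names=("dp",))
x_dp = jax.device_put(x, NamedSharding(mesh, P("dp", None)))
out = jax.jit(layernorm_mlp)(x_dp, gamma, beta, w1, w2)

# Correctness check: distributed output must match the single-device run.
assert np.allclose(np.asarray(ref), np.asarray(out), atol=1e-5)
```

The same pattern extends to tensor-parallel sharding by partitioning the weight matrices instead of (or in addition to) the batch axis.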
## Type of change
- [ ] Documentation change (change only to the documentation, either a fix or new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
## Checklist: