Description
The optimized RMSNorm kernels access data with vectorized loads/stores, and can thus experience alignment issues if the data pointers are misaligned (e.g. they are views within a larger buffer). This PR falls back to unoptimized kernels (with element-wise memory accesses) if the data pointers are not aligned.
This is the same fix as https://github.com/NVIDIA/TransformerEngine/pull/490.
Type of change

- Bug fix (non-breaking change which fixes an issue)
Changes

- Fall back to non-vectorized RMSNorm kernels when data pointers are not aligned for vectorized memory accesses
Checklist: