A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Description

Add documentation for dot product attention.

Type of change

[x] Documentation change (change only to the documentation, either a fix or new content)
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

Changes

Add documentation, example scripts, and benchmarks for dot product attention. Discuss the supported backends and their differences, as well as the nuances of related features such as QKV layouts, mask types, and FP8 attention.
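For context, the operation being documented, scaled dot-product attention with an optional causal mask over tensors in a "bshd" (batch, sequence, heads, head dim) QKV layout, can be sketched in plain NumPy. This is an illustrative reference only, not Transformer Engine's fused backend code; the function name and layout choice here are assumptions made for the example.

```python
import numpy as np

def dot_product_attention(q, k, v, causal=True):
    """Reference scaled dot-product attention in "bshd" layout
    (batch, sequence, heads, head_dim). Illustrative sketch, not an
    optimized attention backend."""
    b, s, h, d = q.shape
    # Move heads before sequence: (b, h, s, d)
    q, k, v = (x.transpose(0, 2, 1, 3) for x in (q, k, v))
    # Attention scores, scaled by sqrt(head_dim): (b, h, s, s)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)
    if causal:
        # Causal mask: query position i may only attend to keys j <= i
        mask = np.triu(np.ones((s, s), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the key dimension
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v                   # (b, h, s, d)
    return out.transpose(0, 2, 1, 3)    # back to (b, s, h, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 3, 8))
k = rng.standard_normal((2, 4, 3, 8))
v = rng.standard_normal((2, 4, 3, 8))
out = dot_product_attention(q, k, v)
print(out.shape)  # (2, 4, 3, 8)
```

With the causal mask applied, perturbing the key/value at a later sequence position leaves the outputs at earlier positions unchanged, which is a quick way to sanity-check a mask implementation.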
Checklist: