Currently, our causal masks are aligned to the top-left corner of the softmax matrix, but in inference/KV caching, users often need to align them to the bottom-right corner. This PR adds two mask types, `causal_bottom_right` and `padding_causal_bottom_right`, to support this new alignment. The old mask types, `causal` and `padding_causal`, continue to denote the top-left alignment.
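The difference between the two alignments can be illustrated with a small pure-Python sketch (illustration only, not TE code). With a query of length 2 attending over a KV cache of length 4, top-left alignment lets the new queries see only the first few cached tokens, while bottom-right alignment lets the last query token attend to the last KV token, which is the usual KV-caching behavior:

```python
def causal_mask(seqlen_q, seqlen_kv, bottom_right=False):
    """Build a 0/1 causal mask (1 = attend). Illustration only, not TE code."""
    # Bottom-right alignment shifts the causal diagonal so that the last
    # query row lines up with the last KV column.
    offset = seqlen_kv - seqlen_q if bottom_right else 0
    return [[1 if j <= i + offset else 0 for j in range(seqlen_kv)]
            for i in range(seqlen_q)]

# Top-left alignment: queries only see the start of the KV sequence.
print(causal_mask(2, 4))                      # [[1, 0, 0, 0], [1, 1, 0, 0]]
# Bottom-right alignment: queries see the full cached history.
print(causal_mask(2, 4, bottom_right=True))   # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Note that when the query and KV sequence lengths are equal, the two alignments coincide, which is why the distinction only matters for cross-attention and KV caching.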
This PR also extracts and streamlines the utility function `get_attention_backend()` for backend availability checks. Users can call it with their model parameters and runtime environment to check which backends can support a particular set of inputs, and which backend TransformerEngine's internal logic will select.
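A toy sketch of the filter-then-select pattern such a utility follows. The function name, signature, and the check rules below are placeholders for illustration, not TransformerEngine's actual `get_attention_backend()` API or support matrix:

```python
def select_backend(params, checks, priority):
    """Return (available_backends, selected_backend).

    params   -- dict describing the model/runtime (e.g. mask type, head dim)
    checks   -- backend name -> list of predicates over params
    priority -- backend names in preference order

    Hypothetical sketch only; not TE's real selection logic.
    """
    available = [b for b in priority
                 if all(check(params) for check in checks.get(b, []))]
    return available, (available[0] if available else None)

# Toy availability rules for illustration only (not TE's real constraints):
checks = {
    "FlashAttention": [lambda p: p["head_dim"] <= 256],
    "FusedAttention": [lambda p: p["head_dim"] <= 128],
    "UnfusedDotProductAttention": [],  # fallback, always available
}
priority = ["FusedAttention", "FlashAttention", "UnfusedDotProductAttention"]

avail, picked = select_backend({"head_dim": 192}, checks, priority)
# avail == ["FlashAttention", "UnfusedDotProductAttention"]
# picked == "FlashAttention"
```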
To facilitate the addition of bottom-right causal masks, this PR also makes two other changes for decoders (PR #895):

- improves the `check_set_window_size` function to keep `window_size` consistent with `attn_mask_type`
- adds `enc_dec_attn_mask_type` and `enc_dec_window_size` parameters to `TransformerLayer` so that the encoder MHA call and the decoder MHA call are configured separately
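The `window_size`/`attn_mask_type` consistency rule can be sketched as follows. This is a simplified stand-in for the actual `check_set_window_size` logic, assuming TE's sliding-window convention of `window_size = (left, right)` with `-1` meaning unbounded; causal masks must not attend right of the diagonal, so the right bound is forced to 0:

```python
def check_set_window_size_sketch(attn_mask_type, window_size=None):
    """Simplified stand-in for TE's check_set_window_size (illustration only).

    window_size is (left, right); -1 means unbounded on that side.
    """
    if "causal" in attn_mask_type:
        if window_size is None:
            window_size = (-1, 0)              # full causal attention
        else:
            window_size = (window_size[0], 0)  # clamp right side to 0
    elif window_size is None:
        window_size = (-1, -1)                 # no mask, no window
    return window_size

# A sliding window passed with a causal mask keeps its left bound but
# has its right bound clamped to the diagonal:
check_set_window_size_sketch("padding_causal_bottom_right", (128, 128))
# -> (128, 0)
```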
Type of change
[ ] Documentation change (change only to the documentation, either a fix or a new content)
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Infra/Build change
[ ] Code refactor
Changes
Please list the changes introduced in this PR:
- Add bottom-right-diagonal support for causal masks at both the C and PyTorch levels
- Add utility function `get_attention_backend` to help with availability checks
- Add mask type and window size parameters for decoder use cases