Currently, our causal masks are aligned to the top-left corner of the softmax matrix, but in inference/KV caching, users often need to align them to the bottom-right corner. This PR adds two mask types, `causal_bottom_right` and `padding_causal_bottom_right`, to support this new alignment. The old mask types, `causal` and `padding_causal`, continue to denote the top-left alignment.
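The difference between the two alignments can be illustrated with a small pure-Python sketch (illustration only, not TE code). With a query of length 2 attending over a KV cache of length 4, top-left alignment lets the new queries see only the first few cached tokens, while bottom-right alignment lets the last query token attend to the last KV token, which is the usual KV-caching behavior:

```python
def causal_mask(seqlen_q, seqlen_kv, bottom_right=False):
    """Build a 0/1 causal mask (1 = attend). Illustration only, not TE code."""
    # Bottom-right alignment shifts the causal diagonal so that the last
    # query row lines up with the last KV column.
    offset = seqlen_kv - seqlen_q if bottom_right else 0
    return [[1 if j <= i + offset else 0 for j in range(seqlen_kv)]
            for i in range(seqlen_q)]

# Top-left alignment: queries only see the start of the KV sequence.
print(causal_mask(2, 4))                      # [[1, 0, 0, 0], [1, 1, 0, 0]]
# Bottom-right alignment: queries see the full cached history.
print(causal_mask(2, 4, bottom_right=True))   # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Note that when the query and KV sequence lengths are equal, the two alignments coincide, which is why the distinction only matters for cross-attention and KV caching.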
This PR also extracts and streamlines the utility function `get_attention_backend()` for backend availability checks. Users can call it with their model parameters and runtime environment to check which backends can support a particular set of inputs, and which backend TransformerEngine's internal logic will select.
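A toy sketch of the filter-then-select pattern such a utility follows. The function name, signature, and the check rules below are placeholders for illustration, not TransformerEngine's actual `get_attention_backend()` API or support matrix:

```python
def select_backend(params, checks, priority):
    """Return (available_backends, selected_backend).

    params   -- dict describing the model/runtime (e.g. mask type, head dim)
    checks   -- backend name -> list of predicates over params
    priority -- backend names in preference order

    Hypothetical sketch only; not TE's real selection logic.
    """
    available = [b for b in priority
                 if all(check(params) for check in checks.get(b, []))]
    return available, (available[0] if available else None)

# Toy availability rules for illustration only (not TE's real constraints):
checks = {
    "FlashAttention": [lambda p: p["head_dim"] <= 256],
    "FusedAttention": [lambda p: p["head_dim"] <= 128],
    "UnfusedDotProductAttention": [],  # fallback, always available
}
priority = ["FusedAttention", "FlashAttention", "UnfusedDotProductAttention"]

avail, picked = select_backend({"head_dim": 192}, checks, priority)
# avail == ["FlashAttention", "UnfusedDotProductAttention"]
# picked == "FlashAttention"
```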
To facilitate the addition of bottom-right causal masks, this PR also makes two other changes for decoders (PR #895):

- improves the `check_set_window_size` function to keep `window_size` consistent with `attn_mask_type`
- adds `enc_dec_attn_mask_type` and `enc_dec_window_size` parameters to `TransformerLayer` so that the encoder MHA call and the decoder MHA call are configured separately
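The `window_size`/`attn_mask_type` consistency rule can be sketched as follows. This is a simplified stand-in for the actual `check_set_window_size` logic, assuming TE's sliding-window convention of `window_size = (left, right)` with `-1` meaning unbounded; causal masks must not attend right of the diagonal, so the right bound is forced to 0:

```python
def check_set_window_size_sketch(attn_mask_type, window_size=None):
    """Simplified stand-in for TE's check_set_window_size (illustration only).

    window_size is (left, right); -1 means unbounded on that side.
    """
    if "causal" in attn_mask_type:
        if window_size is None:
            window_size = (-1, 0)              # full causal attention
        else:
            window_size = (window_size[0], 0)  # clamp right side to 0
    elif window_size is None:
        window_size = (-1, -1)                 # no mask, no window
    return window_size

# A sliding window passed with a causal mask keeps its left bound but
# has its right bound clamped to the diagonal:
check_set_window_size_sketch("padding_causal_bottom_right", (128, 128))
# -> (128, 0)
```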
Type of change
[ ] Documentation change (change only to the documentation, either a fix or a new content)
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Infra/Build change
[ ] Code refactor
Changes
Please list the changes introduced in this PR:
- Add bottom-right-diagonal support for causal masks at both the C and PyTorch levels
- Add utility function `get_attention_backend` to help with availability checks
- Add mask type and window size parameters for decoder use cases