amirzandieh / HyperAttention

Triton Implementation of HyperAttention Algorithm
Apache License 2.0
45 stars · 1 fork

How to use with padding? #3

Open andersonbcdefg opened 7 months ago

andersonbcdefg commented 7 months ago

Can HyperAttention be used with a key_padding_mask to prevent padding tokens from being attended to in bidirectional attention? I understand this doesn't matter in the causal case, but is important for BERT-like models.
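For context, here is what a `key_padding_mask` does in ordinary dense attention: positions marked as padding have their scores set to negative infinity before the softmax, so they receive zero attention weight. This is a minimal NumPy sketch of that semantics, not HyperAttention's Triton kernels; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def masked_attention(q, k, v, key_padding_mask):
    """Plain softmax attention with a key padding mask (illustrative sketch).

    q: (Lq, d), k/v: (Lk, d) arrays; key_padding_mask: (Lk,) bool array
    where True marks a padding key that must receive zero attention.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (Lq, Lk) scaled dot-product scores
    scores[:, key_padding_mask] = -np.inf  # padded keys can never be attended to
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
mask = np.array([False, False, True, True])  # last two tokens are padding
out = masked_attention(q, k, v, mask)
```

Masking this way is equivalent to simply dropping the padded keys and values before computing attention, which is why it matters for bidirectional (BERT-style) models but is subsumed by the causal mask for autoregressive ones.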

amirzandieh commented 7 months ago

In the current HyperAttention implementation, key padding masks aren't supported. I value your insight on the importance of incorporating them, especially for models like BERT. I'll be working on an update to seamlessly integrate support for key padding masks. Thanks for bringing this to my attention, and I'm excited to enhance the algorithm to better meet your needs.