Open andersonbcdefg opened 7 months ago
Can HyperAttention be used with a key_padding_mask to prevent padding tokens from being attended to in bidirectional attention? I understand this doesn't matter in the causal case, but it is important for BERT-like models.

In the current HyperAttention implementation, key padding masks aren't supported. I value your insight on the importance of incorporating them, especially for models like BERT. I'll be working on an update to seamlessly integrate support for key padding masks. Thanks for bringing this to my attention, and I'm excited to enhance the algorithm to better meet your needs.
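For context, here is how a key padding mask is conventionally applied in standard scaled dot-product attention: positions marked as padding are given a large negative score before the softmax, so they receive (near-)zero attention weight. This is a minimal NumPy sketch, not HyperAttention's API; the function name and signature are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, key_padding_mask):
    """Scaled dot-product attention with a key padding mask.

    key_padding_mask: bool array of shape (batch, len_k), True where the
    key position is padding and must not be attended to. (Illustrative
    sketch only -- not HyperAttention's actual interface.)
    """
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (batch, len_q, len_k)
    # Mask out padded key positions before the softmax.
    scores = np.where(key_padding_mask[:, None, :], -1e9, scores)
    weights = softmax(scores, axis=-1)               # padded keys -> ~0 weight
    return weights @ v, weights
```

For example, with a batch of one sequence where the last two of four key positions are padding, the attention weights on those positions come out effectively zero while each row still sums to one.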