cmsflash / efficient-attention

An implementation of the efficient attention module.
https://arxiv.org/abs/1812.01243
MIT License

Can you apply masks in this attention model? #4

Closed rongcuid closed 2 years ago

rongcuid commented 3 years ago

In Seq2Seq models, it is common to apply a mask to remove padding, or to hide future inputs in a causal model. Is it possible to do so in efficient attention, given that it does not have a key-sequence-to-query-sequence mapping?

cmsflash commented 3 years ago

Hi Rongcui, thank you for your interest in our work. It would be pretty straightforward to mask padding: you just need to mask out the corresponding tokens in the queries, keys, and values.
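For illustration, here is a minimal sketch of that idea, assuming a sequence layout `(batch, seq_len, dim)`; the helper name and its arguments are hypothetical and not this repository's API:

```python
import torch
import torch.nn.functional as F

def efficient_attention_with_padding_mask(q, k, v, pad_mask):
    """q, k: (batch, n, d_k); v: (batch, n, d_v); pad_mask: (batch, n), True for real tokens."""
    q = F.softmax(q, dim=-1)                                    # rho_q: softmax over the feature dim
    k = k.masked_fill(~pad_mask.unsqueeze(-1), float("-inf"))   # exclude padded positions from rho_k
    k = F.softmax(k, dim=1)                                     # rho_k: softmax over the sequence dim
    v = v * pad_mask.unsqueeze(-1).to(v.dtype)                  # padded values contribute nothing
    context = torch.einsum("bnd,bne->bde", k, v)                # global context: (batch, d_k, d_v)
    return torch.einsum("bnd,bde->bne", q, context)             # every query reads the same context

# Example: batch of 2 sequences of length 5, with the first sequence padded after 3 tokens.
q = torch.randn(2, 5, 64); k = torch.randn(2, 5, 64); v = torch.randn(2, 5, 64)
pad_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
out = efficient_attention_with_padding_mask(q, k, v, pad_mask)  # (2, 5, 64)
```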

It would not be possible to do causal masking, though, as that requires a different receptive field for each token, while efficient attention relies on the assumption that all tokens share the same receptive field.
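To spell that out with the paper's factorization, efficient attention computes

```math
E(Q, K, V) = \rho_q(Q)\,\bigl(\rho_k(K)^{\top} V\bigr)
```

Because the global context $\rho_k(K)^{\top} V$ is a single $d_k \times d_v$ matrix shared by every query position, there is no per-query attention map onto which a causal (position-dependent) mask could be applied; causal attention would need a separate context built from only the keys and values at or before each position.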