aai-institute / continuiti

Learning function operators with neural networks.
GNU Lesser General Public License v3.0

Feature: Heterogeneous Normalized Attention #153

Open JakobEliasWagner opened 4 months ago

JakobEliasWagner commented 4 months ago


Description

This pull request implements the Heterogeneous Normalized Attention mechanism described by Hao et al., 2023.

The heterogeneous normalized attention block computes the attention scores in two steps:

  1. Normalize the query and key sequences:

$$\tilde{q}_i = \mathrm{Softmax}(q_i)$$

$$\tilde{k}_i = \mathrm{Softmax}(k_i)$$

  2. Calculate the attention output without a softmax over the scores:

$$z_t = \sum_i \frac{\tilde{q}_t \cdot \tilde{k}_i}{\sum_j \tilde{q}_t \cdot \tilde{k}_j}\, v_i$$

Since the denominator does not depend on $i$, the sums can be reassociated as $z_t = \frac{\tilde{q}_t \cdot \sum_i \tilde{k}_i v_i^\top}{\tilde{q}_t \cdot \sum_j \tilde{k}_j}$, so this implementation is linear with respect to the sequence length.
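The two steps above can be sketched as follows. This is an illustrative NumPy version, not the PR's actual code; function names and the single-head, unbatched shapes are assumptions for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hn_attention(q, k, v):
    """Heterogeneous normalized attention (single head, no batch).

    q, k: (seq, d) queries and keys; v: (seq, d_v) values.
    """
    q_t = softmax(q)              # normalize each query over features
    k_t = softmax(k)              # normalize each key over features
    # Reassociated (linear) form: O(seq * d * d_v) instead of O(seq^2).
    kv = k_t.T @ v                # (d, d_v), = sum_i k~_i v_i^T
    k_sum = k_t.sum(axis=0)       # (d,),    = sum_j k~_j
    num = q_t @ kv                # (seq, d_v)
    den = q_t @ k_sum             # (seq,), positive since softmax outputs are
    return num / den[:, None]
```

The reassociated form gives the same result as the quadratic formula above but never materializes the `(seq, seq)` score matrix.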

On top of the vanilla implementation suggested by Hao et al., we added a masking mechanism.
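The PR does not spell out the masking mechanism here; one plausible sketch (function name and mask convention are hypothetical) zeroes the normalized keys at padded positions, which removes them from both the numerator and the denominator of the attention formula:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hn_attention_masked(q, k, v, key_mask):
    """Heterogeneous normalized attention with padded keys masked out.

    key_mask: boolean (seq,), True marks valid key/value positions.
    (Illustrative sketch; not the PR's actual interface.)
    """
    q_t = softmax(q)
    # Zeroing a normalized key drops it from both sums below, so masked
    # positions contribute nothing to the attention output.
    k_t = softmax(k) * key_mask[:, None]
    kv = k_t.T @ v                 # (d, d_v)
    k_sum = k_t.sum(axis=0)        # (d,)
    return (q_t @ kv) / (q_t @ k_sum)[:, None]
```

Under this convention, masking the last positions is equivalent to running the unmasked attention on the truncated key/value sequences.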

Which issue does this PR tackle?

How does it solve the problem?

How are the changes tested?

Checklist for Contributors

Checklist for Reviewers