
Reformer: The Efficient Transformer #29

Open 5g4s opened 1 year ago

5g4s commented 1 year ago

https://arxiv.org/abs/2001.04451

5g4s commented 1 year ago

We replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $\mathrm{O}\left(L^2\right)$ to $\mathrm{O}\left(L \log L\right)$, where $L$ is the length of the sequence.
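To make the asymptotics concrete, here is a quick back-of-the-envelope comparison (my own numbers, not the paper's benchmarks) of how many attention-score entries each complexity implies at a few sequence lengths:

```python
# Full attention scales as L^2; LSH attention as L*log(L).
import math

for L in (1_024, 16_384, 65_536):
    print(L, L**2, round(L * math.log2(L)))
# at L = 65_536: ~4.3e9 score entries vs. ~1.0e6, roughly a 4000x reduction
```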

5g4s commented 1 year ago

We are actually only interested in $\operatorname{softmax}\left(Q K^T\right)$. Since softmax is dominated by the largest elements, for each query $q_i$ we only need to focus on the keys in $K$ that are closest to $q_i$.
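To see this effect numerically, here is a small NumPy sketch (my own illustration, not from the paper): the few largest scores carry nearly all of the softmax mass, so restricting each query to its nearest keys barely changes the output.

```python
# Illustrate why softmax(QK^T) is dominated by the keys closest to each
# query: keeping only the top-k scores changes the result very little.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 1024
q = rng.normal(size=d)
K = rng.normal(size=(L, d))

scores = K @ q                      # one row of QK^T
full = np.exp(scores - scores.max())
full /= full.sum()                  # exact softmax over all L keys

k = 32                              # keep only the 32 highest-scoring keys
top = np.argsort(scores)[-k:]
sparse = np.zeros(L)
sparse[top] = np.exp(scores[top] - scores[top].max())
sparse /= sparse.sum()              # softmax restricted to the top-k keys

print(full[top].sum())              # softmax mass already carried by top-k
print(np.abs(full - sparse).sum())  # small L1 error from dropping the rest
```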

5g4s commented 1 year ago

The problem of finding nearest neighbors quickly in high-dimensional spaces can be solved by locality-sensitive hashing (LSH). A hashing scheme that assigns each vector $x$ to a hash $h\left(x\right)$ is called locality-sensitive if nearby vectors get the same hash with high probability and distant ones do not.
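The paper instantiates this with angular LSH: project $x$ with a random matrix $R$ of size $\left[d_k, b/2\right]$ and set $h(x)=\arg \max \left(\left[x R ;-x R\right]\right)$, so vectors are bucketed by the direction they point in. A minimal sketch of that scheme (variable names and sizes are my own):

```python
# Angular LSH: hash to the index of the largest coordinate of [xR; -xR].
# Vectors at a small angle land in the same bucket with high probability.
import numpy as np

def lsh_hash(x, R):
    # x: (d,) vector; R: (d, b/2) random projection for b buckets
    xR = x @ R
    return int(np.argmax(np.concatenate([xR, -xR])))

rng = np.random.default_rng(0)
d, n_buckets = 64, 16
R = rng.normal(size=(d, n_buckets // 2))

x = rng.normal(size=d)
y = x + 0.05 * rng.normal(size=d)   # a nearby vector
z = rng.normal(size=d)              # an unrelated vector

print(lsh_hash(x, R), lsh_hash(y, R), lsh_hash(z, R))
# x and y usually share a bucket; z usually does not
```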

5g4s commented 1 year ago

Two points $x$ and $y$ are unlikely to share the same hash buckets (above) for the three different angular hashes unless their spherical projections are close to one another (below).

5g4s commented 1 year ago

(a) Full attention does not take advantage of this sparsity. (b) Queries and keys have been sorted according to their hash bucket. Hash buckets in this formulation tend to be uneven in size, which makes it difficult to batch across buckets. (c) Queries are sorted by bucket number and, within each bucket, by sequence position; pairs from the same bucket then cluster near the diagonal.
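Here is a minimal sketch of that sort-and-chunk step (my own illustration; bucket assignments are assumed given, and after sorting the real model lets each chunk attend within itself and to the preceding chunk):

```python
# Sort positions by (bucket, position) so same-bucket items become
# contiguous, then split into equal-size chunks that batch easily.
import numpy as np

rng = np.random.default_rng(0)
L, n_buckets, chunk = 16, 4, 4
buckets = rng.integers(0, n_buckets, size=L)   # hash bucket per position

# sort by bucket first, by sequence position within a bucket
order = np.argsort(buckets * L + np.arange(L))
chunks = order.reshape(L // chunk, chunk)      # equal-size chunks to batch

for c in chunks:
    print(c, buckets[c])   # same-bucket positions cluster inside chunks
```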