getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Masked attention #203

Open gahdritz opened 2 years ago

gahdritz commented 2 years ago

In the MultiheadAttention implementation on the attention branch, attention masking is not implemented. Is that because it is difficult/impossible to do using KeOps? If that's not the case, how would one add that functionality?

In my particular use case, I also have another bias term, of shape [1, H, I, J] (where my attention logits are of shape [B, H, I, J] and B tends to be very large). Assuming that the attention logits have been computed using KeOps as a LazyTensor, is it possible to add this second bias term to the attention logits before the softmax and reduction?

jeanfeydy commented 2 years ago

Hi @gahdritz ,

Thanks for your question, and as in #204, apologies for the late answer.

With respect to masked attention: this is very doable, I just haven’t invested much time in the attention branch yet. (I am currently developing an open benchmarking platform for this kind of operation though, so there should be good progress in 2022!)

I assume that you are interested in “directional” masking, where queries “q_i” should only see keys “k_j” such that “i >= j”? If this is the case, you can implement it easily by introducing new LazyTensors for the token positions i and j and multiplying your LazyTensor with a conditional formula such as (i - j).step() (we will add the simpler syntax (i >= j) very soon…). For small sentences (<200 tokens), this should provide optimal run times. For larger sentences (e.g. >1,000 tokens), you may be interested in our documentation about block-sparsity masks to skip large blocks of computations where j >> i.
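For reference, here is a minimal sketch of this idea with the pykeops.torch LazyTensor API; the shapes (B, N, D), the 1e5 masking constant, and the choice of applying the mask as a large negative offset before the softmax reduction are illustrative assumptions rather than the only possible design:

```python
import torch
from pykeops.torch import LazyTensor

B, N, D = 8, 1024, 64  # batch size, sentence length, head dimension (illustrative)
q = torch.randn(B, N, D)
k = torch.randn(B, N, D)
v = torch.randn(B, N, D)

q_i = LazyTensor(q[:, :, None, :])  # (B, N, 1, D), indexed by i
k_j = LazyTensor(k[:, None, :, :])  # (B, 1, N, D), indexed by j
v_j = LazyTensor(v[:, None, :, :])  # (B, 1, N, D), indexed by j

# Token positions as extra symbolic variables:
pos = torch.arange(N, dtype=q.dtype).view(1, N, 1)
i_pos = LazyTensor(pos[:, :, None, :])  # (1, N, 1, 1), indexed by i
j_pos = LazyTensor(pos[:, None, :, :])  # (1, 1, N, 1), indexed by j

logits_ij = (q_i | k_j) / D ** 0.5  # symbolic (B, N, N) attention logits
mask_ij = (i_pos - j_pos).step()    # 1 if i >= j, 0 otherwise

# Push masked entries towards -inf before the softmax reduction:
logits_ij = logits_ij - 1e5 * (1 - mask_ij)

# Softmax-weighted sum over j, without materializing the (B, N, N) matrix:
out = logits_ij.sumsoftmaxweight(v_j, axis=2)  # (B, N, D)
```

Multiplying the exponentiated attention weights by mask_ij, as suggested above, achieves the same effect; the additive offset is just a convenient way to keep the sumsoftmaxweight reduction unchanged.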

As for bias terms: sure, there should be no problem, feel free to hack our demo for the attention layer any way you like. The KeOps math engine is specifically designed to enable this type of modification :-)

Best regards, Jean

gahdritz commented 2 years ago

Hi Jean,

I've had some trouble adding the second bias term, because it interacts with both symbolic dimensions I and J (and broadcasts along the batch dimension instead). Is it really possible to index both I and J at once? If so, how do I do that? KeOps doesn't let me construct a LazyTensor using that bias, for example.

jeanfeydy commented 2 years ago

Hi @gahdritz ,

Indeed, I hadn’t seen that your bias term was dense with respect to both I and J. If that’s the case, there may not be a good way of speeding things up with KeOps: the library is all about avoiding the storage and transfer of “I-by-J” variables.

A work-around could be to encode your bias term as a symbolic matrix defined using appropriate “I” and “J” variables. For instance, if B_ij is an (I, J) bias matrix with rank R, a singular value decomposition (SVD) will allow you to write it down as a product U @ V.T, where U is (I, R) and V is (J, R). Then, if u_i and v_j denote the rows of U and V, respectively, the dot product (u_i | v_j) defines a valid KeOps formula that computes the value of B_ij and can be added to the standard formula for the attention logits. (Note that this would work with any number of batch dimensions or attention heads.)
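As a rough sketch of this factorisation (the shapes, the rank R, and the use of torch.linalg.svd are illustrative assumptions, not something fixed by the discussion above), this could look like:

```python
import torch
from pykeops.torch import LazyTensor

# Hypothetical setup: an (I, J) bias matrix approximated with rank R.
I, J, R = 2048, 2048, 16
bias = torch.randn(I, J)

# Truncated SVD: bias ≈ U @ V.T, with U of shape (I, R) and V of shape (J, R).
U_full, S, Vh = torch.linalg.svd(bias, full_matrices=False)
U = U_full[:, :R] * S[:R]  # fold the singular values into U
V = Vh[:R, :].T            # (J, R)

# Rows u_i and v_j become symbolic KeOps variables (one dummy batch dim here):
u_i = LazyTensor(U.view(1, I, 1, R))
v_j = LazyTensor(V.view(1, 1, J, R))

# (u_i | v_j) reproduces bias[i, j] (up to the rank-R truncation) and can be
# added to the symbolic attention logits before the softmax reduction, e.g.:
#   logits_ij = (q_i | k_j) / D ** 0.5 + (u_i | v_j)
bias_ij = u_i | v_j
```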

If the rank R is smaller than 32, this could be very efficient… But obviously, depending on your application and the properties of your bias matrix, such a strategy may not be tractable: what do you think?

Best regards, Jean

gahdritz commented 2 years ago

Unfortunately, in my use case, the rank is in the 100s or even 1000s. Thanks anyway for the help!

ordabayevy commented 9 months ago

Hi @jeanfeydy, I have a few follow-up questions on the topic of attention layers:

Is it possible to implement a Dropout layer in PyKeOps?

> we will add the simpler syntax (i >= j) very soon…

Has this been implemented? I couldn't find it in the docs.