getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

[Example Proposal] Attention Layer #174

Open hypnopump opened 3 years ago

hypnopump commented 3 years ago

Hi there! I've just found this library, and it's so cool!

I've coded a multi-head attention layer in KeOps, with a reference implementation in PyTorch. I think it would make a good and useful example for beginners, since it's a very practical use case and provides a translation to a mainstream framework.

Here's the link to the code: https://gist.github.com/hypnopump/73bd8b3968b00cc6342401bb2dffdc19
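
The heart of the idea is a single softmax-weighted reduction over symbolic KeOps tensors. As a minimal single-head sketch (the shapes, names and scaling below are illustrative; the gist itself handles batching, multiple heads and the PyTorch reference):

```python
import torch
from pykeops.torch import LazyTensor

def keops_attention(q, k, v):
    # q: (Lq, D), k: (Lk, D), v: (Lk, Dv) -- one head, no batch dimension, for clarity.
    q_i = LazyTensor(q[:, None, :])   # (Lq, 1, D) symbolic query rows
    k_j = LazyTensor(k[None, :, :])   # (1, Lk, D) symbolic key rows
    v_j = LazyTensor(v[None, :, :])   # (1, Lk, Dv) symbolic value rows
    s_ij = (q_i | k_j) / q.shape[-1] ** 0.5   # (Lq, Lk) scores, never materialised in memory
    # Fused softmax over j + weighted sum of the values, in a single KeOps reduction:
    return s_ij.sumsoftmaxweight(v_j, axis=1)  # dense (Lq, Dv) output

out = keops_attention(torch.randn(256, 64), torch.randn(512, 64), torch.randn(512, 32))
```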

Hope this helps!

joanglaunes commented 3 years ago

Hello @hypnopump, Thank you very much, I think this can be very helpful indeed! Would you be OK with including your example in our examples or tutorials section? We could also simply provide a link to your code. I am not at all an expert in attention layers; I think @jeanfeydy knows much more about it, so he may have good comments to make. I just tried your script, and the only "downside" with it as it stands is that it runs faster with plain PyTorch than with KeOps. This is just because you picked very small dimension parameters; could you try standard use-case dimensions, to see whether they show the benefit of KeOps in terms of memory and speed?

hypnopump commented 3 years ago

Hm, yes, I can try that on the GPU (currently I'm doing everything on my laptop), although the point of the code was not speed but reduced memory consumption. My only question is whether you could provide a conda environment.yml or a requirements.txt that I could use to get KeOps working on the GPU, since right now I can only get version 1.4.1 but not 1.5.

A summary of the things I've tried:

  1. Modifying @jeanfeydy's conda environment .yml file to install KeOps 1.5 (it caused a "Segmentation fault: core dumped" while testing both the numpy and torch bindings). I think this must have something to do with CUDA being 11.0, since there have been multiple reports of crashes, bugs, ... but updating to CUDA 11.1 caused a ton of package conflicts, so it's not straightforward.
  2. Installing cudatoolkit==11.1.1 and cudatoolkit-dev==1.1: this didn't work due to an error related to cmake.
  3. Cloning recursively and following the instructions here: https://github.com/getkeops/keops/blob/3efd428b55c724b12f23982c06de00bc4d02d903/.conda/recipe/meta.yaml#L5; this didn't work due to the lack of a CUDA compiler.
  4. I don't know what else to try.

For now, on the CPU: the KeOps implementation is 100x slower than the PyTorch one. Inspecting CPU usage on a MacBook Pro, I can see that the KeOps version only uses 1 CPU core, whereas the PyTorch one makes more use of the available computing power.

Oh, and yes, I'd be happy to have it incorporated into the examples section if you want!

hypnopump commented 3 years ago

Update: I managed to reduce the runtime by almost 25% by using `(q*k).sum(dim=-1)` instead of `(q|k)`. Still can't figure out a proper environment to run on the GPU.
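
For reference, the two formulations define the same dot product; a small sketch of the difference (shapes are illustrative):

```python
import torch
from pykeops.torch import LazyTensor

q_i = LazyTensor(torch.randn(32, 64)[:, None, :])   # (Lq, 1, D)
k_j = LazyTensor(torch.randn(48, 64)[None, :, :])   # (1, Lk, D)

s_a = (q_i | k_j)                  # scalar-product operator
s_b = (q_i * k_j).sum(dim=-1)      # element-wise product, then a sum over the inner dimension
# Both are the same (Lq, Lk) symbolic matrix of dot products; only the generated
# KeOps formula (and hence the compiled kernel) differs.
```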

jeanfeydy commented 3 years ago

Hi @hypnopump , (All my apologies for the late answer: I took a few weeks off coding/GitHub to move back from London to Paris.)

Thanks a lot for your help, this is a very important use case indeed! On my side, I have written a plug-in replacement layer for PyTorch's MultiheadAttention in the attention branch, which is benchmarked here. This module behaves exactly "as you would expect", with a run-time ratio vs. vanilla PyTorch that depends entirely on the d_kv dimension of the attention heads:

[Benchmark plot: KeOps vs. PyTorch run times as a function of d_kv]

Notably, there is no problem scaling up to very large sequences or batch sizes, thanks to the reduced memory footprint. I also note that as long as d_kv < 100, KeOps run times are pretty much independent of d_kv: what matters is the total number of FLOPs, which is proportional to the embedding dimension (= n_heads * d_kv).

Of course, I believe that having both a "blackbox" reference implementation and an easy-to-read tutorial in the doc is important. I will be busy this week with the LogML summer school, but will be able to work on this afterwards. The best option should be to add explicit calls to the PyTorch and KeOps MultiheadAttention layers at the end of your tutorial, with a mini-benchmark and a discussion of the run times.
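
Concretely, those explicit calls could look something like the sketch below; the KeOps import path is hypothetical here (the layer lives in the attention branch), the point being that it mirrors the torch.nn.MultiheadAttention interface:

```python
import torch

seq_len, batch, embed_dim, n_heads = 512, 64, 512, 8
q = k = v = torch.randn(seq_len, batch, embed_dim)

# Vanilla PyTorch: materialises the full (seq_len, seq_len) attention matrix per head.
torch_layer = torch.nn.MultiheadAttention(embed_dim, n_heads)
out_torch, _ = torch_layer(q, k, v)

# Hypothetical KeOps drop-in replacement (import path is an assumption, not the actual API):
# from pykeops.torch import MultiheadAttention as KeopsMultiheadAttention
# keops_layer = KeopsMultiheadAttention(embed_dim, n_heads)
# out_keops, _ = keops_layer(q, k, v)
```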

As far as your configuration problems are concerned: I don't remember everything (@bcharlier is our expert!), but CUDA 11.0 introduced bugs and performance regressions that were fixed in later releases. The key point should be to set up an environment such that `nvcc --version` reports either e.g. 10.2 or >= 11.2 (as of today, the most recent version is 11.4). Is this doable on your machine? (N.B.: In my experience, on shared clusters, sysadmins tend to provide several CUDA versions in parallel: fixing your problem may be as simple as running the correct PATH exports at the start of your session.)
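
In case it helps with the debugging, a quick way to check which toolchain the Python session actually sees (assuming nvcc is on the PATH at all):

```python
import subprocess
import torch

print("CUDA version PyTorch was built with:", torch.version.cuda)
# KeOps compiles its kernels with the CUDA toolchain visible to the session,
# so this is the version that matters for the crashes described above:
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```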

Best regards, and thanks again, Jean

P.S.: Once the new compilation engine is out (with an explicit, human-readable C++ intermediate representation), we will also be able to investigate why `(q*k).sum(dim=-1)` outperforms `(q|k)`. I had also thought that our implementation used multi-core processors somewhat efficiently: we should have a look :-)

hypnopump commented 3 years ago

Hi there!

Okay, so @joanglaunes @jeanfeydy @bcharlier, I managed to get it working with a minimal conda environment. I think some people might be in the same situation, so providing it in the installation tab of the docs would be cool.
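
For reference, a fresh environment can be sanity-checked with the helpers that ship with pykeops (a minimal sketch):

```python
import pykeops

pykeops.clean_pykeops()         # clear compiled formulas left over from a broken setup
pykeops.test_numpy_bindings()   # compiles and runs a tiny test formula with the NumPy bindings
pykeops.test_torch_bindings()   # same check through the PyTorch bindings
```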

For the attention layer, @jeanfeydy, I skimmed your code, and it does not seem fundamentally different from the one I wrote (using yours for the simplest case), so I don't know why mine is slower; it would be nice if we could figure out why.

I've also been coding a couple of functions that might be of interest for the Python library bindings, since they simplify one or two use cases, but I'm not sure where to place them in the current code. I'm also implementing a bunch of attention variants and some graph neural network layers, which I'd like to contribute eventually.

Is there any way to join the conversation on the library's development, to get a clearer idea of it? Or, as a fallback, what's the best way to send you possible updates/interesting pieces? I really love this library (I believe efficient and scalable algorithms are one of the strong needs for the future) and I'd like to contribute to its development/adoption!

jeanfeydy commented 3 years ago

Hi @hypnopump ,

Thanks for your kind words — of course, we’ll be more than happy to discuss all of this with you!

We usually meet via Zoom every other week, but things are now slowing down with holidays / semi-vacations until late August. If @joanglaunes and @bcharlier are on vacation, I could first have a chat with you to let you know more about the context of the library, where it is heading, and how your work could fit into it. I will be pretty busy this week, but would be fairly available next week or (even better) the week after. What do you think?

Best regards and see you soon, Jean

hypnopump commented 3 years ago

I'm back, @jeanfeydy ,

I'm in! This week or next would be fine for me.

Eric Alcaide

hypnopump commented 2 years ago

Hi there! @jeanfeydy, since pykeops==2.0 was released, I've rerun the attention benchmark in this gist: https://gist.github.com/hypnopump/73bd8b3968b00cc6342401bb2dffdc19 and it seems the PyTorch version is still 2-5x faster than KeOps. Of course, KeOps has the advantage of a linear memory footprint, but I wonder whether the attention computation should be faster (at least for small sequences), and whether the lack of that speedup might indicate suboptimal compilation of the tensor expressions.

Results w/ pytorch==1.11.0 and pykeops==2.0b0:

Batch=64, Length=32, Dim=64
Attn keops took: 0.008614063262939453
Attn pytorch took: 0.004002094268798828
Batch=64, Length=64, Dim=64
Attn keops took: 0.017642974853515625
Attn pytorch took: 0.006172895431518555
Batch=64, Length=128, Dim=64
Attn keops took: 0.058526039123535156
Attn pytorch took: 0.017385005950927734
Batch=64, Length=256, Dim=64
Attn keops took: 0.2266089916229248
Attn pytorch took: 0.048638105392456055
Batch=64, Length=512, Dim=64
Attn keops took: 0.8450109958648682
Attn pytorch took: 0.15851211547851562
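
(For context, numbers like these usually come from a small harness along the following lines; this is a sketch rather than the gist's exact code. The warm-up runs matter, since the first KeOps call includes formula compilation.)

```python
import time
import torch

def timed(fn, *args, n_warmup=3, n_runs=10):
    """Rough wall-clock timing of an attention call, averaged over several runs."""
    for _ in range(n_warmup):          # warm-up: triggers KeOps compilation / CUDA init
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # make sure all queued GPU work has finished
    start = time.time()
    for _ in range(n_runs):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / n_runs
```
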
L-Reichardt commented 1 year ago

@hypnopump Great gist, thanks. Just for extra information, I got minor (barely worth mentioning) speedups by using `.squeeze()` instead of `rearrange` and by not using a lambda.