ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Add support for self-attention layer for GPU #1069

Closed: atrah22 closed this issue 9 months ago

atrah22 commented 10 months ago

Hello, this is a feature request. There is a growing need to run inference on-device with variants of stable-diffusion models. Implementation tricks for the self-attention layer have significantly improved the performance of transformers, stable diffusion, and other models that use self-attention. It is now well established that memory reads and writes, rather than computation, are the bottleneck for self-attention layers. Is there any ongoing effort to implement FlashAttention, the xFormers library, or something similar for Arm GPUs? These libraries have significantly improved the performance of models with self-attention layers running on NVIDIA GPUs.
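For context, the key idea behind FlashAttention is to avoid materialising the full N x N attention matrix in off-chip memory: keys and values are streamed in blocks while the softmax is computed online with a running row maximum and normaliser. Below is a minimal CPU sketch of that online-softmax recurrence (illustrative names and a plain scalar loop, not anything from the Compute Library; on a GPU each K/V block would be staged into local memory and the loops vectorised):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// FlashAttention-style single-head attention for Q, K, V of shape N x d
// (row-major). The N x N score matrix is never materialised: each block of
// keys/values is consumed with an online softmax, keeping only a running
// maximum m, a normaliser l, and an unnormalised output row in fast storage.
void tiled_attention(const std::vector<float> &Q, const std::vector<float> &K,
                     const std::vector<float> &V, std::vector<float> &O,
                     int N, int d, int block = 64)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    for (int i = 0; i < N; ++i)
    {
        float              m = -std::numeric_limits<float>::infinity();
        float              l = 0.0f;
        std::vector<float> acc(d, 0.0f); // unnormalised output row
        for (int j0 = 0; j0 < N; j0 += block) // stream K/V one block at a time
        {
            for (int j = j0; j < std::min(j0 + block, N); ++j)
            {
                float s = 0.0f; // score = scale * dot(Q[i], K[j])
                for (int k = 0; k < d; ++k)
                    s += Q[i * d + k] * K[j * d + k];
                s *= scale;
                const float m_new = std::max(m, s);
                const float corr  = std::exp(m - m_new); // rescale old state
                const float p     = std::exp(s - m_new);
                for (int k = 0; k < d; ++k)
                    acc[k] = acc[k] * corr + p * V[j * d + k];
                l = l * corr + p;
                m = m_new;
            }
        }
        for (int k = 0; k < d; ++k)
            O[i * d + k] = acc[k] / l; // apply softmax normalisation once
    }
}
```

Because the per-row state (m, l, acc) fits in registers or local memory, global-memory traffic drops from O(N^2) for the score matrix to O(N * d) for the inputs and output, which is exactly the saving the paper reports on memory-bound hardware.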

morgolock commented 9 months ago

Hi @atrah22

Thanks for raising this. We've been actively working to improve performance on Mali GPUs. A big part of this effort is the new dynamic fusion feature for the OpenCL backend, which will allow us to write an efficient implementation of this layer.

Dynamic fusion is still experimental, but we have already ported some kernels to use this new feature. So the answer is yes, we are moving in that direction and making the changes required to implement this efficiently. The interface for this feature in the library is the Compute Kernel Writer (CKW); see the following patch porting the Resize operator to use CKW: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10174
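For readers unfamiliar with why fusion matters here: the saving comes from intermediate tensors never round-tripping through global memory between kernels. A toy scalar illustration of the principle is below (plain C++ with made-up function names, not the library's dynamic fusion or CKW interface):

```cpp
#include <cmath>
#include <vector>

// Unfused: two passes over the data. The intermediate tensor t is written
// out in full and then read back, roughly doubling the memory traffic.
void scale_then_gelu_unfused(const std::vector<float> &x, std::vector<float> &y)
{
    std::vector<float> t(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) // pass 1: write N floats
        t[i] = 2.0f * x[i];
    for (std::size_t i = 0; i < x.size(); ++i) // pass 2: read N, write N floats
        y[i] = 0.5f * t[i] * (1.0f + std::erf(t[i] * 0.70710678f));
}

// Fused: one pass; the intermediate stays in a register, so the extra
// write and read of N floats disappear. Dynamic fusion aims for the same
// effect at the OpenCL kernel level.
void scale_then_gelu_fused(const std::vector<float> &x, std::vector<float> &y)
{
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        const float t = 2.0f * x[i];
        y[i] = 0.5f * t * (1.0f + std::erf(t * 0.70710678f));
    }
}
```

For a memory-bound operator chain like attention (matmul, scale, softmax, matmul), fusing the stages is what makes a FlashAttention-style implementation feasible on a GPU backend.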

Are you looking into any specific papers on how to implement this layer? Could you please share links to any models or documents you're looking at?

Hope this helps,