facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Supporting ARM SVE, the newer extended vector instruction set for aarch64 #2884

Open vorj opened 1 year ago

vorj commented 1 year ago

Summary

Dear @mdouze and all,

ARM SVE is a vector instruction set extension newer than NEON, supported on CPUs such as AWS Graviton3 and the Fujitsu A64fx. I've added SVE support to faiss, along with SVE implementations of some functions, and compared their execution times. My implementation appears to improve performance on some environments. This is just a first implementation to show what SVE can do; I plan to implement SVE versions of other functions that have not yet been ported.

It might not be possible to test this on CircleCI currently; even so, would you mind if I submit this as a PR?

Platform

OS: Ubuntu 22.04

Faiss version: a3296f42adee7a0159b7ac09d7642e862edb142f, and mine

Installed from: compiled by myself

Faiss compilation options: `cmake -B build -DFAISS_ENABLE_GPU=OFF -DPython_EXECUTABLE=$(which python3) -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DFAISS_OPT_LEVEL=sve` (`-DFAISS_OPT_LEVEL=sve` is a new opt level introduced by my changes)


Reproduction instructions

I am only posting the results for searching SIFT1M. If you need more detailed information, please let me know.

benchmark result

(image: speedup ratios for SIFT1M search)

The image above illustrates the speedup ratios.

mdouze commented 1 year ago

Thanks for looking into this! Do I understand correctly that this is with a 512-bit SIMD width? Indeed we should have a way to integrate code for hardware that is not supported by CircleCI (AVX512 being the other example). So we welcome a PR for this functionality.

vorj commented 1 year ago

@mdouze

Do I understand correctly that this is with a 512-bit SIMD width?

SVE is an abbreviation of Scalable Vector Extension. Here, scalable means that the vector length is not fixed by the instruction set. The vector length is chosen by each CPU implementation: for example, the A64fx has 512-bit SVE registers, while Graviton3 has 256-bit SVE registers. So the programmer should write length-independent code, and the same binary will then work on each CPU, detecting the actual vector length at run time. An SVE register is 128×n bits long, within the range [128, 2048] bits.

So we welcome a PR for this functionality.

I'm glad to hear that! :smile: I will make the PRs later.

alexanderguzhva commented 1 year ago

@vorj thanks for your PR! I have a couple of questions, just to learn a bit about SVE.

  1. Do I understand correctly that if, say, the SVE vector length is 512 bits, it is still possible to operate on 256 bits and 128 bits, just like AVX-512 extends AVX2, which extends AVX?
  2. Do I get it right that most of the speedup comes from faster distance computation (the fvec_L2sqr_* and fvec_inner_product_* functions)?

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.

vorj commented 1 year ago

@alexanderguzhva To answer your question,

  1. Suppose the vector length is 512 bits and consider a 32-bit element type (so the vector is used as 16 × 32-bit lanes). Below, I will write the mask as {mask0, mask1, ..., mask15}. If you pass the mask {1, 1, 1, 1, 0, 0, 0, 0, ..., 0}, you load/compute/store 4 × 32-bit (= 128-bit) data; {1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0} does the same for 256 bits. Of course this will be slower than using the full vector length. Alternatively, you can still use Advanced SIMD (NEON) as a 128/64-bit SIMD instruction set; in that case you need to write a peel loop (or something like it) for data whose length is not a multiple of 4, in the same manner as before.
  2. At least in this PR, almost yes. I plan to make another PR containing SVE implementations of code_distance and exhaustive_L2sqr_blas.

Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.

Thank you! 😄