Open vorj opened 1 year ago
Thanks for looking into this! Do I understand correctly that this is with a 512-bit SIMD width? Indeed we should have a way to integrate code for hardware that is not supported by CircleCI (AVX512 being the other example). So we welcome a PR for this functionality.
@mdouze

> Do I understand correctly that this is with a 512-bit SIMD width?
SVE is an abbreviation of Scalable Vector Extension. In this context, "scalable" means that the vector length is not fixed by the instruction set. Each CPU specifies its own vector length: for example, the A64FX has 512-bit SVE registers, while Graviton3 has 256-bit SVE registers. So the programmer should write length-independent code, and the resulting binary will then work on each CPU by detecting the actual vector length at run time. An SVE register is 128*n bits long, in the range [128, 2048] bits.
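To make the length-agnostic idea concrete, here is a minimal scalar sketch (not faiss or real SVE code; `dot_vla` and the `vl` parameter are illustrative) that mirrors the structure of an SVE loop, where a per-iteration lane count plays the role of the predicate:

```cpp
#include <cstddef>

// Illustrative sketch of a vector-length-agnostic (VLA) loop, the
// structure SVE encourages. In real SVE code the width comes from the
// hardware (e.g. svcntw()) and the tail is handled by a predicate
// (svwhilelt_b32); here `vl` and the `active` count stand in for both.
float dot_vla(const float* a, const float* b, size_t n, size_t vl) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i += vl) {
        // "Predicate": only lanes with i + lane < n are active,
        // so no separate peel loop is needed for the tail.
        const size_t active = (n - i < vl) ? (n - i) : vl;
        for (size_t lane = 0; lane < active; ++lane) {
            acc += a[i + lane] * b[i + lane];
        }
    }
    return acc;
}
```

The same function works unchanged whether `vl` is 4 (128-bit), 8 (256-bit), or 16 (512-bit) 32-bit lanes, which is exactly why one SVE binary can run on both A64FX and Graviton3.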
> So we welcome a PR for this functionality.
I'm glad to hear that! :smile: I will make the PRs later.
@vorj thanks for your PR! I have a couple of questions, just to get some knowledge of SVE.
1) Do I get it right that if, say, the SVE vector length is 512 bits, then it will still be possible to operate on 256-bit and 128-bit vectors, just like AVX-512 extends AVX2, which extends AVX?
2) Do I get it right that most of the speedup comes from faster distance computations (the fvec_L2sqr_* and fvec_inner_product_* functions)?
Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.
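For context, faiss's fvec_L2sqr computes the squared Euclidean distance between two float vectors; a scalar reference of the hot loop an SVE port would vectorize might look like this (a sketch, not the faiss implementation itself; `fvec_L2sqr_ref` is an illustrative name):

```cpp
#include <cstddef>

// Scalar reference for the squared L2 distance between two
// d-dimensional float vectors, mirroring the signature of faiss's
// fvec_L2sqr(const float* x, const float* y, size_t d).
// This inner loop is the kind of code the SVE port vectorizes.
float fvec_L2sqr_ref(const float* x, const float* y, size_t d) {
    float res = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        const float diff = x[i] - y[i];
        res += diff * diff;  // accumulate squared differences
    }
    return res;
}
```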
@alexanderguzhva To answer your questions:

1) Most SVE instructions take a predicate (mask) register whose elements are {mask0, mask1, ..., mask15}. If you pass the mask as {1, 1, 1, 1, 0, 0, 0, 0, ..., 0}, you can load/calculate/store 4 x 32-bit (= 128-bit) data; {1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0} is for 256 bits. Of course, this is slower than using the full vector length. Alternatively, you can still use Advanced SIMD (NEON) as a 128/64-bit SIMD instruction set; when you use it, you need to write a peel loop (or something similar) for data whose length is not a multiple of 4, in the same manner as before.

2) The SVE changes also cover code_distance
and exhaustive_L2sqr_blas.

> Also, I'll be happy to assist and point you to the bottlenecks that would benefit from custom ARM code, if needed.
Thank you! 😄
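The fixed-width NEON-style pattern described above, with a main vector body plus a peel loop for lengths that are not a multiple of 4, can be sketched in scalar form (`sum_fixed_width` is an illustrative name, not a faiss function; real NEON code would use float32x4_t intrinsics for the main body):

```cpp
#include <cstddef>

// Sketch of the fixed-width (NEON-style) pattern: process 4 floats
// per iteration, then a scalar "peel" loop for the 0-3 leftover
// elements when n is not a multiple of 4. SVE's predicates make this
// peel loop unnecessary.
float sum_fixed_width(const float* x, size_t n) {
    float acc = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // main body: full 4-lane chunks
        acc += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    }
    for (; i < n; ++i) {          // peel loop: remaining elements
        acc += x[i];
    }
    return acc;
}
```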
Summary
Dear @mdouze and all,
ARM SVE is a newer extended vector instruction set than NEON and is supported on CPUs like AWS Graviton3 and Fujitsu A64FX. I've added SVE support and some functions implemented with SVE to faiss, then compared their execution times. It seems that my implementation improves performance on some environments. This is just a first implementation to show the ability of SVE, and I plan to implement SVE versions of other functions that are not yet ported.
It may not be possible to test this on CircleCI at the moment; nevertheless, would you mind if I submit this as a PR?
Platform
OS: Ubuntu 22.04
Faiss version: a3296f42adee7a0159b7ac09d7642e862edb142f, and mine
Installed from: compiled by myself
Faiss compilation options:
cmake -B build -DFAISS_ENABLE_GPU=OFF -DPython_EXECUTABLE=$(which python3) -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON -DFAISS_OPT_LEVEL=sve
(-DFAISS_OPT_LEVEL=sve is a new opt level introduced by my changes)
Running on:
Interface:
Reproduction instructions
I only post the results of searching SIFT1M. If you need more detailed information, please let me know.
original is the current (a3296f42adee7a0159b7ac09d7642e862edb142f) implementation; SVE is the result of my implementation supporting ARM SVE. The above image illustrates the speed-up ratio: SVE is approx. 2.26x faster than original (IndexIVFPQ + IndexHNSWFlat, M: 32, nprobe: 16).
original: 0.618 ms
SVE: 0.274 ms