facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
29.43k stars 3.48k forks source link

Improve SIMD implementation of distance calculation #2851

Open Catoverflow opened 1 year ago

Catoverflow commented 1 year ago

Should I PR when available?

alexanderguzhva commented 1 year ago

Hi. Sure, we're happy to welcome any PR. Several things need to be noted, though: 1) the lengths of 1|2|4|8|12 are the most typical cases in practice and 2|4|8 are designed to be faster than what compiler does. 2) these functions that you've mentioned are exposed to the outer world as standalone primitives, but internally they are used in a few places. Faiss code tends to rely on BLAS calls if appropriate 3) Also, Faiss relies on a compiler for general-purpose cases, please check here. For example, fvec_L2sqr_ny renders into https://godbolt.org/z/nsYocWTxo 4) there is a faster alternative in the form of fvec_L2sqr_ny_transposed (I haven't added fvec_inner_products_ny_transposed yet), which an option that needs to be manually enabled and which is applicable for speeding up PQ / IVFPQ search

Catoverflow commented 1 year ago

Thanks for replying even in weekend. :3 These days I took time profiling faiss, and found with BLAS the call sgemm_ comsumed a lot of CPU time, after I switch to AVX impl (i.e. change M value in demo to make sure the calculation is dispatched to the 8 version) the overall execution time is lower.

I'm not familiar with BLAS and other use cases in faiss. I'd like to make a deep investigation first.

alexanderguzhva commented 1 year ago

Thanks @Catoverflow . Just curious, what is your use case that you're trying to optimize? :)

Catoverflow commented 1 year ago

I think fvec_[L2sqr|inner_product]_ny can be implemented without dispatcher, and use a general AVX distance calculation at least (as I described in the first comment). But as you mentioned the compiler could do a good job I'd like to check first because that contradicts to my profiling result (I set M=16 instead of 4 in demo_ivfpq_indexing.cpp and saw the hotspot in fvec_inner_product_ny shrank a lot).

I'm currently busy completing my graduation design and I'm sorry if this issue will be open for half a month or so :(

alexanderguzhva commented 1 year ago

The compiler uses SIMDs of 32 floats for computing fvec_L2sqr_ny or fvec_inner_product_ny, and there are additional use cases for 1, 2, 4, 8 and 12 dims. So, you are correct that M=16 case may be not SIMD-ed well enough. I may add a special case for M=16 or some other M values if needed.