Improve SIMD implementation of distance calculation

facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.

https://faiss.ai

MIT License

29.43k stars 3.48k forks source link

Improve SIMD implementation of distance calculation #2851

Open Catoverflow opened 1 year ago

Catoverflow commented 1 year ago

Problem I find implementation of fvec_[L2sqr|inner_product]_ny flawed. It only supports SIMD calculation of vectors with length of 1|2|4|8|12, which does not covers many cases. For example, in demo_ivfpq_indexing.cpp the length of subvectors is 32, and in this case fvec_[L2sqr|inner_product]_ny will fallback to step-by-step calculation (though loops are unfolded by compiler).
Solution General distance calculation can be achieved by segmenting the vectors into chunks (with length of 16, for example) and then collecting the distance sum of chunks and remains. hnswlib is a good reference here though there maybe license issues.

Should I PR when available?

alexanderguzhva commented 1 year ago

Hi. Sure, we're happy to welcome any PR. Several things need to be noted, though: 1) the lengths of 1|2|4|8|12 are the most typical cases in practice and 2|4|8 are designed to be faster than what compiler does. 2) these functions that you've mentioned are exposed to the outer world as standalone primitives, but internally they are used in a few places. Faiss code tends to rely on BLAS calls if appropriate 3) Also, Faiss relies on a compiler for general-purpose cases, please check here. For example, fvec_L2sqr_ny renders into https://godbolt.org/z/nsYocWTxo 4) there is a faster alternative in the form of fvec_L2sqr_ny_transposed (I haven't added fvec_inner_products_ny_transposed yet), which an option that needs to be manually enabled and which is applicable for speeding up PQ / IVFPQ search

Catoverflow commented 1 year ago

Thanks for replying even in weekend. :3 These days I took time profiling faiss, and found with BLAS the call sgemm_ comsumed a lot of CPU time, after I switch to AVX impl (i.e. change M value in demo to make sure the calculation is dispatched to the 8 version) the overall execution time is lower.

I'm not familiar with BLAS and other use cases in faiss. I'd like to make a deep investigation first.

alexanderguzhva commented 1 year ago

Thanks @Catoverflow . Just curious, what is your use case that you're trying to optimize? :)

Catoverflow commented 1 year ago

I think fvec_[L2sqr|inner_product]_ny can be implemented without dispatcher, and use a general AVX distance calculation at least (as I described in the first comment). But as you mentioned the compiler could do a good job I'd like to check first because that contradicts to my profiling result (I set M=16 instead of 4 in demo_ivfpq_indexing.cpp and saw the hotspot in fvec_inner_product_ny shrank a lot).

I'm currently busy completing my graduation design and I'm sorry if this issue will be open for half a month or so :(

alexanderguzhva commented 1 year ago

The compiler uses SIMDs of 32 floats for computing fvec_L2sqr_ny or fvec_inner_product_ny, and there are additional use cases for 1, 2, 4, 8 and 12 dims. So, you are correct that M=16 case may be not SIMD-ed well enough. I may add a special case for M=16 or some other M values if needed.