Open Catoverflow opened 1 year ago
Hi. Sure, we're happy to welcome any PR.
Several things need to be noted, though:
1) the lengths of 1|2|4|8|12
are the most typical cases in practice and 2|4|8
are designed to be faster than what compiler does.
2) these functions that you've mentioned are exposed to the outer world as standalone primitives, but internally they are used in a few places. Faiss code tends to rely on BLAS calls if appropriate
3) Also, Faiss relies on a compiler for general-purpose cases, please check here. For example, fvec_L2sqr_ny
renders into https://godbolt.org/z/nsYocWTxo
4) there is a faster alternative in the form of fvec_L2sqr_ny_transposed
(I haven't added fvec_inner_products_ny_transposed
yet), which an option that needs to be manually enabled and which is applicable for speeding up PQ / IVFPQ search
Thanks for replying even in weekend. :3 These days I took time profiling faiss, and found with BLAS the call sgemm_
comsumed a lot of CPU time, after I switch to AVX impl (i.e. change M value in demo to make sure the calculation is dispatched to the 8
version) the overall execution time is lower.
I'm not familiar with BLAS and other use cases in faiss. I'd like to make a deep investigation first.
Thanks @Catoverflow . Just curious, what is your use case that you're trying to optimize? :)
I think fvec_[L2sqr|inner_product]_ny
can be implemented without dispatcher, and use a general AVX distance calculation at least (as I described in the first comment). But as you mentioned the compiler could do a good job I'd like to check first because that contradicts to my profiling result (I set M=16 instead of 4 in demo_ivfpq_indexing.cpp and saw the hotspot in fvec_inner_product_ny shrank a lot).
I'm currently busy completing my graduation design and I'm sorry if this issue will be open for half a month or so :(
The compiler uses SIMDs of 32 floats for computing fvec_L2sqr_ny
or fvec_inner_product_ny
, and there are additional use cases for 1, 2, 4, 8 and 12 dims. So, you are correct that M=16 case may be not SIMD-ed well enough. I may add a special case for M=16 or some other M values if needed.
Problem I find implementation of
fvec_[L2sqr|inner_product]_ny
flawed. It only supports SIMD calculation of vectors with length of1|2|4|8|12
, which does not covers many cases. For example, in demo_ivfpq_indexing.cpp the length of subvectors is 32, and in this casefvec_[L2sqr|inner_product]_ny
will fallback to step-by-step calculation (though loops are unfolded by compiler).Solution General distance calculation can be achieved by segmenting the vectors into chunks (with length of 16, for example) and then collecting the distance sum of chunks and remains. hnswlib is a good reference here though there maybe license issues.
Should I PR when available?