jrcavani opened this issue 2 years ago
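For context (this sketch is not part of the original report), the data and indexes referenced throughout the timings presumably look something like the following; the dimension, dataset size, and factory strings are assumptions, and only the variable names appear in the thread:

```python
# Assumed setup behind the timings below -- d, sizes, and nlist are guesses.
import numpy as np
import faiss
from threadpoolctl import threadpool_limits

d = 128                                              # assumed dimension
xb = np.random.rand(1_000_000, d).astype("float32")  # assumed base vectors

index_flat = faiss.IndexFlatL2(d)
index_fp16 = faiss.index_factory(d, "SQfp16")          # scalar quantizer, fp16
index_sq8 = faiss.index_factory(d, "SQ8")              # scalar quantizer, 8 bit
index_ivf_sq8 = faiss.index_factory(d, "IVF4096,SQ8")  # nlist is a guess

for idx in (index_fp16, index_sq8, index_ivf_sq8):
    idx.train(xb[:200_000])

# The "1 thread" timings below use threadpoolctl to cap the OpenMP pool:
with threadpool_limits(limits=1, user_api="openmp"):
    index_flat.add(xb[:10000])
```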
Add vecs:
# add 10000 vecs with all omp threads
In [50]: %timeit index_flat.add(xb[:10000])
The slowest run took 6.42 times longer than the fastest. This could mean that an intermediate result is being cached.
32 ms ± 29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [51]: %timeit index_fp16.add(xb[:10000])
The slowest run took 5.06 times longer than the fastest. This could mean that an intermediate result is being cached.
24.8 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [54]: %timeit index_sq8.add(xb[:10000])
15.1 ms ± 6.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [56]: %timeit index_ivf_sq8.add(xb[:10000])
10.9 s ± 39.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# add 10000 vecs with 1 thread
In [81]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit -n 100 index_flat.add(xb[:10000])
The slowest run took 6.22 times longer than the fastest. This could mean that an intermediate result is being cached.
37.9 ms ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [81]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_fp16.add(xb[:10000])
The slowest run took 13.38 times longer than the fastest. This could mean that an intermediate result is being cached.
35.3 ms ± 55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [80]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_sq8.add(xb[:10000])
16.7 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [81]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit -n 1 index_ivf_sq8.add(xb[:10000])
10.2 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Notably, adding only 1 vec also activates all OMP threads:
# add 1 vec
In [63]: %timeit index_flat.add(xb[:1])
5.38 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [64]: %timeit index_fp16.add(xb[:1])
155 µs ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # all threads firing
In [65]: %timeit index_sq8.add(xb[:1])
The slowest run took 4.16 times longer than the fastest. This could mean that an intermediate result is being cached.
165 µs ± 61.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # all threads firing
In [66]: %timeit index_ivf_sq8.add(xb[:1])
47.9 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # all threads firing
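One possible mitigation (my sketch, not something proposed in the thread) is to cap the OpenMP pool around tiny batches, either with threadpool_limits as above or with faiss.omp_set_num_threads:

```python
# Sketch of a workaround: run Faiss single-threaded for tiny batches so the
# OpenMP fork/join overhead is not paid per vector.
# faiss.omp_get_max_threads / faiss.omp_set_num_threads are part of the
# Python bindings; the threshold value is arbitrary.
import faiss

def add_small_batch(index, vecs, threshold=32):
    if len(vecs) <= threshold:
        n_threads = faiss.omp_get_max_threads()
        faiss.omp_set_num_threads(1)
        try:
            index.add(vecs)
        finally:
            faiss.omp_set_num_threads(n_threads)
    else:
        index.add(vecs)
```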
Reconstruction (notice that the all-threaded reconstruct_n for the IndexIVFScalarQuantizer is a lot slower than the single-threaded version despite using all threads):
# reconstruct 10000 vecs with all omp threads
In [42]: %timeit index_flat.reconstruct_n(0, 10000)
3.13 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %timeit index_fp16.reconstruct_n(0, 10000)
3.31 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [44]: %timeit index_sq8.reconstruct_n(0, 10000)
2.92 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [46]: %timeit index_ivf_sq8.reconstruct_n(0, 10000)
693 ms ± 273 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# reconstruct 10000 vecs with 1 thread
In [37]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_flat.reconstruct_n(0, 10000)
3.13 ms ± 78.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [38]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_fp16.reconstruct_n(0, 10000)
8.17 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [39]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_sq8.reconstruct_n(0, 10000)
7.07 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [40]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_ivf_sq8.reconstruct_n(0, 10000)
24.2 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Would you mind re-doing the experiments with an Anaconda-based Faiss install? pip sometimes messes up OpenMP versions.
You are right. Heeding that advice, I confirmed that the MKL-compiled Faiss (currently available up to python3.8 from the pytorch conda channel) gives about 1.5x the threaded performance of the pip-installed version linked against openblas. I also tried installing from the conda-forge channel, and they use openblas as well, though they do have precompiled versions for newer than python3.8 (so it doesn't seem limited by Python version).
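(For reference, the MKL-linked build referred to here would be installed with something like conda install -c pytorch faiss-gpu or faiss-cpu, per the standard Faiss install instructions, versus pip install faiss-gpu for the openblas-linked wheel; the exact package for a given Python version is an assumption.)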
Here are the same encode calls (run on a less busy machine, but the relative comparison should still hold), on 32 physical cores / 64 hardware threads:
In [39]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_flat.sa_encode(xb[:10000])
...:
1.46 ms ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [40]: %timeit index_flat.sa_encode(xb[:10000])
...:
1.46 ms ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [10]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_fp16.sa_encode(xb[:10000])
...:
3.22 ms ± 525 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit index_fp16.sa_encode(xb[:10000])
670 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [13]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_sq8.sa_encode(xb[:10000])
...:
8.31 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit index_sq8.sa_encode(xb[:10000])
...:
661 µs ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
There is of course threading overhead, so the speedup is not linear, but this looks much more reasonable. Scaling the batch size 10x and 100x still doesn't make it linear, though.
In [15]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_sq8.sa_encode(xb[:100000])
...:
89.1 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [21]: %timeit index_sq8.sa_encode(xb[:100000])
11 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [20]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_sq8.sa_encode(xb[:1000000])
...:
900 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit index_sq8.sa_encode(xb[:1000000])
115 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
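To make that concrete, here are the single-thread vs. all-thread ratios computed from the sa_encode timings quoted above (the computation is mine, not from the thread):

```python
# Rough speedups for index_sq8.sa_encode from the mean timings quoted above
# (32 physical cores, 64 hardware threads).
timings_ms = {            # (single thread, all threads)
    "10k vecs":  (8.31, 0.661),
    "100k vecs": (89.1, 11.0),
    "1M vecs":   (900.0, 115.0),
}
for batch, (t1, tn) in timings_ms.items():
    print(f"{batch}: {t1 / tn:.1f}x speedup")
# -> ~12.6x, ~8.1x, ~7.8x: well below the core count, and not improving
#    as the batch grows.
```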
Encoding a single vector at a time with the ScalarQuantizer still activates all threads. For comparison, Flat of course doesn't do that.
In [25]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_flat.sa_encode(xb[:1])
...:
3.06 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [24]: %timeit index_flat.sa_encode(xb[:1])
3.07 µs ± 4.64 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [21]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_sq8.sa_encode(xb[:1])
...:
4.19 µs ± 4.84 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [22]: %timeit index_sq8.sa_encode(xb[:1])
8.2 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [37]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_ivf_sq8.sa_encode(xb[:1])
...:
1.05 ms ± 474 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [38]: %timeit index_ivf_sq8.sa_encode(xb[:1])
1.24 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Adding follows exactly the same pattern:
In [27]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_flat.add(xb[:1])
...:
3.35 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [26]: %timeit index_flat.add(xb[:1])
3.26 µs ± 916 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [29]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_sq8.add(xb[:1])
...:
3.7 µs ± 459 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [28]: %timeit index_sq8.add(xb[:1])
7.37 µs ± 394 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [31]: with threadpool_limits(limits=1, user_api="openmp"):
...: %timeit index_ivf_sq8.add(xb[:1])
...:
1.05 ms ± 449 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [32]: %timeit index_ivf_sq8.add(xb[:1])
1.26 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
I checked the Faiss code again and saw nothing wrong. My best guess is just MKL multithreading kicking in even for a single vector.
Hmm right, encoding a single vector triggers a dynamic allocation plus an unconditional omp parallel for... so lots of overhead:
https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L1156
Fixing this would require restructuring the SQ templates, which is a lot of work.
Summary
I tried to benchmark the threaded performance of IndexScalarQuantizer, IndexPQ, IndexIVFScalarQuantizer, and IndexIVFPQ, and noticed that with the default OpenMP setting (all threads available), all threads are activated during encode even if only one vector is given. This leads to all cores being used while the speed is the same as, if not slower than, a single thread.
Platform
OS: Linux (Ubuntu 20.04)
Faiss version: 1.7.2
Installed from: pip install faiss-gpu
Running on:
Interface:
Reproduction instructions
Encode vecs comparison:
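The code for this comparison is not reproduced above; a minimal sketch of what it presumably looks like (the dimension, batch size, and factory string are assumptions):

```python
import numpy as np
import faiss
from threadpoolctl import threadpool_limits

d = 128                                            # assumed dimension
xb = np.random.rand(10_000, d).astype("float32")   # assumed batch

index_sq8 = faiss.index_factory(d, "SQ8")
index_sq8.train(xb)

# default: all OpenMP threads
%timeit index_sq8.sa_encode(xb)

# forced single thread
with threadpool_limits(limits=1, user_api="openmp"):
    %timeit index_sq8.sa_encode(xb)
```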
I'm not an OpenMP expert, though I understand the core concepts, so I'm not 100% certain whether these encode/decode parallelization blocks spin up all configured threads even when there are only one or a few vectors:
https://github.com/facebookresearch/faiss/blob/v1.7.2/faiss/impl/ScalarQuantizer.cpp#L1292-L1294 https://github.com/facebookresearch/faiss/blob/v1.7.2/faiss/IndexScalarQuantizer.cpp#L147-L167
But it looks like the clear way to add vectors is to have multiple processes load the same trained index, each add a sub-portion of the data single-threaded on the same host, and then combine the results afterwards.
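One possible shape of that multi-process workflow (my sketch, not from the thread; the file names and ID scheme are hypothetical, and it relies on the IndexIVF merge_from API):

```python
# Each worker process loads the same trained (but still empty) IVF index,
# adds its shard single-threaded, and writes the result; a final step merges
# the per-shard indexes. Paths and the ID scheme are made up for the sketch.
import numpy as np
import faiss

def add_shard(shard_id, shard_vecs, id_offset):
    index = faiss.read_index("trained_empty.index")   # hypothetical path
    faiss.omp_set_num_threads(1)                      # single-threaded add
    ids = id_offset + np.arange(len(shard_vecs), dtype="int64")
    index.add_with_ids(shard_vecs, ids)
    faiss.write_index(index, f"shard_{shard_id}.index")

def combine(n_shards):
    merged = faiss.read_index("shard_0.index")
    for i in range(1, n_shards):
        part = faiss.read_index(f"shard_{i}.index")
        merged.merge_from(part, 0)   # moves entries into merged; ids kept as assigned
    return merged
```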