facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

`IndexScalarQuantizer`, `IndexPQ`, `IndexIVFScalarQuantizer`, `IndexIVFPQ` slow add with openmp #2477

Open jrcavani opened 2 years ago

jrcavani commented 2 years ago

Summary

I tried to benchmark the threaded performance of IndexScalarQuantizer, IndexPQ, IndexIVFScalarQuantizer, and IndexIVFPQ, and noticed that with the default OpenMP setting (all threads available), all threads are activated during encoding, even if only a single vector is given. This leads to all cores being used while the speed is the same as, if not slower than, a single thread.

Platform

Linux

OS: Ubuntu 20.04

Faiss version: 1.7.2

Installed from: pip install faiss-gpu

Running on: CPU

Interface: Python

Reproduction instructions
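
The snippets below assume xb (a large batch of float32 vectors) and threadpool_limits (from the threadpoolctl package) are already defined. A minimal setup, with random data standing in for the real vectors, might look like:

import numpy as np
import faiss
from threadpoolctl import threadpool_limits

xb = np.random.rand(1_000_000, 512).astype("float32")
faiss.normalize_L2(xb)  # the indexes below use METRIC_INNER_PRODUCT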

d = 512
In [45]: index_flat = faiss.IndexFlat(d, faiss.METRIC_INNER_PRODUCT)

In [46]: index_fp16 = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_fp16, faiss.METRIC_INNER_PRODUCT)

In [47]: index_sq8 = faiss.read_index("SQ8_trained.index")

In [48]: index_ivf_sq8 = faiss.read_index("IVF65536,SQ8_trained.index")

Encode vecs comparison:

In [17]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_fp16.sa_encode(xb[:10000])
6.64 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [19]: %timeit index_fp16.sa_encode(xb[:10000])
5.92 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.sa_encode(xb[:10000])
13.8 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: %timeit index_sq8.sa_encode(xb[:10000])
6.52 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm not an OpenMP expert, but I understand the core concepts, so I'm not 100% certain whether these encode/decode parallelization blocks fire up all configured threads even when only one or a few vectors are given:

https://github.com/facebookresearch/faiss/blob/v1.7.2/faiss/impl/ScalarQuantizer.cpp#L1292-L1294 https://github.com/facebookresearch/faiss/blob/v1.7.2/faiss/IndexScalarQuantizer.cpp#L147-L167

But it looks like the clear way to add vectors is to have multiple processes load the same trained index, each add a subset of the vectors single-threaded on the same host, and later combine the results (see the sketch below).
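
For reference, a rough sketch of that multi-process pattern for the IVF case (my illustration, not from the thread; shard_vectors, the worker id i, and num_shards are hypothetical names, and the combine step assumes IndexIVF.merge_from):

import faiss

# Worker i (a separate, single-threaded process): load the trained index and add one shard.
faiss.omp_set_num_threads(1)
index = faiss.read_index("IVF65536,SQ8_trained.index")  # same trained index in every worker
index.add(shard_vectors)                                # this worker's slice of the data
faiss.write_index(index, f"shard_{i}.index")

# Combiner: fold the populated shards into one index.
merged = faiss.read_index("shard_0.index")
for j in range(1, num_shards):
    part = faiss.read_index(f"shard_{j}.index")
    merged.merge_from(part, merged.ntotal)  # offset shifts the incoming sequential ids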

jrcavani commented 2 years ago

Add vecs:

# add 10000 vecs with all omp threads
In [50]: %timeit index_flat.add(xb[:10000])
The slowest run took 6.42 times longer than the fastest. This could mean that an intermediate result is being cached.
32 ms ± 29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [51]: %timeit index_fp16.add(xb[:10000])
The slowest run took 5.06 times longer than the fastest. This could mean that an intermediate result is being cached.
24.8 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [54]: %timeit index_sq8.add(xb[:10000])
15.1 ms ± 6.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [56]: %timeit index_ivf_sq8.add(xb[:10000])
10.9 s ± 39.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# add 10000 vecs with 1 thread
In [81]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit -n 100 index_flat.add(xb[:10000])
The slowest run took 6.22 times longer than the fastest. This could mean that an intermediate result is being cached.
37.9 ms ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [81]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_fp16.add(xb[:10000])
The slowest run took 13.38 times longer than the fastest. This could mean that an intermediate result is being cached.
35.3 ms ± 55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [80]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.add(xb[:10000])
16.7 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [81]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit -n 1 index_ivf_sq8.add(xb[:10000])
10.2 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notably, adding just 1 vector also activates all omp threads:

# add 1 vec
In [63]: %timeit index_flat.add(xb[:1])
5.38 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [64]: %timeit index_fp16.add(xb[:1])
155 µs ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # all threads firing

In [65]: %timeit index_sq8.add(xb[:1])
The slowest run took 4.16 times longer than the fastest. This could mean that an intermediate result is being cached.
165 µs ± 61.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # all threads firing

In [66]: %timeit index_ivf_sq8.add(xb[:1])
47.9 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # all threads firing
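
(A caller-side workaround, not from the thread: temporarily cap OpenMP around small adds and restore it afterwards. A minimal sketch using the faiss.omp_set_num_threads / faiss.omp_get_max_threads bindings, with the hypothetical helper name single_thread_faiss:)

from contextlib import contextmanager
import faiss

@contextmanager
def single_thread_faiss():
    # Force Faiss/OpenMP down to one thread for the duration of the block.
    prev = faiss.omp_get_max_threads()
    faiss.omp_set_num_threads(1)
    try:
        yield
    finally:
        faiss.omp_set_num_threads(prev)

# Usage:
# with single_thread_faiss():
#     index_sq8.add(xb[:1])  # no longer fires all omp threads for one vector
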
jrcavani commented 2 years ago

Reconstruction (notice that the all-threaded reconstruct_n for IndexIVFScalarQuantizer is a lot slower than the single-threaded version, despite using all threads):

# reconstruct 10000 vecs with all omp threads
In [42]: %timeit index_flat.reconstruct_n(0, 10000)
3.13 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [43]: %timeit index_fp16.reconstruct_n(0, 10000)
3.31 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [44]:  %timeit index_sq8.reconstruct_n(0, 10000)
2.92 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [46]:  %timeit index_ivf_sq8.reconstruct_n(0, 10000)
693 ms ± 273 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# reconstruct 10000 vecs with 1 thread
In [37]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_flat.reconstruct_n(0, 10000)
3.13 ms ± 78.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [38]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_fp16.reconstruct_n(0, 10000)
8.17 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [39]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.reconstruct_n(0, 10000)
7.07 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [40]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_ivf_sq8.reconstruct_n(0, 10000)
24.2 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
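
(Given the numbers above, capping OpenMP is a clear win for IVF reconstruction; the same threadpoolctl context from the earlier snippets works, e.g.:)

# ~24 ms single-threaded vs ~693 ms all-threaded in the runs above
with threadpool_limits(limits=1, user_api="openmp"):
    vecs = index_ivf_sq8.reconstruct_n(0, 10000)
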
mdouze commented 1 year ago

Would you mind re-doing the experiments with an anaconda based Faiss install? pip sometimes messes up openmp versions.

jrcavani commented 1 year ago

You are right. Heeding that advice, I confirmed that MKL-compiled Faiss (currently available up to python3.8 from the pytorch conda channel) gives about 1.5x the threaded performance of the pip-installed version with OpenBLAS. I also tried installing from the conda-forge channel, but those builds use OpenBLAS as well. Note that conda-forge has precompiled versions for Python versions above 3.8 (it does not seem limited by Python version).

Same encode calls (also run on a less busy machine, but the relative comparison should still hold), with 32 physical cores and 64 total available threads:

In [39]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_flat.sa_encode(xb[:10000])
    ...:
1.46 ms ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [40]: %timeit index_flat.sa_encode(xb[:10000])
    ...:
1.46 ms ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_fp16.sa_encode(xb[:10000])
    ...:
3.22 ms ± 525 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit index_fp16.sa_encode(xb[:10000])
670 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [13]:
    ...: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.sa_encode(xb[:10000])
    ...:
8.31 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [14]: %timeit index_sq8.sa_encode(xb[:10000])
    ...:
661 µs ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

There is of course threading overhead, so the speedup is not linear, but this looks much more reasonable. Scaling the batch size 10x and 100x still doesn't make it linear, though (the runs below land around 8x: 89.1 ms / 11 ms and 900 ms / 115 ms, on 32 physical cores).

In [15]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.sa_encode(xb[:100000])
    ...:
89.1 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]: %timeit index_sq8.sa_encode(xb[:100000])
11 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.sa_encode(xb[:1000000])
    ...:
900 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]: %timeit index_sq8.sa_encode(xb[:1000000])
115 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
jrcavani commented 1 year ago

Encoding a single vector at a time with ScalarQuantizer still activates all threads. For comparison, Flat of course doesn't do that.

In [25]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_flat.sa_encode(xb[:1])
    ...:
3.06 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [24]: %timeit index_flat.sa_encode(xb[:1])
3.07 µs ± 4.64 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [21]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.sa_encode(xb[:1])
    ...:
4.19 µs ± 4.84 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [22]: %timeit index_sq8.sa_encode(xb[:1])
8.2 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [37]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_ivf_sq8.sa_encode(xb[:1])
    ...:
1.05 ms ± 474 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [38]: %timeit index_ivf_sq8.sa_encode(xb[:1])
1.24 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Adding follows exactly the same pattern:

In [27]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_flat.add(xb[:1])
    ...:
3.35 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [26]: %timeit index_flat.add(xb[:1])
3.26 µs ± 916 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [29]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_sq8.add(xb[:1])
    ...:
3.7 µs ± 459 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [28]: %timeit index_sq8.add(xb[:1])
7.37 µs ± 394 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [31]: with threadpool_limits(limits=1, user_api="openmp"):
    ...:     %timeit index_ivf_sq8.add(xb[:1])
    ...:
1.05 ms ± 449 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [32]: %timeit index_ivf_sq8.add(xb[:1])
1.26 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I checked the Faiss code again and saw nothing wrong. My best guess is that it is just MKL multithreading kicking in for the single-vector case.
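
(One way to test that guess, not from the thread: threadpoolctl can cap the BLAS/MKL pool independently of OpenMP:)

# Cap MKL/BLAS threads while leaving OpenMP alone, then re-time the single-vector add:
with threadpool_limits(limits=1, user_api="blas"):
    index_sq8.add(xb[:1])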

mdouze commented 1 year ago

Hmm right, encoding a single vector unconditionally triggers a dynamic allocation plus an omp parallel for... so lots of overhead.

https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L1156

Fixing this would require restructuring the templates of SQ, which is a lot of work.
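
(Until that restructuring happens, a caller-side mitigation, my suggestion rather than anything from the thread, is to buffer single vectors and add them in batches so the per-call parallel-region overhead is amortized. BufferedAdder is a hypothetical helper:)

import numpy as np

class BufferedAdder:
    # Batches single-vector adds to amortize the per-call OpenMP startup cost.
    def __init__(self, index, batch_size=1024):
        self.index = index
        self.batch_size = batch_size
        self.buf = []

    def add(self, x):
        self.buf.append(np.asarray(x, dtype="float32").reshape(1, -1))
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.index.add(np.vstack(self.buf))
            self.buf = []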