facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

OpenMP support and single vector search #2823

Open Aub-M opened 1 year ago

Aub-M commented 1 year ago

Summary

A search() call is surprisingly slow in the C++ library compared to the same call in the Python version.

Platform

OS: Ubuntu 22.04

Faiss version: 1.7.3 (release)
Installed from: compiled locally

Faiss compilation options: no GPU, no Python, shared libraries, Release build. Intel MKL is installed beforehand and found by CMake during configuration.

Running on: CPU

Interface: C++ (the index is built with the Python library)

Reproduction instructions

Context: we are building an index (IndexPQ) from around 30M vectorized strings. The index is built in Python, with the Python library installed via Anaconda. The index is then imported into another project, in C++ this time, where it is read and where the searches are done.
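For reference, a minimal sketch of the build-and-serialize workflow described above — the dimensionality, PQ settings, file name, and data here are placeholders, not the project's actual values:

```python
import numpy as np
# import faiss  # uncomment when faiss is installed

d, m, nbits = 64, 8, 8                             # placeholder dimensionality and PQ settings
xb = np.random.rand(10000, d).astype("float32")    # stand-in for the ~30M vectorized strings

# index = faiss.IndexPQ(d, m, nbits)
# index.train(xb)
# index.add(xb)
# faiss.write_index(index, "strings.index")
# The C++ project would then load it with:
#   faiss::Index* index = faiss::read_index("strings.index");
```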

We have two use cases: one that searches multiple vectors at the same time, and one that searches only a single vector. We are currently working on the latter. In Python, searching a single vector returns a result in 2–3 ms. On the C++ side, the same search, with the same vector and the same index, takes ~260 to ~360 ms. After some research, I found that OpenMP could cause problems, although according to the documentation page on threads, parallelization only helps when searching multiple vectors at once. To get more clues, I profiled 10 calls to see where the time is spent (the call selected here is search(), average time ≈ 360 ms):

Screenshot_20230414_164421

Then I tried forcing OpenMP to run with only one thread, via omp_set_num_threads() (average time ≈ 280 ms):

Screenshot_20230414_162304

The profiler cannot track what is called inside the OpenMP-parallelized region, so I am still a bit blind...

I could use some help figuring out whether this is expected, and whether I can do anything to make the C++ library respond as quickly as the Python version does.

Thanks in advance for your help!

alexanderguzhva commented 1 year ago

Hi. Faiss is heavily optimized for batched search. Also, the upcoming 1.7.4 release (next week, I hope) brings performance improvements to the ProductQuantizer::search() call. Hope it helps.
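To illustrate the batching point (a sketch with placeholder shapes; `index` stands for any trained Faiss index, not the reporter's actual one):

```python
import numpy as np

d, k = 64, 10                                      # placeholder dimensionality and result count
queries = np.random.rand(100, d).astype("float32") # 100 query vectors stacked in one matrix

# Instead of 100 separate single-vector calls:
#   for q in queries:
#       D, I = index.search(q.reshape(1, d), k)
# one batched call amortizes the per-call and OpenMP scheduling overhead:
#   D, I = index.search(queries, k)   # D: (100, k) distances, I: (100, k) ids
```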

@mdouze may bring some additional suggestions.

alexanderguzhva commented 1 year ago

I think I understand what the problem is. @Aub-M, would you be able to test it and provide feedback if I publish a pull request? Thanks.

Aub-M commented 1 year ago

Sure, I'll deploy it here. Feel free to ask for specific details if you need, I'll try to record as much as possible.

Aub-M commented 1 year ago

Hello again, I just saw your pull request, so I tested it (pulled main, then applied your commit). There are no more forks in the profile, but the response time stays roughly the same (the omp_outlined frame is still there, so I cannot report on what is happening inside it).

Screenshot_20230417_172229

alexanderguzhva commented 1 year ago

Hi @Aub-M. I believe this means that most of the time is spent in the pq_knn_search_with_tables<>() call (see faiss/ProductQuantizer.cpp, line 746); you can verify this by running perf record with the Linux perf tool, or by collecting hotspots with the MSVC profiler on Windows. From what I see, pq_knn_search_with_tables takes most of the time. I also have an experimental patch that improves ProductQuantizer::search() time by 25–40% in both single-request and batch-request mode (on my machine, in my experiments); I just need time to create a proper PR. Would you mind sharing the dimensionality of your data and the PQ settings you are using, so I can test this case as well? Are you using AMD or Intel CPUs? Or ARM? Thanks.

Aub-M commented 1 year ago

Hello,

I can confirm that the time is indeed spent in pq_knn_search_with_tables. I removed the OpenMP directives completely, and since Valgrind cannot see through the template instantiation, only a fraction of a percent remained attributed to ProductQuantizer::search(); the rest disappeared.

Now, about the hardware, I feel a bit dumb... I had assumed both experiments, Python and C++, were done on the same generic laptop, but they were not. The Python runs take place on an AMD Ryzen 7 3700X, while the C++ code runs on an i7-10510U. That alone could explain the difference...

As for the data, here are the parameters: