facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Index train and add cost much time, but only 1 core was busy #1617

Open taozhijiang opened 3 years ago

taozhijiang commented 3 years ago

Summary

Index train and add take a long time, but only 1 core is busy. How can I improve the performance of index train and add? I currently just use IndexIVFFlat.

Platform

OS: CentOS Linux release 7.6.1810 (Core), Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz

Faiss version: faiss release v1.6.0

Installed from: compiled from source (C++)

Faiss compilation options: -fopenmp

Running on:

Interface:

Reproduction instructions

```cpp
// step 1:
((faiss::Index*)index_)->train(train_size, fb);
// step 2:
((faiss::Index*)index_)->add(total_line_num, fb);
```

After the training stage finished and we got the desired cells, we found that the next add stage took too much time. During the add procedure, only 1 CPU core was used; the others were idle and the total CPU usage was quite low. So I am wondering whether, after the train stage, we can add items in parallel, or whether there is a pitfall in my usage.

Demo: 9 million vectors with dim 512; the training stage took 8 hours and the add procedure took 21 hours!

mdouze commented 3 years ago

The code was probably not compiled with openmp. Could you call faiss::check_openmp() somewhere in the code?

taozhijiang commented 3 years ago

> The code was probably not compiled with openmp. Could you call faiss::check_openmp() somewhere in the code?

faiss::check_openmp() returns true.

taozhijiang commented 3 years ago

I read the code: in IndexIVFFlat.cpp, add_with_ids calls add_core, and the add runs serially. In IndexIVF.cpp, add_with_ids can use OpenMP to run in parallel.

Does this mean IndexIVFFlat can only use 1 core?

mdouze commented 3 years ago

Right, this is an inconsistency. I think it's because it's not much faster with more cores. I will mark this as an enhancement to fix.

taozhijiang commented 3 years ago

Actually, when faiss was built with openblasp (not the default blas), the train and add procedures can use all the cores. Somewhat weird, but why?

r00tk1ts commented 3 years ago

Regarding IndexIVFPQ:

> Right, this is an inconsistency. I think it's because it's not much faster with more cores. I will mark as enhancement to fix that.

The function add_core_o in IndexIVFPQ.cpp has the same issue, and there is a parallelization TODO inside it.

AbdelrahmanElmeniawy commented 2 years ago

For IndexIVFPQ: the add_core_o method consists of three main parts. The first computes the ids, the second (product quantizer code computation) is the actual bottleneck of the method, and the third (adding vectors to the inverted lists) is the part that would need parallelizing. Measuring the time of the third part relative to the second shows that the part needing parallelization accounts for no more than 2% of the total running time, as seen in the screenshot. Even if we reduced that part to 0 ms, the improvement would not be noticeable, so it is not worthwhile to add complexity to the code without a meaningful gain.

[Screenshot: profiling breakdown of add_core_o, 2022-09-13]