Closed Aurevoir-68 closed 3 weeks ago
What does 200w mean?
What is the number of threads used at train time? See faiss.omp_get_num_threads().
20 queries is too few to measure meaningful timings.
(1) 200w means 2 million (w = 万, i.e. 10,000). d=512, nq=1000, m=64, nlist=2048
(2) docker command:
docker run --gpus "device=0" --cpuset-cpus="0-7"
docker run --gpus "device=0" --cpuset-cpus="0-15"
docker run --gpus "device=0" --cpuset-cpus="0-31"
When using the above commands, we printed the value of omp_get_num_threads() in clustering.cpp. The results were 8, 16, and 32 threads, as expected.
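Besides what OpenMP reports, the cpuset limit can be double-checked from the process itself. A stdlib-only sketch (no Faiss required, Linux only):

```python
import os

# Total CPUs visible on the host (ignores cgroup/cpuset restrictions)
print("os.cpu_count():", os.cpu_count())

# CPUs this process may actually run on; inside a container started with
# --cpuset-cpus="0-7" this should report 8
print("allowed CPUs:", len(os.sched_getaffinity(0)))
```

If the second number does not match the --cpuset-cpus range, the container limits are not being applied the way you expect.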
(3) We modified the value of nq from 20 to 1000. The training times were: 8 threads: 55s, 16 threads: 32s, 32 threads: 197s.
I suspect there is something wrong with your docker install. Could you time a MKL matrix multiplication, like
import time
import numpy as np
import faiss
rs = np.random.RandomState(123)
A = rs.randn(2000, 10000)
B = rs.randn(10000, 2000)
for nt in 1, 2, 4, 8, 16, 24:
faiss.omp_set_num_threads(nt)
t0 = time.time()
C = A @ B
t1 = time.time()
print(nt, t1 - t0)
On my 24-core machine it gives
1 2.2891054153442383
2 1.115039587020874
4 0.6096878051757812
8 0.29677295684814453
16 0.30524182319641113
24 0.32387566566467285
@mdouze When I test the above demo in docker, the results are as follows:
1 1.9535372257232666
2 1.1300079822540283
4 0.6262509822845459
8 0.3299119472503662
16 0.3111748695373535
32 0.28285908699035645
Does this mean there is no problem with the Docker installation, and that something else is wrong?
We further tested the C++ programs. The time consumption for different core counts on the localhost is as follows:
cpu core | time (s)
---|---
1 | 35.4856 |
2 | 18.0447 |
4 | 9.4889 |
8 | 5.1875 |
16 | 2.9397 |
32 | 1.9653 |
The time consumption for different core counts in the docker container is as follows:

cpu core | time (s)
---|---
1 | 44.2805 |
2 | 22.2557 |
4 | 11.5767 |
8 | 6.4056 |
16 | 3.7844 |
32 | 2.9825 |
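For reference, the per-core container overhead can be read off the two tables directly (numbers copied from above):

```python
# Timings (seconds) copied from the two tables above
host   = {1: 35.4856, 2: 18.0447, 4: 9.4889, 8: 5.1875, 16: 2.9397, 32: 1.9653}
docker = {1: 44.2805, 2: 22.2557, 4: 11.5767, 8: 6.4056, 16: 3.7844, 32: 2.9825}

for nt in host:
    overhead = docker[nt] / host[nt]
    print(f"{nt:2d} cores: docker/host = {overhead:.2f}")
```

Both environments scale close to linearly up to 32 cores (about 18x speedup on the host), with the container adding a roughly constant 20-50% overhead, so this micro-benchmark shows no sign of the pathological slowdown seen in IVFPQ training.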
Does this mean that there is no problem with the Docker installation, and the problem may be in Faiss itself?
Here is the C++ code for testing:
#include <omp.h>
#include <cstddef>

// Reference inner product between two d-dimensional vectors
float fvec_inner_product_ref(const float * x, const float * y, size_t d)
{
    float res = 0;
    for (size_t i = 0; i < d; i++)
        res += x[i] * y[i];
    return res;
}

// Brute-force inner products between nq queries and nb database vectors,
// parallelized over the queries with nt OpenMP threads
void my_inner_product(const float * x, const float * y, size_t d, size_t nq, size_t nb, float * res, int nt)
{
#pragma omp parallel for num_threads(nt)
    for (size_t i = 0; i < nq; i++)
    {
        for (size_t j = 0; j < nb; j++)
        {
            //cout << "( " << i << " , " << j << ")" << " thread num: " << omp_get_thread_num() << endl;
            res[i * nb + j] = fvec_inner_product_ref(x + i * d, y + j * d, d);
        }
    }
}

int main()
{
    size_t d = 64;
    size_t x_n = 20000;
    size_t y_n = 40000;
    ...
    for (int nt = 1; nt < 64; nt *= 2)
    {
        float * res = new float[x_n * y_n];
        double t0 = omp_get_wtime();
        my_inner_product(x, y, d, x_n, y_n, res, nt);
        double t1 = omp_get_wtime();
        ...
        delete[] res;
    }
}
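For what it's worth, the kernel above is just a dense matrix product, so the same measurement can be reproduced from Python with NumPy (a sketch; x holds the queries row-wise, y the database vectors):

```python
import numpy as np

def my_inner_product_np(x, y):
    """NumPy equivalent of the C++ my_inner_product above:
    res[i, j] = inner product of query x[i] and database vector y[j]."""
    return x @ y.T

# Small usage example with random data
rs = np.random.RandomState(0)
x = rs.randn(20, 64).astype(np.float32)
y = rs.randn(40, 64).astype(np.float32)
res = my_inner_product_np(x, y)  # shape (20, 40)
```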
Same problem encountered. For IVF index training, OpenMP is used in compute_centroids. There seems to be no race condition here, because Faiss splits the data into parts and each thread handles only one of them. Hope someone can give an answer.
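That split can be illustrated with a toy centroid update (a simplified sketch, not Faiss's actual compute_centroids): each part accumulates into its own buffers, so no two workers ever write to the same memory, and the partial results are merged at the end.

```python
import numpy as np

def compute_centroids_partitioned(x, assign, k, n_parts=4):
    """Toy k-means centroid update. The data is split into n_parts
    contiguous chunks; each chunk accumulates per-centroid sums and
    counts into its own buffers (no shared writes), and the partial
    buffers are merged afterwards."""
    d = x.shape[1]
    sums = np.zeros((n_parts, k, d))
    counts = np.zeros((n_parts, k), dtype=np.int64)
    chunks_x = np.array_split(x, n_parts)
    chunks_a = np.array_split(assign, n_parts)
    for p in range(n_parts):  # each iteration is independent -> parallelizable
        for xi, ci in zip(chunks_x[p], chunks_a[p]):
            sums[p, ci] += xi
            counts[p, ci] += 1
    total_sums = sums.sum(axis=0)
    total_counts = counts.sum(axis=0)
    return total_sums / np.maximum(total_counts, 1)[:, None]
```

Because the per-part buffers are disjoint, this pattern needs no locks regardless of how the parts are scheduled across threads.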
I ran this script in my local docker container on a Mac (faiss version 1.7.2, Python 3.6.8, faiss installed via pip, Mac with 4 CPU cores) and the output is as follows:
1 0.4522852897644043
2 0.43656301498413086
4 0.43779563903808594
8 0.43766117095947266
16 0.4380381107330322
24 0.4380462169647217
It seems that increasing the number of threads does not help at all. When I run the code on the local machine (without docker), the response time behaves as expected: more workers, shorter processing time. The behavior only happens inside the docker container.
I can confirm that IndexIVFPQ search can get extremely slow (on CPU and Python; memory is not fully utilized on my machine, and I don't use Docker either). It looks like race conditions (CPUs are stuck at max load), although since we are just searching there should be no need for locks. I need some further testing to really understand the details.
Please install with conda not pip.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Summary
Platform
OS:
Faiss version: 1.6.3
Installed from: source
Faiss compilation options:
Running on:
Interface:
Reproduction instructions
We verify the training and search time by binding CPU cores in the docker environment. The environment is as follows: number of CPUs: 2 (8 cores / 16 threads each). Docker commands:
1. docker run --gpus "device=0" --cpuset-cpus="0-7"
2. docker run --gpus "device=0" --cpuset-cpus="0-15"
3. docker run --gpus "device=0" --cpuset-cpus="0-31"
We run 3-IVFPQ.cpp in "faiss/tutorial/cpp". The results are as follows: **nb: 200w (2 million) random data, nlist: 2048, nq: 20 random data**

cpu core | train time (s) | search time (ms)
:---:|:---:|:---:
cpu_0-7 | 33.98 | 17.42
cpu_0-15 | 21.14 | 17.21
cpu_0-31 | 63.68 | 36.25

**Why is it that the more cores we use, the longer it takes?**