Closed Aurevoir-68 closed 3 weeks ago
What does 200w mean?
What is the number of threads used at train time? See faiss.omp_get_num_threads().
20 queries is too few to measure meaningful timings.
(1) 200w means 2 million (w = 万, i.e. 10,000). d=512, nq=1000, m=64, nlist=2048
(2) docker command:
docker run --gpus "device=0" --cpuset-cpus="0-7"
docker run --gpus "device=0" --cpuset-cpus="0-15"
docker run --gpus "device=0" --cpuset-cpus="0-31"
When using the above commands, we printed the value of omp_get_num_threads() in clustering.cpp. The results were 8, 16, and 32 threads, as expected.
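Besides what OpenMP reports, the cpuset limit can be double-checked from the process itself. A stdlib-only sketch (no Faiss required, Linux only):

```python
import os

# Total CPUs visible on the host (ignores cgroup/cpuset restrictions)
print("os.cpu_count():", os.cpu_count())

# CPUs this process may actually run on; inside a container started with
# --cpuset-cpus="0-7" this should report 8
print("allowed CPUs:", len(os.sched_getaffinity(0)))
```

If the second number does not match the --cpuset-cpus range, the container limits are not being applied the way you expect.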
(3) We modified the value of nq from 20 to 1000. The training times were: 8 threads: 55s, 16 threads: 32s, 32 threads: 197s.
I suspect there is something wrong with your docker install. Could you time a MKL matrix multiplication, like
import time
import numpy as np
import faiss
rs = np.random.RandomState(123)
A = rs.randn(2000, 10000)
B = rs.randn(10000, 2000)
for nt in 1, 2, 4, 8, 16, 24:
faiss.omp_set_num_threads(nt)
t0 = time.time()
C = A @ B
t1 = time.time()
print(nt, t1 - t0)
On my 24-core machine it gives
1 2.2891054153442383
2 1.115039587020874
4 0.6096878051757812
8 0.29677295684814453
16 0.30524182319641113
24 0.32387566566467285
@mdouze When I test the above demo in docker, the results are as follows:
1 1.9535372257232666
2 1.1300079822540283
4 0.6262509822845459
8 0.3299119472503662
16 0.3111748695373535
32 0.28285908699035645
Does this mean there is no problem with the Docker installation, and that something else is wrong?
We further tested the C++ programs. The time consumption for different core counts on the localhost is as follows:
cpu core | time (s)
---|---
1 | 35.4856 |
2 | 18.0447 |
4 | 9.4889 |
8 | 5.1875 |
16 | 2.9397 |
32 | 1.9653 |
The time consumption for different core counts in the docker container is as follows:

cpu core | time (s)
---|---
1 | 44.2805 |
2 | 22.2557 |
4 | 11.5767 |
8 | 6.4056 |
16 | 3.7844 |
32 | 2.9825 |
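For reference, the per-core container overhead can be read off the two tables directly (numbers copied from above):

```python
# Timings (seconds) copied from the two tables above
host   = {1: 35.4856, 2: 18.0447, 4: 9.4889, 8: 5.1875, 16: 2.9397, 32: 1.9653}
docker = {1: 44.2805, 2: 22.2557, 4: 11.5767, 8: 6.4056, 16: 3.7844, 32: 2.9825}

for nt in host:
    overhead = docker[nt] / host[nt]
    print(f"{nt:2d} cores: docker/host = {overhead:.2f}")
```

Both environments scale close to linearly up to 32 cores (about 18x speedup on the host), with the container adding a roughly constant 20-50% overhead, so this micro-benchmark shows no sign of the pathological slowdown seen in IVFPQ training.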
Does this mean that there is no problem with the Docker installation, and the problem may be in Faiss itself?
Here is the C++ code for testing:
#include <omp.h>
#include <cstddef>

// Reference inner product between two d-dimensional vectors
float fvec_inner_product_ref(const float * x, const float * y, size_t d)
{
    float res = 0;
    for (size_t i = 0; i < d; i++)
        res += x[i] * y[i];
    return res;
}

// Brute-force inner products between nq queries and nb database vectors,
// parallelized over the queries with nt OpenMP threads
void my_inner_product(const float * x, const float * y, size_t d, size_t nq, size_t nb, float * res, int nt)
{
#pragma omp parallel for num_threads(nt)
    for (size_t i = 0; i < nq; i++)
    {
        for (size_t j = 0; j < nb; j++)
        {
            //cout << "( " << i << " , " << j << ")" << " thread num: " << omp_get_thread_num() << endl;
            res[i * nb + j] = fvec_inner_product_ref(x + i * d, y + j * d, d);
        }
    }
}

int main()
{
    size_t d = 64;
    size_t x_n = 20000;
    size_t y_n = 40000;
    ...
    for (int nt = 1; nt < 64; nt *= 2)
    {
        float * res = new float[x_n * y_n];
        double t0 = omp_get_wtime();
        my_inner_product(x, y, d, x_n, y_n, res, nt);
        double t1 = omp_get_wtime();
        ...
        delete[] res;
    }
}
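For what it's worth, the kernel above is just a dense matrix product, so the same measurement can be reproduced from Python with NumPy (a sketch; x holds the queries row-wise, y the database vectors):

```python
import numpy as np

def my_inner_product_np(x, y):
    """NumPy equivalent of the C++ my_inner_product above:
    res[i, j] = inner product of query x[i] and database vector y[j]."""
    return x @ y.T

# Small usage example with random data
rs = np.random.RandomState(0)
x = rs.randn(20, 64).astype(np.float32)
y = rs.randn(40, 64).astype(np.float32)
res = my_inner_product_np(x, y)  # shape (20, 40)
```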
Same problem encountered. For IVF index training, OpenMP is used in compute_centroids. There seems to be no race condition here, because Faiss splits the data into parts and each thread handles only one of them. Hope someone can give an answer.
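That split can be illustrated with a toy centroid update (a simplified sketch, not Faiss's actual compute_centroids): each part accumulates into its own buffers, so no two workers ever write to the same memory, and the partial results are merged at the end.

```python
import numpy as np

def compute_centroids_partitioned(x, assign, k, n_parts=4):
    """Toy k-means centroid update. The data is split into n_parts
    contiguous chunks; each chunk accumulates per-centroid sums and
    counts into its own buffers (no shared writes), and the partial
    buffers are merged afterwards."""
    d = x.shape[1]
    sums = np.zeros((n_parts, k, d))
    counts = np.zeros((n_parts, k), dtype=np.int64)
    chunks_x = np.array_split(x, n_parts)
    chunks_a = np.array_split(assign, n_parts)
    for p in range(n_parts):  # each iteration is independent -> parallelizable
        for xi, ci in zip(chunks_x[p], chunks_a[p]):
            sums[p, ci] += xi
            counts[p, ci] += 1
    total_sums = sums.sum(axis=0)
    total_counts = counts.sum(axis=0)
    return total_sums / np.maximum(total_counts, 1)[:, None]
```

Because the per-part buffers are disjoint, this pattern needs no locks regardless of how the parts are scheduled across threads.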
I ran this script in my local docker container on a Mac (faiss version 1.7.2, Python 3.6.8, faiss installed via pip, Mac with 4 CPU cores) and the output is as follows:
1 0.4522852897644043
2 0.43656301498413086
4 0.43779563903808594
8 0.43766117095947266
16 0.4380381107330322
24 0.4380462169647217
It seems that increasing the number of threads does not help at all. When I run the code on the local machine (without docker), the response time behaves as expected: more workers, shorter processing time. The behavior only happens inside the docker container.
I can confirm that IndexIVFPQ search can get extremely slow (on CPU and Python; memory is not fully utilized on my machine, and I don't use Docker either). It looks like race conditions (CPUs are stuck at max load), although since we are just searching there should be no need for locks. I need some further testing to really understand the details.
Please install with conda not pip.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Summary
Platform
OS:
Faiss version: 1.6.3
Installed from: source
Faiss compilation options:
Running on:
Interface:
Reproduction instructions
We verify the training and search time by binding CPU cores in the docker environment. The environment is as follows: number of CPUs: 2 (8 cores / 16 threads each). Docker commands:
1. docker run --gpus "device=0" --cpuset-cpus="0-7"
2. docker run --gpus "device=0" --cpuset-cpus="0-15"
3. docker run --gpus "device=0" --cpuset-cpus="0-31"
We run 3-IVFPQ.cpp in "faiss/tutorial/cpp". The results are as follows: **nb: 200w (2 million) random data, nlist: 2048, nq: 20 random data**

cpu core | train time (s) | search time (ms)
:---:|:---:|:---:
cpu_0-7 | 33.98 | 17.42
cpu_0-15 | 21.14 | 17.21
cpu_0-31 | 63.68 | 36.25

**Why is it that the more cores we use, the longer it takes?**