Why is index training 40x slower on computer with similar hardware?

billkle1n commented 7 years ago

I ran the following code on my laptop (a MacBook Pro Retina, 15-inch, Mid 2015, 2.2 GHz Intel Core i7) and on a p2.xlarge AWS EC2 instance (Ubuntu 16.04 AMI). It is almost 40x slower on the AWS machine and I can't figure out why.

One noticeable difference is that my MacBook does not have an NVIDIA GPU, so I didn't compile GPU Faiss there, whereas it is compiled on the AWS machine. That said, I'm not explicitly using the GPU in the test, so it shouldn't really matter, right? And even if it were silently switching to a GPU-based k-means on the EC2 instance, wouldn't that be faster, not slower?

The only other thing I can think of that could make a difference is the set of libraries and flags used when compiling (here's part of the script I used on AWS; I can clean it up and share the full version if it helps).

Here's the Python code:

import pytest  # NOQA
import faiss
import numpy as np
from sklearn.preprocessing import normalize
import time

def l2_normalize(v):
    return normalize(v, norm='l2')

def create_index(
    d=128, nlists=8, M=32, nbits=8, metric_type=faiss.METRIC_INNER_PRODUCT
):
    '''
    Parameters:
        d (default: 128):
            dimension of vector aka global descriptor
        nlists (default: 8):
            number of clusters for the coarse quantizer aka buckets/lists
        M (default: 32):
            number of subquantizers (d should be a multiple of M)
        nbits (default: 8):
            number of bits per subvector
            aka log2(# of clusters per subquantizer)
    '''

    # Used to assign vectors in one of `nlists` lists
    coarse_quantizer = faiss.IndexFlatL2(d)
    # Main index
    index = faiss.IndexIVFPQ(
        # coarse quantization / IVF related params
        coarse_quantizer, d, nlists,
        # PQ related params
        M, nbits
    )

    # We have to keep a reference to the coarse quantizer on the main index
    # object so that Python does not garbage collect it, which would cause a
    # later segfault when Faiss tries to access it internally.
    index.coarse_quantizer = coarse_quantizer

    index.metric_type = metric_type

    return index

def test_create_index_and_train():
    d = 128
    nlists = 8
    index = create_index(d=d, nlists=nlists)
    np.random.seed(123)
    training_vectors = np.random.randn(266, d).astype(np.float32)

    index.train(l2_normalize(training_vectors))

    assert(index.is_trained)
    assert(index.code_size == 32)
    assert(index.d == 128)
    assert(index.nlist == nlists)
    assert(index.nprobe == 1)
    assert(index.by_residual)
    assert(index.metric_type == faiss.METRIC_INNER_PRODUCT == 0)
    assert(index.ntotal == 0)
    assert(index.use_precomputed_table == 1)

    index.display()
    index.print_stats()
    index.verbose = True

    test_vectors = np.random.randn(3, d).astype(np.float32)
    normed_test_vectors = l2_normalize(test_vectors)
    index.add(normed_test_vectors)
    assert(index.ntotal == 3)
    assert(index.imbalance_factor() > 1)

    search_vectors = l2_normalize(np.array([test_vectors[2]]))
    k = 5
    scores, ids = index.search(search_vectors, k)

    # IDs set to -1 if invalid result...
    valid_ids = ids > -1
    scores = scores[valid_ids]
    ids = ids[valid_ids]

    print('index.ntotal =', index.ntotal)

    assert(scores[0] > 0.5)

    index.reset()
    assert(index.is_trained)
    assert(index.ntotal == 0)

if __name__ == '__main__':
    start = time.time()
    test_create_index_and_train()
    end = time.time()
    print('done in {}s'.format(end-start))

Logs on my MacBook Pro:

$ python tests/slow_test.py
Failed to load GPU Faiss: No module named 'swigfaiss_gpu'
Faiss falling back to CPU-only.
WARNING clustering 266 points to 8 centroids: please provide at least 312 training points
[... removed a bunch of warnings ...]
Index: N5faiss10IndexIVFPQE  -> 0 elements
list size in < 1: 8 instances
 add_core times: 0.033 0.079 0.017
index.ntotal = 3
done in 0.6394319534301758s

Logs on AWS:

$ python tests/slow_test.py
WARNING clustering 266 points to 8 centroids: please provide at least 312 training points
[... removed a bunch of warnings ...]
Index: N5faiss10IndexIVFPQE  -> 0 elements
list size in < 1: 8 instances
 add_core times: 2.143 4.513 0.004
index.ntotal = 3
done in 23.92768096923828s

Edit: here's a log of how I compiled Faiss (I just recompiled it on the EC2 machine): https://gist.github.com/anonymous/6cf10f15d1d6b9b45f63dad6a0b89873

billkle1n commented 7 years ago

I tried disabling the GPU code (by renaming swigfaiss_gpu.py in site-packages) on the EC2 machine, but that didn't make a noticeable difference in speed (which I expected, since I'm not explicitly using a GPU index).

$ python tests/slow_test.py
Failed to load GPU Faiss: No module named 'swigfaiss_gpu'
Faiss falling back to CPU-only.
WARNING clustering 266 points to 8 centroids: please provide at least 312 training points
WARNING clustering 266 points to 256 centroids: please provide at least 9984 training points
[... removed a bunch of warnings ...]
Index: N5faiss10IndexIVFPQE  -> 0 elements
list size in < 1: 8 instances
 add_core times: 4.160 4.532 0.003
index.ntotal = 3
done in 26.228580951690674s
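
For reference, the two warnings in this log are consistent with a minimum of 39 training points per centroid: 312 / 8 and 9984 / 256 both give 39. The 256 centroids correspond to the product quantizer, which trains 2^nbits = 256 centroids per subquantizer when nbits=8. The sketch below only reproduces that arithmetic; the constant is inferred from the log lines above and the helper name is hypothetical, not a Faiss function.

```python
# Inferred from the warnings above: 312 / 8 == 9984 / 256 == 39.
MIN_POINTS_PER_CENTROID = 39


def min_training_points(n_centroids, per_centroid=MIN_POINTS_PER_CENTROID):
    """Smallest training set size that would avoid the clustering warning."""
    return n_centroids * per_centroid


nlists = 8   # coarse quantizer centroids, as in the test script
nbits = 8    # bits per PQ code, i.e. 2**nbits centroids per subquantizer

coarse_needed = min_training_points(nlists)       # coarse k-means
pq_needed = min_training_points(2 ** nbits)       # each PQ subquantizer
needed = max(coarse_needed, pq_needed)

print(coarse_needed, pq_needed, needed)
```

With only 266 training vectors the script falls short of both thresholds, which explains why the warnings appear on every run regardless of the BLAS in use.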

billkle1n commented 7 years ago

I recompiled faiss with the Intel MKL library on the AWS EC2 machine and it's a lot faster:

$ python tests/slow_test.py
WARNING clustering 266 points to 8 centroids: please provide at least 312 training points
[... removed a bunch of warnings ...]
Index: N5faiss10IndexIVFPQE  -> 0 elements
list size in < 1: 8 instances
 add_core times: 4.157 1.822 0.004
index.ntotal = 3
done in 0.2357316017150879s

Is OpenBLAS really that much slower than Intel MKL and Apple's Accelerate framework?

mdouze commented 7 years ago

Hi

The BLAS implementation matters a lot. See the comments in the install file about our findings on the relative speed of MKL, OpenBLAS and Accelerate. Also note that OpenBLAS has an interaction problem with OpenMP:

https://github.com/facebookresearch/faiss/wiki/Troubleshooting#slow-brute-force-search-with-openblas

billkle1n commented 7 years ago

Thanks, it looks like setting export OMP_WAIT_POLICY=PASSIVE also sped up the test significantly with OpenBLAS.
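
If exporting the variable in the shell is inconvenient, the same setting can probably be applied from inside the test script. This is a minimal sketch, assuming the OpenMP runtime reads OMP_WAIT_POLICY when it is first initialised, so the variable has to be set before faiss is imported; the helper name is hypothetical.

```python
import os


def use_passive_omp_wait_policy():
    """Equivalent of `export OMP_WAIT_POLICY=PASSIVE` for the current process."""
    os.environ["OMP_WAIT_POLICY"] = "PASSIVE"


# Call this at the very top of the script, before anything loads OpenMP.
use_passive_omp_wait_policy()

# import faiss  # assumption: importing Faiss only after the variable is set

print(os.environ["OMP_WAIT_POLICY"])
```

Setting the variable in the shell remains the safer option, since it takes effect no matter which module loads the OpenMP runtime first.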

gf0507033 commented 6 years ago

@billkle1n could you provide your setup script for MKL & Python? I can compile Faiss with MKL but get a segfault at runtime.

billkle1n commented 6 years ago

@gf0507033 I believe the segfault is unrelated to MKL (I used the MKL installation script from Intel and uncommented the relevant lines in the Faiss makefile). It's a known issue with the way the Faiss Python bindings are implemented: I believe the cause is that the Python runtime sometimes deletes (garbage collects) objects that other Faiss structures still reference from the C++ code. In general, the solution is to use the index_factory function.
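
As a rough illustration of the index_factory suggestion, the index from the test script at the top of this thread could be built roughly as follows. This is a sketch, not the exact code used here: the helper names are hypothetical, and it assumes a Faiss build whose index_factory accepts the "IVF<nlist>,PQ<M>" description string and a metric argument.

```python
def factory_string(nlist, m):
    """Description string for an IVF index with PQ codes, e.g. 'IVF8,PQ32'."""
    return "IVF{},PQ{}".format(nlist, m)


def build_ivfpq(d=128, nlist=8, m=32):
    """Build the index through index_factory instead of wiring it up by hand.

    The coarse quantizer is created inside Faiss, so there is no separate
    Python object that could be garbage collected while the index uses it.
    """
    import faiss  # imported lazily so the helper above works without Faiss

    return faiss.index_factory(
        d, factory_string(nlist, m), faiss.METRIC_INNER_PRODUCT
    )


print(factory_string(8, 32))
```

With Faiss installed, build_ivfpq(128) would take the place of the create_index helper in the original script, and the manual index.coarse_quantizer assignment would no longer be needed.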

davideuler commented 1 year ago

I came across the same problem on Linux. There, indexing 1 million vectors took 120 minutes, while on a Mac Pro Intel 2020 it took 2.5 minutes, so Linux was about 50x slower.

After rebuilding Faiss with Intel MKL support, indexing 1 million vectors took only 2 minutes.

My distribution is CentOS 7, and MKL was installed with yum:

yum-config-manager --add-repo https://yum.repos.intel.com/mkl/setup/intel-mkl.repo
yum install -y intel-mkl

facebookresearch / faiss

Why is index training 40x slower on computer with similar hardware? #201