criteo / autofaiss

Automatically create Faiss knn indices with the most optimal similarity search parameters.
https://criteo.github.io/autofaiss/
Apache License 2.0
802 stars 74 forks

build_index takes much more time when decreasing max_index_memory_usage #157

Open SingL3 opened 1 year ago

SingL3 commented 1 year ago

Hi, I first built an index for 50,000 x 512 embeddings with max_index_memory_usage=4G, which took about 90s. Then I tried max_index_memory_usage=50M, and it has been running for 17+ hours (not finished yet). Here is the log: (screenshot). Is this working as expected? BTW, do you have a suggestion for the best setting of max_index_memory_usage? Thank you.

SingL3 commented 1 year ago

I interrupted it and the log output looked like this: (screenshot)

victor-paltz commented 1 year ago

Hello @SingL3!

The bottleneck here is faiss: the index.train(train_vectors) call is taking forever. The easiest way to speed it up is to use more CPU cores. You can get an idea of the training time to expect (and more insights) from this great Medium article by @rom1504: https://rom1504.medium.com/semantic-search-at-billions-scale-95f21695689a -> "For 400m embeddings it takes between 4h and 12h on a 16 cores machine depending on the product quantization parameter."
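As a concrete knob, faiss parallelizes training with OpenMP, so the thread count can be raised explicitly; a minimal sketch (the core count is illustrative):

import faiss

# faiss uses OpenMP for training and search; by default all visible cores
print(faiss.omp_get_max_threads())
faiss.omp_set_num_threads(16)  # illustrative: match your machine's physical cores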

About the best setting of max_index_memory_usage: I would say you should use as much RAM as possible to get the best performance, so set it to 16G or 32G if you can. Have a look at the recall metrics computed at the end of index construction to make sure the index is as good as you want. (You have 190GB of data; with clip embeddings you can safely compress by a factor of 16 without losing too much quality.)
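For reference, a minimal build_index call with an explicit memory budget would look like this (paths and sizes are illustrative, not from this thread):

from autofaiss import build_index

build_index(
    embeddings="embeddings",             # folder of .npy embedding files (illustrative path)
    index_path="knn.index",
    index_infos_path="index_infos.json",
    max_index_memory_usage="16G",        # upper bound on the final index size
    current_memory_available="32G",      # RAM available during construction
)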

I hope it helps!

SingL3 commented 1 year ago

Hello @victor-paltz, I am using 12 cores and I am testing on only 50k embeddings, which should not take that much time. So I think it is stuck.

Actually, the reason I asked about the best setting of max_index_memory_usage is that a smaller value yields a higher compression ratio, and a higher compression ratio means less disk usage and, in this case, less RAM.
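For scale, a quick back-of-the-envelope, assuming float32 embeddings:

n, d = 50_000, 512
raw_bytes = n * d * 4          # float32 = 4 bytes per dimension
print(raw_bytes / 2**20)       # ~97.7 MiB uncompressed
# A 50M budget therefore forces roughly 2x compression,
# while a 4G budget leaves the vectors effectively uncompressed.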

victor-paltz commented 1 year ago

I just ran it on my computer (16 cores), and it took 27 minutes to train the index. Have you tried training it a second time?

import time

import faiss
import numpy as np

d = 512
embeddings = np.float32(np.random.rand(500_000, d))
index_key = "OPQ256_768,IVF1024_HNSW32,PQ256x8"

# Build the index described by index_key and time its training on 50k vectors
index_with_autofaiss = faiss.index_factory(d, index_key, faiss.METRIC_INNER_PRODUCT)

start_time = time.time()
index_with_autofaiss.train(embeddings[:50_000])
elapsed_time_autofaiss = time.time() - start_time
print(elapsed_time_autofaiss)

For reference, you can also keep your index on disk and still get good query performance. So if too much compression hurts the quality of your results, you could use that solution instead.
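A sketch of that on-disk option, assuming build_index's should_be_memory_mappable flag together with faiss's mmap read flag (paths illustrative):

from autofaiss import build_index
import faiss

# Build an index whose data can be memory-mapped instead of loaded fully in RAM
build_index(
    embeddings="embeddings",
    index_path="knn.index",
    index_infos_path="index_infos.json",
    should_be_memory_mappable=True,
)

# Load it lazily from disk; queries page in data on demand
index = faiss.read_index("knn.index", faiss.IO_FLAG_MMAP)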

SingL3 commented 1 year ago

Hi, @victor-paltz. I tried several times after that and found that I have to set nb_cores explicitly (e.g. --nb_cores=12). If it is left unset, it is None and autofaiss falls back to multiprocessing.cpu_count() (which returns 128 here). As I mentioned here:

I interrupted it and the log output looked like this: (screenshot)

I interrupted the processes that were stuck, and they all seem to be stuck in swigfaiss_avx2.py.
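For anyone hitting the same hang, pinning the core count explicitly (instead of letting it default to multiprocessing.cpu_count()) looks like this; a sketch, with nb_cores=12 matching the flag above and illustrative paths:

from autofaiss import build_index

build_index(
    embeddings="embeddings",
    index_path="knn.index",
    index_infos_path="index_infos.json",
    nb_cores=12,  # avoid defaulting to all 128 visible cores on a shared machine
)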