jrcavani opened this issue 1 year ago
That is unexpected. We build very large HNSW indexes, up to 100M vectors. Building becomes very slow but I did not observe edge effects beyond a certain size.
Thank you for the fast response. If I had to guess, it's probably tweakable through some OS setting related to memory and caching.
This screenshot depicts the CPU usage issue: red kernel threads dominating the green user threads, where user CPU used to be at almost 100% when the index was smaller / held fewer vectors.
How can I best debug and analyze the issue? Is there a similar HNSW implementation you recommend so I can have them side by side?
You can try hnswlib, which is the reference implementation.
OK. I ran a side-by-side with hnswlib and can reproduce the issue. This was performed on a cloud instance with 32 physical AMD cores and 64 hyperthreads; Intel CPUs behave similarly. Both tests used Python 3.8 with Faiss 1.7.2 packaged on the anaconda pytorch channel.
In []: import numpy as np
In []: import time
In []: import faiss
In []: import hnswlib
In []: data = np.load("data.npy", mmap_mode="r")
# hnsw add
In []: num_elements = data.shape[0]
In []: ids = np.arange(num_elements)
In []: p = hnswlib.Index(space = 'ip', dim = 512) # possible options are l2, cosine or ip
In []: p.init_index(max_elements = num_elements, ef_construction = 200, M = 32)
In []: t = time.time(); p.add_items(data, ids); print(time.time() - t)
11217.803444623947
In []: p.max_elements
Out[]: 37949877
In []: p.set_ef(2048)
In []: %timeit p.knn_query(data[:10000], 5)
22.2 s ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Faiss add
In []: quantizer = faiss.index_factory(512, "HNSW32", faiss.METRIC_INNER_PRODUCT)
In []: quantizer.hnsw.efConstruction = 200
In []: t = time.time(); quantizer.add(data); print(time.time() - t)
21627.71038222313
In []: quantizer.ntotal
Out[]: 37949877
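For scale, a back-of-the-envelope footprint estimate for an index this size (a sketch, not Faiss's exact accounting: it assumes float32 vectors by default, int32 neighbor links with 2*M entries at level 0, and ignores upper-level graph overhead and allocator slack):

```python
# Rough memory estimate for an HNSW index (back-of-the-envelope; assumes
# int32 neighbor links and 2*M level-0 neighbors per vector, ignores
# upper-level graph overhead).
def hnsw_footprint_gb(ntotal: int, d: int, M: int, bytes_per_dim: int = 4) -> float:
    vectors = ntotal * d * bytes_per_dim   # stored vectors (float32 -> 4 bytes/dim)
    links = ntotal * 2 * M * 4             # level-0 neighbor lists, int32 each
    return (vectors + links) / 1024**3

# ~38M float32 vectors at d=512, M=32 (the side-by-side above):
print(round(hnsw_footprint_gb(37_949_877, 512, 32), 1))
# With SQ8 (1 byte per dimension) at ~30M vectors, as in the summary below:
print(round(hnsw_footprint_gb(30_000_000, 512, 32, bytes_per_dim=1), 1))
```

The SQ8 estimate lands a little over 20GB, consistent with the index size reported in the summary.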
In []: quantizer.hnsw.efSearch = 2048
In []: %timeit quantizer.search(data[:10000], 5)
2min 42s ± 3.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
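When comparing wall-clock times like this, it is worth checking that both indexes return comparable neighbors at the same ef setting. A small pure-NumPy helper (hypothetical, not from the session above; assumes both searches return integer ID arrays of shape (nq, k)):

```python
import numpy as np

def mean_overlap(ids_a: np.ndarray, ids_b: np.ndarray) -> float:
    """Average fraction of shared result IDs per query between two (nq, k) sets."""
    assert ids_a.shape == ids_b.shape
    k = ids_a.shape[1]
    shared = [len(np.intersect1d(a, b)) for a, b in zip(ids_a, ids_b)]
    return float(np.mean(shared)) / k

# Hypothetical usage against the session above:
#   labels, _ = p.knn_query(data[:10000], 5)      # hnswlib
#   _, ids = quantizer.search(data[:10000], 5)    # Faiss
#   print(mean_overlap(labels.astype(np.int64), ids))
```

If the overlap is close to 1.0, the two indexes are doing comparable work and the timing gap reflects implementation differences rather than quality differences.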
This is what CPU usage looks like in htop
for hnswlib:
And this for Faiss HNSW32:
Here are some single-core benchmarks at the same parameter settings with float32 vectors. I'm pretty baffled that Faiss's search took 1.72x as long as hnswlib's, given that one was modeled on the other.
In [58]: p.set_num_threads(1)
In [59]: faiss.omp_set_num_threads(1)
In [60]: %timeit p.knn_query(data[:10], 5)
320 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit quantizer.search(data[:10], 5)
552 ms ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [62]: %timeit p.knn_query(data[:50], 5)
1.69 s ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit quantizer.search(data[:50], 5)
2.92 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
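The roughly 1.7x figure follows directly from the mean timings above:

```python
# Latency ratio, Faiss vs hnswlib, from the single-threaded runs above
# (batch size, hnswlib seconds, Faiss seconds).
for n, hnswlib_s, faiss_s in [(10, 0.320, 0.552), (50, 1.69, 2.92)]:
    print(f"{n} queries: {faiss_s / hnswlib_s:.2f}x")
```

Both batch sizes give essentially the same ratio, so the gap looks like a constant per-query factor rather than a batching effect.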
Summary
Hello, I think I've hit an issue not reported before. I have been experimenting with a large HNSW index: d=512, ~30M vectors, HNSW32,SQ8, with efConstruction=100 or higher. It amounts to a little over 20GB in size. I'm using cloud instances with 32 physical cores and 64 hyperthreads, Intel or AMD, with 120GB~590GB of memory, Ubuntu or RHEL-based (tested on different boxes).

Everything was fine: all cores/threads firing at 100% (green in htop, meaning user threads), and memory usage stable. As I kept adding new vectors, suddenly, at around ~25GB in size / ~30M vectors, searches became two to three times slower. CPU utilization fell below the total thread count, and there was a lot of red in htop, indicating kernel threads firing. Resident memory started fluctuating within a 3GB range.

I checked the SSD (not much I/O), and there was no swap set up. Everything pointed to a memory access pattern issue. I found this article about Transparent Hugepage Support, https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html, which notes that some databases such as MongoDB and PostgreSQL prefer it disabled. However, my situation actually improved when I went the opposite way and set THP to "always" instead of madvise. Speed went up by about 1.5x, but there was still a lot of kernel thread activity, and memory kept fluctuating.

I've built a few indexes this way, and all started to show degradation past a certain size. Has anyone seen this, and is there an easy OS memory tuning setting I could try?
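For reference, the active THP mode can be read from sysfs on Linux; the kernel reports it as a bracketed value in a list. A small sketch (Linux-only; the sysfs path is the standard one but may not exist in containers):

```python
def parse_thp(setting_line: str) -> str:
    """Extract the active THP mode from a line like 'always [madvise] never'."""
    return setting_line[setting_line.index("[") + 1 : setting_line.index("]")]

# Hypothetical usage on a Linux box (requires the sysfs file to exist):
#   with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
#       print(parse_thp(f.read()))

print(parse_thp("always [madvise] never"))  # madvise
```

Checking this before and after `echo always > /sys/kernel/mm/transparent_hugepage/enabled` makes it easy to confirm the setting actually took effect when benchmarking.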
Platform
x86_64
OS: Ubuntu/RHEL-based
Faiss version: 1.7.2
Installed from: anaconda, pytorch channel, Python 3.8, faiss-cpu
Faiss compilation options: can confirm MKL
Running on: CPU
Interface: Python