facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai

HNSW index gets very slow at search upon adding vectors up to ~25GB #2490

Open jrcavani opened 1 year ago

jrcavani commented 1 year ago

Summary

Hello, I think I've hit an issue not reported before. I have been experimenting with a large HNSW index: d=512, ~30M vectors, HNSW32,SQ8, with efConstruction=100 or higher. It amounts to a little over 20GB in size. I'm using cloud instances with 32 physical cores and 64 hyperthreads (Intel or AMD), 120GB–590GB of memory, Ubuntu or RHEL-based (tested on different boxes).
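
For reference, the index is built roughly along these lines (simplified sketch, not my exact script; the training-sample size and random data are placeholders):

import numpy as np
import faiss

d = 512
index = faiss.index_factory(d, "HNSW32,SQ8")   # HNSW graph with M=32 over 8-bit scalar-quantized vectors
index.hnsw.efConstruction = 100                # or higher

train_sample = np.random.rand(100_000, d).astype("float32")  # placeholder for a real training sample
index.train(train_sample)                      # the SQ8 storage needs training before add
index.add(train_sample)                        # real vectors are added in batches up to ~30M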

Everything was fine at first: all cores/threads at 100% (green in htop, i.e. all user time), and memory usage stable. As I kept adding new vectors, suddenly, at around ~25GB in size / ~30M vectors, searches became two to three times slower. CPU utilization dropped below the total thread count, and there was a lot of red in htop, indicating kernel time. The resident memory started fluctuating within a ~3GB range.

I checked the SSD: not much I/O, and there was no swap set up. Everything pointed to a memory access pattern issue. I found this article about Transparent Hugepage Support, https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html, and that some databases such as MongoDB and PostgreSQL do not like it enabled. However, my situation actually improved when I went the opposite way and set THP to "always" instead of "madvise". Speed went up by about 1.5x, but there was still a lot of kernel thread activity, and memory was still fluctuating.
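
For anyone who wants to check this on their own box, the current THP mode can be read from sysfs (the bracketed value is the active one); switching it to "always" means writing that word to the same file as root:

# Show the current Transparent Hugepage mode, e.g. "always [madvise] never"
with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
    print(f.read().strip())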

I've built a few indexes this way and all of them started to show degradation beyond a certain size. Has anyone seen this, and is there an easy OS memory tuning setting I could try?

Platform

x86_64

OS: Ubuntu/RHEL-based

Faiss version: 1.7.2

Installed from: anaconda from pytorch channel, python3.8, faiss-cpu.

Faiss compilation options: can confirm MKL

Running on: CPU

Interface: Python

mdouze commented 1 year ago

That is unexpected. We build very large HNSW indexes, up to 100M vectors. Building becomes very slow but I did not observe edge effects beyond a certain size.

jrcavani commented 1 year ago

Thank you for the fast response. If I had to guess, it's probably tweakable through some OS setting related to memory and caching.

This screenshot depicts the CPU usage issue: red kernel threads dominating the green user threads, which used to be at nearly 100% when the index was smaller / had fewer vectors. [Screenshot: htop CPU usage, 2022-09-23 2:24 PM]

How can I best debug and analyze this issue? Is there a similar HNSW implementation you'd recommend so I can compare them side by side?

mdouze commented 1 year ago

You can try the hnswlib implementation; that is the reference.

jrcavani commented 1 year ago

OK, I ran a side-by-side with hnswlib and can reproduce the issue. It was performed on a cloud instance with 32 physical AMD cores and 64 hyperthreads; Intel CPUs behave similarly. Both runs used Python 3.8, with Faiss 1.7.2 packaged from the anaconda pytorch channel.

In []: import numpy as np
In []: import time
In []: import faiss
In []: import hnswlib

In []: data = np.load("data.npy", mmap_mode="r")

# hnsw add
In []: num_elements = data.shape[0]
In []: ids = np.arange(num_elements)

In []: p = hnswlib.Index(space = 'ip', dim = 512) # possible options are l2, cosine or ip
In []: p.init_index(max_elements = num_elements, ef_construction = 200, M = 32)

In []: t = time.time(); p.add_items(data, ids); print(time.time() - t)
11217.803444623947

In []: p.max_elements
Out[]: 37949877

In []: p.set_ef(2048)
In []: %timeit p.knn_query(data[:10000], 5)
22.2 s ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Faiss add
In []: quantizer = faiss.index_factory(512, "HNSW32", faiss.METRIC_INNER_PRODUCT)
In []: quantizer.hnsw.efConstruction = 200

In []: t = time.time(); quantizer.add(data); print(time.time() - t)
21627.71038222313

In []: quantizer.ntotal
Out[]: 37949877

In []: quantizer.hnsw.efSearch = 2048
In []: %timeit quantizer.search(data[:10000], 5)
2min 42s ± 3.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is what CPU usage looks like in htop for hnswlib: [Screenshot: htop during hnswlib search]

And this for Faiss HNSW32: [Screenshot: htop during Faiss HNSW32 search]

jrcavani commented 1 year ago

Here are some single-core benchmarks at the same parameter settings with float32 vectors. I'm pretty baffled that Faiss's search took 1.72x as long as hnswlib's, given one was modeled after the other.

In [58]: p.set_num_threads(1)

In [59]: faiss.omp_set_num_threads(1)

In [60]: %timeit p.knn_query(data[:10], 5)
320 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [61]: %timeit quantizer.search(data[:10], 5)
552 ms ± 9.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [62]: %timeit p.knn_query(data[:50], 5)
1.69 s ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [63]: %timeit quantizer.search(data[:50], 5)
2.92 s ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
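
For completeness, it would also be worth confirming that both indexes reach similar recall at these settings, e.g. against an exact inner-product baseline along these lines (sketch only; the helper name is mine, and the flat index over the full ~38M x 512 float32 dataset needs roughly 75GB of RAM, so a subset may be more practical):

k = 5
queries = np.ascontiguousarray(data[:10000], dtype="float32")

flat = faiss.IndexFlatIP(512)                          # exact inner-product baseline
flat.add(np.ascontiguousarray(data, dtype="float32"))  # or a manageable subset
_, gt = flat.search(queries, k)                        # ground-truth neighbor ids

_, faiss_ids = quantizer.search(queries, k)
hnswlib_ids, _ = p.knn_query(queries, k)               # same sequential ids as Faiss, so labels are comparable

def recall_at_k(ids):
    # fraction of ground-truth neighbors recovered, averaged over queries
    return np.mean([len(set(row) & set(g)) / k for row, g in zip(ids, gt)])

print("faiss:", recall_at_k(faiss_ids), "hnswlib:", recall_at_k(hnswlib_ids))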