alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
https://alexklibisz.github.io/elastiknn
Apache License 2.0

Q&A: Scale effects #659

Closed: ezorita closed this issue 6 months ago

ezorita commented 6 months ago

Hi @alexklibisz

First of all, thanks for your time and dedication in building elastiknn.

I'd like to share our use case and the scaling behavior we are observing. We have indexed about 150M documents in an Elasticsearch cluster, storing both the document text and a 768-dimensional vector for each document. We are considering elastiknn because it seems to be the only option in Elasticsearch that can be combined with arbitrary boolean queries. However, we are running into scaling issues: on a small corpus of 500k documents it's blazing fast, but when we go up to 150M it takes about a minute to run an elastiknn query (even without further filtering).

I suspect this might be a memory issue, so I have a few questions related to this:

- What is the vector index size of elastiknn given a number of vectors and a vector size, for LSH and cosine similarity?
- Does elastiknn expect the whole vector index to be in memory at all times?
- If so, is there a way to preload the index in memory and keep it there for as long as elasticsearch runs?
- Have you experienced such scaling issues before? Do you know the most common causes?

Many thanks again!

alexklibisz commented 6 months ago

Hi @ezorita, these are some good questions. I'll try to answer below.

> However, we are running into scaling issues: on a small corpus of 500k documents it's blazing fast, but when we go up to 150M it takes about a minute to run an elastiknn query (even without further filtering).

150M is more than I've ever tested with. It's not surprising that it takes longer, but 60s sounds like the cluster might simply lack the resources for that amount of data. I'm assuming these are LSH (approximate) queries. As a sanity check, how long does it take to run a standard term query on 150M documents with your current infrastructure? Any Elastiknn vector query is basically matching a bunch of terms, so the vector query will necessarily be slower than a term query. I would set a baseline with term queries; a rough sketch of what I mean follows below. More tips further down.
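
For example, something like this (a hypothetical sketch; the host, index name, field, and value are placeholders for your own setup) would measure the term-query baseline:

```python
# Hypothetical baseline: time a plain term query against the same 150M-document
# index. Host, index name, field, and value are placeholders for your own setup.
import time
import requests

query = {
    "size": 10,
    "query": {"term": {"some_keyword_field": "some_value"}},
}

start = time.time()
resp = requests.post("http://localhost:9200/my-index/_search", json=query, timeout=120)
body = resp.json()
print(f"ES took: {body['took']} ms, wall clock: {time.time() - start:.2f} s")
```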

> (even without further filtering)

The filtering is strictly pre-filtering, so filtering should actually improve performance. A rough sketch of a filtered query is below; more here: https://alexklibisz.github.io/elastiknn/api/#running-nearest-neighbors-query-on-a-filtered-subset-of-documents
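
To make the pre-filtering concrete, here is a rough sketch of what a filtered query looks like, following the structure described in the linked docs; all names and parameter values are placeholders and may differ by version:

```python
# Rough sketch of a pre-filtered Elastiknn query, based on the linked docs.
# Host, index name, field names, the filter term, and all parameter values are
# placeholders; check the docs for the exact syntax in your Elastiknn version.
import requests

query = {
    "size": 10,
    "query": {
        "bool": {
            # Pre-filter: only documents matching this clause are candidates
            # for the nearest-neighbors query below.
            "filter": [{"term": {"category": "some_value"}}],
            "must": {
                "elastiknn_nearest_neighbors": {
                    "field": "my_vec",               # field mapped with an LSH model
                    "vec": {"values": [0.1] * 768},  # the query vector
                    "model": "lsh",
                    "similarity": "cosine",
                    "candidates": 100,
                }
            },
        }
    },
}

resp = requests.post("http://localhost:9200/my-index/_search", json=query, timeout=120)
print(resp.json()["took"], "ms")
```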

> What is the vector index size of elastiknn given a number of vectors and a vector size, for LSH and cosine similarity?

It depends on the number of vectors and the LSH parameters. For cosine LSH, index size scales with the L and k parameters: L is the number of hashes stored to represent a vector, and k is the length of each hash. For the cosine LSH model the math is pretty simple: we store L hashes of k bits each, plus the vector itself as 32-bit floats. So your index should be roughly number of vectors * (L * k bits + dimensions * 32 bits per float). An index of 150M 1024-dimensional vectors with L = 100 and k = 5 would have size 150 million * (100 * 5 bits + 1024 * 32 bits), roughly 624GB.
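
A quick script to reproduce that arithmetic:

```python
# Back-of-the-envelope index size from the formula above:
# num_vectors * (L * k bits for the hashes + dims * 32 bits for the stored floats).
def estimated_index_size_gb(num_vectors: int, dims: int, L: int, k: int) -> float:
    bits_per_vector = L * k + dims * 32
    return num_vectors * bits_per_vector / 8 / 1e9  # bits -> bytes -> GB

# The example from the text: 150M vectors, 1024 dims, L=100, k=5 -> ~624 GB.
print(round(estimated_index_size_gb(150_000_000, 1024, L=100, k=5)))
```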

> Does elastiknn expect the whole vector index to be in memory at all times?

Ideally, yes, but this is more of a general concern with Elasticsearch and Lucene. AFAIK, for low-latency search, the index files should ideally be cached in the file system cache. You can monitor IOPS or similar metrics to verify that you're reading from memory and not from disk/SSD.

> If so, is there a way to preload the index in memory and keep it there for as long as elasticsearch runs?

If it has the space, the operating system should eventually cache the index files in the file system cache automatically.

Elasticsearch has some advanced settings to control this more precisely, e.g., https://www.elastic.co/guide/en/elasticsearch/reference/current/preload-data-to-file-system-cache.html. I haven't tried this (a sketch of the setting is below). I usually just trust that if I've provided enough system (non-JVM) memory, the operating system will cache the index files.
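
For reference, the setting on that page is index.store.preload. An untested sketch, with the host and index name as placeholders:

```python
# Untested sketch of the preload setting from the linked docs: index.store.preload
# asks the OS to load certain index files into the file system cache. It is a
# static setting, so it has to be set at index creation (or on a closed index).
import requests

settings = {
    "settings": {
        # File extensions to preload; "*" preloads everything, which the docs
        # warn can evict more useful data if the index is larger than memory.
        "index.store.preload": ["nvd", "dvd"]
    }
}

resp = requests.put("http://localhost:9200/my-new-index", json=settings)
print(resp.json())
```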

> Have you experienced such scaling issues before? Do you know the most common causes?

I haven't really pushed Elastiknn to that kind of scale recently. I've been benchmarking mostly with the Fashion-MNIST dataset, which is ~60k vectors.

My general advice is the following:

> If I understood correctly, you mentioned you would not continue improving elastiknn further, since Elasticsearch is implementing a pretty sophisticated vector search engine.

Yeah, I don't have any plans to add functionality. I've been tinkering with performance when I have the time and ideas, mostly because I'm interested in performance optimization. If someone is interested in adding functionality to Elastiknn, I would review the PRs, but I would also have a high standard for including a new feature. I don't want it to be flaky or a burden to test/maintain.

> However, I wonder what is different between elastiknn and ES' native vector engine, such that the latter does not support arbitrary filter queries.

I haven't looked at ES' native vector search in a long time, so I'm not familiar with the features. If they don't offer pre-filtering, then it's probably not a fundamental limitation. Elastiknn has had pre-filtering since 2020, implemented with existing Elasticsearch and Lucene APIs.

At a strategic level, the difference is that Elasticsearch uses the HNSW model for ANN, which is built into Lucene as a dedicated feature, whereas Elastiknn uses the LSH model for ANN, based on standard Lucene term queries: convert the vector to a set of hashes, store each hash as a term, and use existing APIs to query for those terms (a toy illustration follows below). On benchmarks, HNSW seems to be much better than LSH. I haven't seen a direct comparison of Elastiknn LSH vs. Elasticsearch's HNSW. I'd be very interested to see one, and I would expect HNSW to be much faster given the amount of effort devoted to it in Lucene over the past ~5 years.
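
As a toy illustration of that idea, here is a sketch of random-hyperplane LSH for cosine similarity; it shows the hashes-as-terms concept, not Elastiknn's actual implementation:

```python
# Toy illustration of the hashes-as-terms idea (not Elastiknn's actual code):
# random-hyperplane LSH for cosine similarity produces L hashes of k bits each,
# and each hash can be indexed and queried as an ordinary term.
import numpy as np

rng = np.random.default_rng(0)
dims, L, k = 768, 100, 5
hyperplanes = rng.normal(size=(L, k, dims))  # L hash tables, k hyperplanes each

def hash_vector(vec: np.ndarray) -> list[str]:
    """Return L terms; each encodes which side of its k hyperplanes the vector falls on."""
    bits = (hyperplanes @ vec > 0).astype(int)  # shape (L, k)
    return [f"{i}_" + "".join(map(str, row)) for i, row in enumerate(bits)]

# Indexing stores these terms; querying matches documents that share terms
# with the query vector's hashes, then ranks the candidates.
print(hash_vector(rng.normal(size=dims))[:3])  # e.g. ['0_01101', '1_10010', '2_00111']
```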

I hope that helps!

alexklibisz commented 6 months ago

Converting this to a discussion. Still trying to decide exactly how to distinguish Issues vs. Discussions, but this feels more like a discussion than a specific issue to resolve or implement.