Scope distributing a filtered search index

breezykermo commented 3 weeks ago

In an attempt to narrow the scope of our semester's project, we are considering whether it might be interesting to consider the feasibility of distributing an index modified for filtered ANNS, such as ACORN.

breezykermo commented 3 weeks ago

After reading and discussing the paper, we came up with an idea for an extension of ACORN that we think is well-scoped as a system that we could build over the course of the rest of the semester.

OAK - Opportunistic ANN K-subgraphs

ACORN is an adaptation of HNSW that allows predicate-agnostic filtered searches with better latency than pre-filtering (brute-force over all records that match a predicate) and post-filtering (performing regular HNSW search and then filtering by predicate) for low-selectivity queries in particular. Its core mechanism is additional neighbor expansion during search, which retrieves a greater number of candidates that match a predicate with low-selectivity, a mechanism that approximates the latency of the 'oracle partition index'. This oracle index is the HNSW graph that could have been built by taking the subgraph of records that match a particular predicate.

OAK is a system that opportunistically builds such oracle indexes to speed up queries with common predicates. We define an index whose construction is opportunistic by way of the following qualities:

The cost of building the index (in both CPU time and memory space) will not impact the latency of any other query.
There is an indication from the query history that the existence of an index could speed up future queries.

When a query is received, OAK considers whether any opportunistic indexes exist that could accelerate it. By default, and when no such opportunistic index exists, it uses the 'seed' index, an ACORN-style HNSW index.

breezykermo commented 3 weeks ago

Here's the implementation strategy for OAK I would suggest.

Create Rust bindings for the ACORN research code. (ACORN is built on top of FAISS.) An alternative solution here would be to attempt to re-implement ACORN on top of a pure Rust HNSW implementation, which might not be as time-effective; but could be more interesting/fun.
Set up a benchmarking suite for the 5 datasets used in the ACORN paper.
Design and implement an interface for an OpportunismStrategy. Given that there are many different ways one could imagine triggering an index construction, and parameterizing when it makes sense to do so, I think we want this abstracted so that we can try different strategies.
Reason through what kind of load we need to make effective use of the MVP OpportunismStrategy. It may make sense to reopen the discussion in #17 here, as we now want loads that simulate real-world in the sense that there are periods with less queries, such that there is spare CPU to build OIs. Alternatively, we could begin with the distributed opportunism we discussed: where OIs are only ever built on nodes not in the query path, and OAK makes use of additional hardware up until some threshold set by a user.
Benchmark against ACORN baselines.

breezykermo commented 3 weeks ago

This idea also raises a reasonable critique of ACORN and other ANNS evaluations. They don't take into account queries sent over time, which is closer to a real-world load.

breezykermo commented 3 weeks ago

MVP:

User-specified predicates in OpportunismStrategy
Single node, as OIs are built essentially in advance, and so we don't need to worry (at this stage) about having enough spare CPU/memory.
We should analyze the 5 datasets in the ACORN paper and try to work out OIs that we can build in advance to optimize them.

breezykermo / oak

Scope distributing a filtered search index #25

OAK - Opportunistic ANN K-subgraphs