erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License
4.85k stars, 727 forks

category filter #308

Open gtsoukas opened 2 years ago

gtsoukas commented 2 years ago

Would it be in the spirit of these benchmarks to add a second benchmark category for ANN in conjunction with categorical filters?

Most real-world applications of ANN will require category filtering, e.g. when searching for clothes in an e-commerce scenario via ANN, one might filter by gender (categorical) or availability (categorical).

There are several software products which allow combining ANN and category filters, e.g. Apache Solr, Elasticsearch, Vertex AI Matching Engine, weaviate, qdrant. However, they differ from the libraries in these benchmarks mainly in that they are services (some of them managed) rather than embeddable libraries.

In addition to recall vs. queries per second, there should be a view of filter fraction (the share of the data that passes the filter) vs. recall vs. queries per second. For the proprietary managed services, a cost dimension might also be useful.

I have found the following blog articles covering the topic:

Given that this would be very useful for practical implementations, but also that it significantly complicates the benchmarks, I would be interested in your opinion and/or how I could help with it. Also, it would be great to know if someone has already done such benchmarks.

erikbern commented 2 years ago

I think that would be interesting! I think the downsides are

  1. Would make it more complex
  2. Not sure if there are any obvious public datasets for this?

gtsoukas commented 2 years ago

I think that would be interesting! I think the downsides are

  1. Would make it more complex

Fully agree, probably the key reason not to do it.

  2. Not sure if there are any obvious public datasets for this?

Datasets from the existing benchmarks could be reused if an additional artificial categorical random variable is introduced, which would allow filtering down to any fraction of the original dataset between 0% and 100%. The approach is described here: https://towardsdatascience.com/effects-of-filtered-hnsw-searches-on-recall-and-latency-434becf8041c
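
A minimal sketch of the idea above, using only NumPy. The function names `add_random_category` and `filtered_ground_truth`, the chosen fractions, and the random stand-in data are all assumptions for illustration; nothing here is part of ann-benchmarks. Each item gets one random label, and the exact filtered neighbors are computed by brute force so that recall of a filtered ANN search could later be measured against them.

```python
import numpy as np


def add_random_category(n_items, fractions, seed=0):
    """Assign each item one label; label i is drawn with probability fractions[i]."""
    rng = np.random.default_rng(seed)
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1")
    return rng.choice(len(fractions), size=n_items, p=fractions)


def filtered_ground_truth(train, queries, labels, wanted_label, k):
    """Exact Euclidean k-NN ids per query, restricted to items with wanted_label."""
    candidate_ids = np.flatnonzero(labels == wanted_label)
    subset = train[candidate_ids]
    # Squared Euclidean distances, shape (n_queries, n_candidates).
    dists = ((queries[:, None, :] - subset[None, :, :]) ** 2).sum(axis=2)
    k = min(k, len(candidate_ids))
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Map positions in the subset back to ids in the full dataset.
    return candidate_ids[nearest]


# Hypothetical stand-in data; a real run would load a benchmark dataset instead.
rng = np.random.default_rng(42)
train = rng.normal(size=(2000, 16)).astype(np.float32)
queries = rng.normal(size=(5, 16)).astype(np.float32)

# Four labels that select roughly 1%, 9%, 40% and 50% of the data.
labels = add_random_category(len(train), [0.01, 0.09, 0.4, 0.5])
truth = filtered_ground_truth(train, queries, labels, wanted_label=1, k=10)
```

Because the label is independent of the vectors, each label selects an approximately fixed fraction of the data, which gives the "filter fraction" axis without needing a new dataset.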

erikbern commented 2 years ago

if an additional artificial, categorical random variable is introduced, allowing to filter to fractions of the original dataset between 0-100%

I think that makes sense, but it would be nice if there were some more natural way to do it. E.g., for the MNIST dataset, filtering by digit 0-9 could be nice.
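
A rough sketch of what filtering by a natural label could look like, assuming a `labels` array holding one class (e.g. a digit 0-9) per item. The data below is random and only stands in for MNIST, and `exact_knn` and `post_filter_recall` are hypothetical helpers, not ann-benchmarks functions. It compares the simplest strategy, searching unfiltered and then dropping non-matching items, against the exact filtered neighbors; an exact search stands in for the ANN index.

```python
import numpy as np


def exact_knn(train, query, k, allowed_ids=None):
    """Exact Euclidean k-NN ids for one query, optionally restricted to allowed_ids."""
    ids = np.arange(len(train)) if allowed_ids is None else np.asarray(allowed_ids)
    dists = ((train[ids] - query) ** 2).sum(axis=1)
    return ids[np.argsort(dists)[:k]]


def post_filter_recall(train, labels, query, wanted_label, k, overfetch):
    """Recall of 'search unfiltered, then drop other labels' vs. exact filtered k-NN."""
    truth = exact_knn(train, query, k, np.flatnonzero(labels == wanted_label))
    # Fetch k * overfetch unfiltered candidates (stand-in for an ANN index query).
    candidates = exact_knn(train, query, k * overfetch)
    kept = candidates[labels[candidates] == wanted_label][:k]
    return len(np.intersect1d(kept, truth)) / len(truth)


rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 16)).astype(np.float32)
labels = rng.integers(0, 10, size=len(train))  # stand-in for digit labels 0-9
query = rng.normal(size=16).astype(np.float32)

recalls = {
    overfetch: post_filter_recall(train, labels, query, wanted_label=3, k=10, overfetch=overfetch)
    for overfetch in (1, 5, 500)
}
for overfetch, recall in recalls.items():
    print(f"overfetch={overfetch}: recall={recall:.2f}")
```

One caveat on the stand-in: real digit labels are correlated with the vectors, whereas the random labels here are not, so results on MNIST itself could look quite different from this toy setup.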