Benchmarking index "drift" as new vectors are added?

erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python

http://ann-benchmarks.com

MIT License


Benchmarking index "drift" as new vectors are added? #419

Closed hweller1 closed 1 year ago

hweller1 commented 1 year ago

It's my understanding that many of these algorithms start to perform worse as more data points are embedded and added after the initial index is built. In situations where you have a regularly updating dataset (e.g. Twitter) that you'd like to run semantic search over, it would be nice to understand how the different algorithms compare in terms of index drift when vectors are added incrementally, without entirely rebuilding the index, which could be prohibitively expensive.

erikbern commented 1 year ago

The benchmark right now just adds all vectors and then builds the index.

I agree it would be nice to have a benchmark that adds / queries sequentially. But that would change the nature of the benchmarks quite a bit. Maybe later!
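
For illustration only, here is a minimal sketch of what a sequential add / query benchmark could look like. It is not part of ann-benchmarks and does not use any real library's API: `ToyIVFIndex`, `run_drift_benchmark`, and every parameter below are hypothetical names made up for this example. The toy index freezes its centroids at build time, which is one assumed way drift can arise, and recall is measured against brute-force ground truth after each batch of additions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(vectors, query, k):
    """Ground truth: brute-force k nearest neighbours of `query`."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))
    return order[:k]


class ToyIVFIndex:
    """Hypothetical stand-in for a real ANN library (not any library's API).

    Centroids are sampled once at build time and never updated, so vectors
    added later are assigned to possibly stale cells.
    """

    def __init__(self, n_cells, n_probe, rng):
        self.n_cells, self.n_probe, self.rng = n_cells, n_probe, rng
        self.vectors, self.centroids, self.cells = [], [], []

    def build(self, vectors):
        # Pick centroids from the initial data only, then insert every vector.
        self.centroids = self.rng.sample(vectors, self.n_cells)
        self.cells = [[] for _ in self.centroids]
        for v in vectors:
            self.add(v)

    def _nearest_cells(self, v, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: sq_dist(self.centroids[c], v))
        return order[:n]

    def add(self, v):
        # Incremental insert: assign to the nearest (frozen) centroid.
        self.vectors.append(v)
        self.cells[self._nearest_cells(v, 1)[0]].append(len(self.vectors) - 1)

    def query(self, q, k):
        # Scan only the n_probe closest cells, then rank the candidates.
        candidates = [i for c in self._nearest_cells(q, self.n_probe)
                      for i in self.cells[c]]
        candidates.sort(key=lambda i: sq_dist(self.vectors[i], q))
        return candidates[:k]


def run_drift_benchmark(n_initial=300, n_batches=4, batch_size=300,
                        dim=8, k=5, n_queries=20, seed=0):
    """Return [(index_size, recall_at_k), ...] measured after each batch."""
    rng = random.Random(seed)

    def sample(n, shift):
        return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

    index = ToyIVFIndex(n_cells=16, n_probe=2, rng=rng)
    index.build(sample(n_initial, 0.0))

    results = []
    for step in range(n_batches + 1):
        if step > 0:
            # Assumption: each new batch is shifted away from the initial data.
            for v in sample(batch_size, float(step)):
                index.add(v)
        # Queries follow the most recent data distribution.
        queries = sample(n_queries, float(step))
        hits = 0
        for q in queries:
            truth = set(exact_knn(index.vectors, q, k))
            hits += len(truth.intersection(index.query(q, k)))
        results.append((len(index.vectors), hits / (k * n_queries)))
    return results


if __name__ == "__main__":
    for size, recall in run_drift_benchmark():
        print(f"{size:5d} vectors  recall@5 = {recall:.2f}")
```

To compare real libraries this way, the toy class would be replaced by wrappers that expose the same `build` / `add` / `query` shape. Whether a given library supports incremental adds at all would need to be checked per library.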