apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add nightly test that calculates recall for vector similarity spaces #13616

Open benwtrent opened 3 months ago

benwtrent commented 3 months ago

Description

We should have a nightly test that verifies expected recall for different vector spaces.

In the past we have leaned on the Lucene nightly benchmarks to detect changes in which vectors are returned, but I think it would be better to have an actual functional nightly test in Lucene itself.

Two ways to approach this:

Use random data. I am not 100% sure this is workable, but it may be. We would index many vectors, build every supported vector format, and verify that recall falls within some acceptable range.

Use static vectors with an associated "golden recall" value, so that any change in recall is caught; if recall does change, we must deliberately recalculate the golden value.

It would be great if these static vectors were not unit vectors, so that we can adequately test all our vector spaces (on unit vectors, cosine, dot product, and Euclidean distance all rank neighbors identically).
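
To make the idea concrete, here is a rough sketch of the recall calculation such a test could be built around. It is plain Java and deliberately does not use Lucene's indexing or search APIs; the class name `RecallSketch`, all helper names, and the tolerance-based golden check are assumptions for illustration only. The ground truth comes from a brute-force scan with dot product as the example similarity, and the "approximate" result in `main` is a stand-in for whatever the vector format under test would return.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.stream.IntStream;

/** Hypothetical helper, not part of Lucene: computes recall against brute-force ground truth. */
public class RecallSketch {

  /** Dot product, used here as the example similarity (higher means more similar). */
  public static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Exact top-k doc ids by brute force, best first. This is the ground truth. */
  public static int[] exactTopK(float[][] docs, float[] query, int k) {
    return IntStream.range(0, docs.length)
        .boxed()
        .sorted(Comparator.comparingDouble((Integer i) -> -dotProduct(docs[i], query)))
        .limit(k)
        .mapToInt(Integer::intValue)
        .toArray();
  }

  /** Fraction of the expected ids that also appear in the actual result. */
  public static double recall(int[] expected, int[] actual) {
    if (expected.length == 0) {
      return 1.0;
    }
    Set<Integer> truth = new HashSet<>();
    for (int id : expected) {
      truth.add(id);
    }
    int hits = 0;
    for (int id : actual) {
      // remove() so that a duplicated id in the result is only counted once
      if (truth.remove(id)) {
        hits++;
      }
    }
    return (double) hits / expected.length;
  }

  /** Seeded random vectors with varying magnitudes, so they are not unit vectors. */
  public static float[][] randomVectors(int count, int dim, long seed) {
    Random random = new Random(seed);
    float[][] vectors = new float[count][dim];
    for (float[] v : vectors) {
      float scale = 0.5f + 4f * random.nextFloat();
      for (int d = 0; d < dim; d++) {
        v[d] = scale * (float) random.nextGaussian();
      }
    }
    return vectors;
  }

  /** Golden check: measured recall must stay within a tolerance of the recorded value. */
  public static boolean withinGolden(double measured, double golden, double tolerance) {
    return Math.abs(measured - golden) <= tolerance;
  }

  public static void main(String[] args) {
    float[][] docs = randomVectors(1000, 16, 42L);
    float[] query = randomVectors(1, 16, 7L)[0];
    int[] truth = exactTopK(docs, query, 10);

    // Stand-in for the index under test: copy the truth and drop one neighbor.
    int[] approx = Arrays.copyOf(truth, truth.length);
    approx[approx.length - 1] = -1;

    System.out.println("recall@10 = " + recall(truth, approx));
  }
}
```

In a real nightly test the `approx` array would come from searching each supported vector format, and the measured value would be compared to either an acceptable range (random data) or a stored golden value (static vectors) via something like `withinGolden`.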

tteofili commented 2 months ago

Also related: supporting this in luceneutil: https://github.com/mikemccand/luceneutil/issues/278