apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Can/should `KnnByte/FloatVectorQuery` carry some human-meaningful opaque `toString` fragment? #12487

Closed mikemccand closed 10 months ago

mikemccand commented 1 year ago

Description

Over in https://github.com/mikemccand/luceneutil/issues/226 while trying to fix a sneaky and long-standing Lucene nightly benchmark non-determinism that affected VectorSearch and some *TaxoFacets performance measures, I struggled and failed/cheated to pick which VectorSearch queries to keep for disambiguation.

The tasks file has:

VectorSearch: vector//publisher backstory # freq=194856 freq=148
VectorSearch: vector//many geografia # freq=99550 freq=104
VectorSearch: vector//many foundation # freq=99550 freq=10894
VectorSearch: vector//this school # freq=238551 freq=29912
VectorSearch: vector//such 2007 # freq=111526 freq=90200 1.2
VectorSearch: vector//year work # freq=175324 freq=102732 1.7
VectorSearch: vector//interviews # freq=31768
VectorSearch: vector//golf # freq=31760
VectorSearch: vector//http # freq=389790

The benchy then computes an embedding from each of these lexical terms, and creates a KnnFloatVectorQuery for each.
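For readers unfamiliar with the tasks file, here is a minimal sketch of how one of the lines above could be split into its parts, so the lexical text is still available when the query is built. This is not luceneutil's actual parser: `VectorTask`, `parse`, and the name `prefix` for the part before `//` are hypothetical, and it assumes everything after `#` is a comment.

```java
// Hypothetical helper, not part of luceneutil: splits a tasks-file line such as
// "VectorSearch: vector//publisher backstory # freq=194856 freq=148".
record VectorTask(String category, String prefix, String text) {

  static VectorTask parse(String line) {
    // Assumption: everything from '#' onward is a trailing comment.
    int hash = line.indexOf('#');
    String body = (hash >= 0 ? line.substring(0, hash) : line).trim();

    // "VectorSearch: vector//publisher backstory" -> category + remainder.
    int colon = body.indexOf(':');
    if (colon < 0) {
      throw new IllegalArgumentException("missing category: " + line);
    }
    String category = body.substring(0, colon).trim();
    String rest = body.substring(colon + 1).trim();

    // "vector//publisher backstory" -> prefix + lexical text.
    int sep = rest.indexOf("//");
    if (sep < 0) {
      throw new IllegalArgumentException("missing '//' separator: " + line);
    }
    String prefix = rest.substring(0, sep).trim();
    String text = rest.substring(sep + 2).trim();
    return new VectorTask(category, prefix, text);
  }
}
```

The `text` component ("publisher backstory") is exactly the human-meaningful piece that gets lost once only the embedding is handed to the query.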

But then later, if something goes wrong, the toString of these queries just renders the float value of the first dimension:

TASK: cat=VectorSearch q=KnnFloatVectorQuery:vector[0.02625591,...][100] s=null group=null hits=100 facets=[]

I realize from the machine's standpoint it really is only this vector that "matters", but we humans still think in terms of words (so far, anyways, heh). Could we maybe allow for an optional, opaque string that does not count towards hashCode/equals/etc. and is then regurgitated back out in toString, to help us humans that still need to interact with the machines?

If we had this, I could have made the correct fix over in https://github.com/mikemccand/luceneutil/issues/226 to try to gain back some continuity in the vector nightly charts. But instead I just picked the top 5 vector queries, which is most likely wrong. Also, there is precedent in Lucene for such "opaque for-human strings": the String resourceDescription passed to the base IndexInput constructor.
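To make the request concrete, here is a minimal standalone sketch of a query-like class that carries an optional description, leaves it out of equals/hashCode, and echoes it in toString. It is only an illustration under stated assumptions: `DescribedVectorQuery` and its fields are hypothetical and this is not Lucene's KnnFloatVectorQuery, nor necessarily how such a change would be implemented there. The toString layout is likewise made up for the example.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical stand-in for a KNN vector query; NOT Lucene's actual class.
final class DescribedVectorQuery {
  private final String field;
  private final float[] target;
  private final int k;
  private final String description; // for humans only; may be null

  DescribedVectorQuery(String field, float[] target, int k, String description) {
    if (target.length == 0) {
      throw new IllegalArgumentException("target must have at least one dimension");
    }
    this.field = Objects.requireNonNull(field);
    this.target = target.clone();
    this.k = k;
    this.description = description;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof DescribedVectorQuery)) return false;
    DescribedVectorQuery other = (DescribedVectorQuery) o;
    // The description is deliberately not compared.
    return k == other.k
        && field.equals(other.field)
        && Arrays.equals(target, other.target);
  }

  @Override
  public int hashCode() {
    // The description is deliberately not hashed.
    return Objects.hash(field, k, Arrays.hashCode(target));
  }

  @Override
  public String toString() {
    // Only the first dimension is printed, plus the description when present.
    String label = description == null ? "" : "(" + description + ")";
    return "DescribedVectorQuery:" + field + label + "[" + target[0] + ",...][" + k + "]";
  }
}
```

With this shape, two queries over the same vector stay interchangeable for caching and deduplication regardless of their labels, while logs still show the words the vector came from.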

slow-J commented 11 months ago

I think that we could simply add a resourceDescription field to AbstractKnnVectorQuery and modify toString in the implementations so that the output would look something like these examples:

resourceDescription = "publisher backstory"
TASK: cat=VectorSearch q=KnnFloatVectorQuery:vector:publisher backstory[0.02625591,...][100] s=null group=null hits=100 facets=[]

resourceDescription = ""
TASK: cat=VectorSearch q=KnnFloatVectorQuery:vector:[0.02625591,...][100] s=null group=null hits=100 facets=[]

resourceDescription = null
TASK: cat=VectorSearch q=KnnFloatVectorQuery:vector:null[0.02625591,...][100] s=null group=null hits=100 facets=[]

Would this allow us to move forward with the benchmarker fix in https://github.com/mikemccand/luceneutil/issues/226 ?
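As a rough sketch of the proposed rendering, the q=... fragment in the three examples above can be reproduced with plain string concatenation, which prints a null reference as the text "null". The class and method names below (`KnnToStringSketch`, `render`) are hypothetical and the code stands alone rather than modifying AbstractKnnVectorQuery.

```java
// Hypothetical illustration of the proposed toString layout; not Lucene code.
final class KnnToStringSketch {

  // Layout: <ClassName>:<field>:<resourceDescription>[<first dimension>,...][<k>]
  static String render(
      String className, String field, String resourceDescription, float[] target, int k) {
    // Java string concatenation renders a null reference as "null",
    // which is what produces the third example's output.
    return className + ":" + field + ":" + resourceDescription
        + "[" + target[0] + ",...][" + k + "]";
  }
}
```

One design question this surfaces is whether a null description should print "null" as shown, or be omitted along with the extra colon so that existing toString output stays unchanged when no description is supplied.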