apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add per-field knn vector format info in SegmentInfo #13367

Closed tteofili closed 1 month ago

tteofili commented 1 month ago

When indexing vectors, it is possible to use different vector formats depending on the field. It is also possible (although not currently implemented) to have Codecs that provide different vector formats "dynamically", even for the same field. To better debug such situations, it would be helpful to have per-field vector format information within SegmentCommitInfo (e.g. within the attributes).

This trivial PR adds KnnVectorsFormat#name for each field to SegmentInfo#attributes in PerFieldKnnVectorsFormat. If a doc with field1 is indexed with Lucene99HnswVectorsFormat and a doc with field2 is indexed with Lucene99HnswScalarQuantizedVectorsFormat within the same segment, the corresponding SegmentInfo#attributes will have the following entries:

jpountz commented 1 month ago

I'm a bit confused: what is the benefit of having it on segment infos in addition to field infos?

tteofili commented 1 month ago

You're right @jpountz, we can probably get away with fieldInfo.getAttribute(PerFieldKnnVectorsFormat.PER_FIELD_FORMAT_KEY). I didn't notice that, thanks!
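
For anyone landing here later, a minimal sketch of that lookup. It does not depend on Lucene: `FieldInfoStub` is a hypothetical stand-in for `FieldInfo`, and the key string is an assumption about the value of `PerFieldKnnVectorsFormat.PER_FIELD_FORMAT_KEY`. The format names are the ones from the example above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerFieldFormatLookup {
  // Assumption: mirrors the value of PerFieldKnnVectorsFormat.PER_FIELD_FORMAT_KEY.
  static final String PER_FIELD_FORMAT_KEY = "PerFieldKnnVectorsFormat.format";

  /** Hypothetical stand-in for Lucene's FieldInfo: a field name plus string attributes. */
  static final class FieldInfoStub {
    final String name;
    final Map<String, String> attributes;

    FieldInfoStub(String name, Map<String, String> attributes) {
      this.name = name;
      this.attributes = attributes;
    }

    String getAttribute(String key) {
      return attributes.get(key);
    }
  }

  /** Collects field name -> vector format name, skipping fields without the attribute. */
  static Map<String, String> formatsByField(List<FieldInfoStub> fieldInfos) {
    Map<String, String> formats = new LinkedHashMap<>();
    for (FieldInfoStub fi : fieldInfos) {
      String format = fi.getAttribute(PER_FIELD_FORMAT_KEY);
      if (format != null) {
        formats.put(fi.name, format);
      }
    }
    return formats;
  }

  public static void main(String[] args) {
    List<FieldInfoStub> infos = List.of(
        new FieldInfoStub("field1",
            Map.of(PER_FIELD_FORMAT_KEY, "Lucene99HnswVectorsFormat")),
        new FieldInfoStub("field2",
            Map.of(PER_FIELD_FORMAT_KEY, "Lucene99HnswScalarQuantizedVectorsFormat")),
        // A non-vector field carries no per-field vector format attribute.
        new FieldInfoStub("title", Map.of()));
    formatsByField(infos)
        .forEach((field, format) -> System.out.println(field + " -> " + format));
  }
}
```

Against a real index, the same loop would presumably run over each segment's `FieldInfos` (for example via `LeafReader#getFieldInfos()`) and call `FieldInfo#getAttribute` with the real constant, so no extra segment-level attribute is needed for debugging.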