castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/

Indexing float vectors in Lucene #1452

Closed · thibault-formal closed this issue 3 years ago

thibault-formal commented 3 years ago

Hi, first off, thanks for the great package!

I would like to know if it is possible in Lucene to store custom float weights for terms. The use case would be to index the representations obtained with DeepCT directly, instead of rounding the weights, repeating the latent terms with the new term frequencies, and indexing the new collection. To be clear: suppose there is a float weight associated with each term in my documents; I want to know if I can index this and do retrieval based on, for instance, the dot product. I know little about Lucene, so I don't know if there is any fundamental reason that would discourage this (for instance, efficiency issues).

Thanks in advance for your feedback

lintool commented 3 years ago

I believe the way DeepCT actually does this is to create fake documents and duplicate the terms: if a term has weight 3, they just duplicate it 3 times, then take these fake documents and index them in Lucene as before. DeepCT normalizes weights to an integer range, so this works fine.

Lucene then indexes the "weight" as its term frequency - and you change the similarity (i.e., scoring function) at search time to whatever you want.
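
For concreteness, here's an untested sketch of that trick (a hypothetical helper, not DeepCT's or Anserini's actual code): quantize each float weight to an integer and repeat the term that many times, then index the resulting string as an ordinary document.

```java
import java.util.Map;
import java.util.StringJoiner;

// Sketch: turn {term -> float weight} into a "fake document" whose term
// frequencies encode the (quantized) weights. The scale factor trades
// quantization error against document length.
public class FakeDocBuilder {
  public static String build(Map<String, Float> termWeights, float scale) {
    StringJoiner doc = new StringJoiner(" ");
    for (Map.Entry<String, Float> e : termWeights.entrySet()) {
      int tf = Math.round(e.getValue() * scale);  // float weight -> integer tf
      for (int i = 0; i < tf; i++) {
        doc.add(e.getKey());
      }
    }
    return doc.toString();
  }
}
```

At search time you would then plug in a `Similarity` whose score is driven directly by the stored term frequency (e.g., a plain dot product) rather than BM25's saturated tf.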

Alternatively, look at this: https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md

thibault-formal commented 3 years ago

Hi @lintool, thanks for your answer!

That's why I was wondering if there is an easy way to keep the float information in Lucene.

Just to be clear: there is no easy way in Lucene to index a collection where, instead of storing the term frequency for each document term, we would store a float value (a weight for this term, pre-computed with a model like DeepCT)? And the most reasonable way to proceed is to round those values and go with the fake-word approach?

lintool commented 3 years ago

That is correct.

For what you want, you could use HNSW, as we did here: https://dl.acm.org/doi/abs/10.1145/3409256.3409818

We indexed BM25 vectors using HNSW, but you could generate your own arbitrary vectors.

thibault-formal commented 3 years ago

OK great, thanks for the clarifications!

tteofili commented 3 years ago

You can actually do scoring with payloads (e.g., https://stackoverflow.com/questions/6493249/lucene-payload-scoring), but using fake words would be more space- and time-efficient most of the time.
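
For reference, an untested sketch of the payload route using standard Lucene payload classes (package locations vary a bit across Lucene versions, and the field name is made up). Documents are fed to the analyzer as `term|weight` pairs, e.g. `apple|0.83 banana|1.2`:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.payloads.PayloadDecoder;
import org.apache.lucene.queries.payloads.PayloadScoreQuery;
import org.apache.lucene.queries.payloads.SumPayloadFunction;
import org.apache.lucene.search.spans.SpanTermQuery;

public class PayloadWeights {
  // Index-time side: parse "term|weight" tokens and store each weight
  // as a 4-byte float payload on the term position.
  static Analyzer payloadAnalyzer() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink =
            new DelimitedPayloadTokenFilter(source, '|', new FloatEncoder());
        return new TokenStreamComponents(source, sink);
      }
    };
  }

  // Search-time side: a query whose document score is the sum of the
  // float payloads found at the term's positions.
  static PayloadScoreQuery weightedTerm(String field, String term) {
    return new PayloadScoreQuery(
        new SpanTermQuery(new Term(field, term)),
        new SumPayloadFunction(),
        PayloadDecoder.FLOAT_DECODER,
        false);  // don't multiply in the underlying span score
  }
}
```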

On another note, you could index FloatPoints of up to 8 dimensions and perform NN search in Lucene (which uses a KD-tree) with https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/document/FloatPointNearestNeighbor.java.
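
Roughly like this; an untested sketch against the Lucene 8.x sandbox API (the `nearest` signature may differ in other versions):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FloatPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.sandbox.document.FloatPointNearestNeighbor;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class KdTreeNN {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // Index a low-dimensional (<= 8 dims) float vector as a FloatPoint,
      // which is stored in Lucene's KD-tree (points) structure.
      Document doc = new Document();
      doc.add(new FloatPoint("vec", 0.1f, 0.7f, 0.2f));
      writer.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      // Exact nearest-neighbor search over the KD-tree.
      TopFieldDocs hits = FloatPointNearestNeighbor.nearest(
          new IndexSearcher(reader), "vec", 1, 0.1f, 0.6f, 0.3f);
      System.out.println("nearest doc id: " + hits.scoreDocs[0].doc);
    }
  }
}
```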

thibault-formal commented 3 years ago

Thank you both! I have a few last questions:

I quickly checked your paper "Lucene for Approximate Nearest Neighbors Search on Arbitrary Dense Vectors", and I wanted to be sure I understood what you did:

tteofili commented 3 years ago

> Just to be sure: for the fake-word approach, each dimension in the embedding space basically corresponds to a fake/latent word, right? So it means that the "vocabulary" size is equal to the embedding dimension?

Yes, correct.
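
To make the encoding concrete, an illustrative sketch (not Anserini's exact implementation): dimension `i` becomes a fake word, and the quantized component value becomes its term frequency, so the vocabulary has exactly as many words as the vector has dimensions.

```java
import java.util.StringJoiner;

// Sketch: encode a dense vector as a fake-words document. Dimension i
// becomes the token "f" + i, repeated round(v[i] * q) times, where q is
// the quantization factor.
public class FakeWordsEncoder {
  public static String encode(float[] vector, int q) {
    StringJoiner doc = new StringJoiner(" ");
    for (int i = 0; i < vector.length; i++) {
      int tf = Math.round(vector[i] * q);  // assumes non-negative components
      for (int j = 0; j < tf; j++) {
        doc.add("f" + i);
      }
    }
    return doc.toString();
  }
}
```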

> Because everything is dense, it means that the size of the posting list for each fake word == the size of the collection? Is this problematic efficiency-wise?

We didn't perform a specific analysis of the efficiency impact of this aspect; however, it is comparable to having an overly common word in a text dataset: you end up with a long posting list mentioning most of the documents in the collection. Lucene is expected to handle this well; it currently leverages an improved version of LZ4 for compressing integers.

> You mention in the paper: "we observe a large number of terms that are generated at indexing time, which significantly reduces search performance. To combat this, we filter highly-frequent terms at search time." Could you give a bit more detail about this? In particular, what is a typical value for the threshold?

This is addressed in the following lines: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/ann/ApproximateNearestNeighborSearch.java#L173-L181
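
Conceptually, the filtering amounts to dropping query terms whose document frequency exceeds some fraction of the collection; a rough sketch of the idea (the actual threshold and logic are in the lines linked above; the `cutoff` parameter here is illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class FrequentTermFilter {
  // Drop fake words whose posting lists cover more than cutoff * numDocs
  // documents: they carry little information but dominate search cost.
  static List<String> filter(IndexReader reader, String field,
      List<String> terms, float cutoff) throws IOException {
    int maxDf = (int) (cutoff * reader.numDocs());
    List<String> kept = new ArrayList<>();
    for (String t : terms) {
      if (reader.docFreq(new Term(field, t)) < maxDf) {
        kept.add(t);
      }
    }
    return kept;
  }
}
```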