apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.61k stars 1.02k forks source link

Adding bit vector support #13505

Open benwtrent opened 3 months ago

benwtrent commented 3 months ago

Description

In previous discussions that have occurred over various PRs and issues, we have talked about adding "hamming" distance as a new vector similarity function.

This idea was rejected as we currently have no simple way of deprecating and adjusting our supported vector similarities. enums are notoriously bad at allowing things to evolve as binary compatibility necessitates the types be immutable.

Bit-vectors aren't going away, so I wanted to broach the idea with two options, though, they have similar draw backs.

The first option is we simply add hamming distance once we move away from the enum types to an id/nominal based system for the similarities. cosine is already deprecated and we will remove it for v10. Though, hamming is only applicable to vectors encoded as byte[].

Another option is new BIT VectorEncoding. This incurs many of the same concerns around adding a new similarity as VectorEncoding is another enum, and it would require a new interface to the existing similarities (e.g. bitCompare).

For euclidean, we would do pop-count xor (aka hamming) as for bit vectors these two operations are equivalent.

For the dot-product set of similarities (cosine, dot product, max-inner product), we would do pop-count and.

Besides the fact that adding a new encoding is also updating an enum that is bwc forever, the storage for BIT vectors would be the same as BYTE, logic would have to be added to ensure the correct similarities are utilized when running vector comparisons.

jmazanec15 commented 3 months ago

Agree that bit vectors are going to remain. I think there are several use cases outside of semantic search as well, like image hash matching.

I am wondering if use cases will arise for different non-standard levels of precision, such as 2-bit and 4-bit and how these would fit in. It seems like to support these, we could generalize BYTE type to specify level of precision (I think this is implicit in approach 1). But this would probably add a good amount of branching to the implementation. That being said, I think I lean towards BIT VectorEncoding.

Also, for approach 1, do you know of any use cases for hamming distance on > 1 bit vectors?