Open benwtrent opened 5 months ago
Agree that bit vectors are going to remain. I think there are several use cases outside of semantic search as well, like image hash matching.
I am wondering if use cases will arise for different non-standard levels of precision, such as 2-bit and 4-bit and how these would fit in. It seems like to support these, we could generalize BYTE type to specify level of precision (I think this is implicit in approach 1). But this would probably add a good amount of branching to the implementation. That being said, I think I lean towards BIT VectorEncoding.
Also, for approach 1, do you know of any use cases for hamming distance on > 1 bit vectors?
Description
In previous discussions that have occurred over various PRs and issues, we have talked about adding "hamming" distance as a new vector similarity function.
This idea was rejected as we currently have no simple way of deprecating and adjusting our supported vector similarities.
enums
are notoriously bad at allowing things to evolve as binary compatibility necessitates the types be immutable.Bit-vectors aren't going away, so I wanted to broach the idea with two options, though, they have similar draw backs.
The first option is we simply add
hamming
distance once we move away from the enum types to an id/nominal based system for the similarities.cosine
is already deprecated and we will remove it for v10. Though,hamming
is only applicable to vectors encoded asbyte[]
.Another option is new
BIT
VectorEncoding. This incurs many of the same concerns around adding a new similarity as VectorEncoding is anotherenum
, and it would require a new interface to the existing similarities (e.g.bitCompare
).For
euclidean
, we would do pop-countxor
(akahamming
) as for bit vectors these two operations are equivalent.For the
dot-product
set of similarities (cosine, dot product, max-inner product), we would do pop-countand
.Besides the fact that adding a new encoding is also updating an
enum
that is bwc forever, the storage forBIT
vectors would be the same asBYTE
, logic would have to be added to ensure the correct similarities are utilized when running vector comparisons.