Adding bit vector support

Description

In previous discussions across various PRs and issues, we have talked about adding "hamming" distance as a new vector similarity function.

This idea was rejected because we currently have no simple way to deprecate or adjust our supported vector similarities. Enums are notoriously bad at allowing things to evolve, since binary compatibility requires the types to be immutable.

Bit vectors aren't going away, so I wanted to broach the idea with two options, though they have similar drawbacks.

The first option is to simply add hamming distance once we move away from the enum types to an id/nominal-based system for the similarities. Cosine is already deprecated, and we will remove it in v10. Note, though, that hamming is only applicable to vectors encoded as byte[].

The other option is a new BIT VectorEncoding. This raises many of the same concerns as adding a new similarity, since VectorEncoding is another enum, and it would require a new interface on the existing similarities (e.g. bitCompare).

For euclidean, we would do pop-count of xor (aka hamming), since for bit vectors the two operations are equivalent: the squared difference between two 0/1 components is 1 exactly when the bits differ.
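
To make the equivalence concrete, here is a minimal sketch, assuming bit vectors are packed 8 bits per byte into a byte[]. The class and method names are hypothetical and are not Lucene APIs; the second method exists only as a slow reference to compare against.

```java
// Hypothetical sketch; none of these names are Lucene APIs.
final class BitHammingSketch {

  // Hamming distance between two bit vectors packed 8 bits per byte.
  // XOR leaves a 1 wherever the inputs differ; the pop-count tallies those 1s.
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " != " + b.length);
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension of negative bytes adds no spurious 1s.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  // Reference: unpack every bit to 0 or 1 and sum the squared differences.
  static int squaredEuclideanOverBits(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " != " + b.length);
    }
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      for (int bit = 0; bit < 8; bit++) {
        int diff = ((a[i] >> bit) & 1) - ((b[i] >> bit) & 1);
        sum += diff * diff;
      }
    }
    return sum;
  }
}
```

Both methods return the same value for any pair of equal-length inputs, which is why a BIT encoding could route euclidean to the cheaper xor and pop-count path.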

For the dot-product family of similarities (cosine, dot product, max-inner product), we would do pop-count of and.
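
The same idea for the dot-product case can be sketched as follows, again assuming 8 bits packed per byte. The names are hypothetical, and the sketch only computes the raw dot product; how each similarity would scale or normalize that value into a score is left open here.

```java
// Hypothetical sketch; none of these names are Lucene APIs.
final class BitDotProductSketch {

  // Dot product of two 0/1 vectors packed 8 bits per byte.
  // A component contributes 1 only when both bits are set, which is what
  // AND followed by a pop-count measures.
  static int dotProduct(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " != " + b.length);
    }
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so two negative bytes do not contribute sign-extension bits.
      sum += Integer.bitCount((a[i] & b[i]) & 0xFF);
    }
    return sum;
  }
}
```

Taking the dot product of a vector with itself returns the number of set bits in that vector.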

Adding a new encoding also means updating an enum that must stay backwards compatible forever. Beyond that, the storage for BIT vectors would be the same as for BYTE, so logic would have to be added to ensure the correct similarities are used when running vector comparisons.
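
As a rough illustration of that extra logic, the sketch below shows how the same byte[] storage yields different results depending on the encoding it is interpreted under. The Encoding and Similarity enums and the rawCompare method are hypothetical stand-ins defined only for this example, not the real Lucene types, and the values returned are raw distances or products rather than scores.

```java
// Hypothetical sketch; the enums and method here are illustrative stand-ins,
// not Lucene's VectorEncoding or similarity types.
final class EncodingDispatchSketch {

  enum Encoding { BYTE, BIT }

  enum Similarity { EUCLIDEAN, DOT_PRODUCT }

  // Raw comparison over identical storage, interpreted per encoding.
  static int rawCompare(Encoding encoding, Similarity similarity, byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " != " + b.length);
    }
    int result = 0;
    for (int i = 0; i < a.length; i++) {
      if (encoding == Encoding.BIT) {
        // Each byte holds 8 one-bit components: xor for euclidean, and for dot product.
        int bits = similarity == Similarity.EUCLIDEAN ? (a[i] ^ b[i]) : (a[i] & b[i]);
        result += Integer.bitCount(bits & 0xFF);
      } else {
        // Each byte is a single signed 8-bit component.
        int diff = a[i] - b[i];
        result += similarity == Similarity.EUCLIDEAN ? diff * diff : a[i] * b[i];
      }
    }
    return result;
  }
}
```

For the single-byte vectors {3} and {1}, the euclidean comparison gives 4 under BYTE but 1 under BIT, so picking the wrong path would silently produce different values from the same stored data.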

apache / lucene

Adding bit vector support #13505