apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

int4 doesn't handle `cosine` similarity correctly #13614

Closed benwtrent closed 2 months ago

benwtrent commented 2 months ago

Description

With default settings, glove-200 does poorly with int4 HNSW when using cosine. The bug occurs on merge: when the quantiles are recalculated, the vectors aren't normalized as they should be, so the quantiles come out badly skewed. We can see this in some of the experiments below. Every value should be consistent with a normalized vector, yet some of the results are > 1 or < -1.
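To illustrate why quantiles outside [-1, 1] point to missing normalization, here is a minimal Python sketch. It is not Lucene's code: `normalize`, `quantiles`, and the sample vectors are hypothetical, and the quantile rule is a simplified stand-in for whatever the quantizer really does. The only fact it relies on is that every component of a unit-length vector lies in [-1, 1].

```python
import math


def normalize(vec):
    """Return a unit-length (L2) copy of vec."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def quantiles(vectors, confidence_interval):
    """Simplified stand-in: bounds that keep the central
    `confidence_interval` fraction of all vector components."""
    values = sorted(x for vec in vectors for x in vec)
    tail = (1.0 - confidence_interval) / 2.0
    lo_idx = int(tail * (len(values) - 1))
    hi_idx = len(values) - 1 - lo_idx
    return values[lo_idx], values[hi_idx]


# Hypothetical raw (unnormalized) vectors, for illustration only.
raw = [[3.0, -4.0, 0.5], [10.0, 2.0, -7.0], [-6.0, 8.0, 1.0]]

# Quantiles over the raw vectors can land far outside [-1, 1].
raw_lower, raw_upper = quantiles(raw, 0.9)

# Quantiles over normalized copies must stay inside [-1, 1].
unit_lower, unit_upper = quantiles([normalize(v) for v in raw], 0.9)

print("raw:       ", raw_lower, raw_upper)
print("normalized:", unit_lower, unit_upper)
```

If a quantile bound for cosine data falls outside [-1, 1], at least some of the vectors that produced it cannot have been unit length.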

Some experiments I have done:

Dynamic confidence interval:

0.5 confidence interval (locally patched to allow it):

0.75 confidence interval (locally patched to allow it):

0.9 confidence interval:

benwtrent commented 2 months ago

Actually, this might be a bug. Looking at the code, I am not sure we normalize the vectors when building the quantizer while using cosine.
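A rough sketch of why mismatched quantiles would hurt int4 specifically, assuming a plain linear scalar quantizer with 16 buckets. `quantize_int4` and all of the bounds and component values below are hypothetical, not Lucene's implementation. If the stored vectors are normalized but the bounds were computed from unnormalized data, the small normalized components crowd into very few buckets.

```python
def quantize_int4(value, lower, upper):
    """Clamp value to [lower, upper] and map it linearly
    onto the 16 int4 buckets 0..15 (illustrative only)."""
    clamped = min(max(value, lower), upper)
    return round((clamped - lower) / (upper - lower) * 15)


# Hypothetical components of a normalized high-dimensional vector.
unit_components = [-0.12, -0.05, 0.0, 0.04, 0.11]

# Bounds that match the normalized data spread values across buckets.
good = [quantize_int4(x, -0.12, 0.11) for x in unit_components]

# Hypothetical bounds taken from unnormalized data: far too wide.
bad = [quantize_int4(x, -7.0, 10.0) for x in unit_components]

print("matching bounds:  ", good)
print("mismatched bounds:", bad)
```

With the mismatched bounds every component lands in the same bucket, so the quantized vectors become nearly indistinguishable. With only 16 levels there is little slack to absorb a wrong range.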

benwtrent commented 2 months ago

Yep, not normalizing during quantile merging is the actual bug. I have a local patch that fixes this. Need to write up some tests & will push up a change soon.

Good find @naveentatikonda!