Closed jmazanec15 closed 1 year ago
@benwtrent Sure, I think flip will reverse the dimensions as well as the order:
np_flat_ds_sorted = np_flat_ds[indices]
np_flat_ds_reverse = np.flip(np_flat_ds_sorted)
np_flat_ds_reverse[-1][0]
-0.18350956
np_flat_ds_sorted[0][0]
0.2331585
np_flat_ds_sorted[0][-1]
-0.18350956
np_flat_ds_reverse = np.flip(np_flat_ds_sorted, axis=0)
That being said, I thought I used your code for this - Im not sure how I got different numbers. I will retry reverse for both transform.
@searchivarius yes pareto curve will be good. I think this is possible in luceneutil. Want to first just compare with results vespa got.
I ran on some more ef_search values and here is a plot, these are the reversed, transformed vs. non-transformed.
It seems eventually, without transformation, even in the adversarial case, we can reach similar recall. I am not sure what other proof we need.
ef_search | transformed_recall | transform_latency | baseline_recall | baseline_latency |
---|---|---|---|---|
10 | 0.211 | 0.51 | 0.145 | 0.41 |
20 | 0.34 | 0.84 | 0.237 | 0.64 |
60 | 0.574 | 2.18 | 0.45 | 1.4 |
100 | 0.691 | 2.8 | 0.553 | 2.4 |
200 | 0.814 | 5.04 | 0.709 | 3.9 |
500 | 0.926 | 12.51 | 0.878 | 8.31 |
600 | 0.941 | 14.04 | 0.902 | 9.91 |
1000 | 0.967 | 19.84 | 0.956 | 14.85 |
2000 | 0.986 | 35.51 | 0.987 | 24.47 |
@benwtrent thank you for the plot: however, the X axis is ef value and not latency. Could you plot a recall vs latency plot?
@searchivarius here is a google drive sheet so you can copy and mutate how you wish: https://docs.google.com/spreadsheets/d/1YvLWb3bYCgsBWqvwR1gJ3EMITzSGq20Z0nqy6so4ITA/edit?usp=sharing
Here is both recalls over latency, I confess I am not sure how to best compare the two result sets.
Here is baseline:
Here is the transformed:
@benwtrent @jmazanec15 after fiddling for 10 minutes with Google docs, I made a plot with Chat GPT (maybe I should have asked Bard instead)? Anyways, see the attached notebook and the attached picture. There seems to be virtually no difference between the original and the transformed runs.
That said, for high recall area (latency >= 15). The original run is better (compare two rightmost data points).
In terms of additional evidence required: I think we might want to do due diligence and to run more extensive tests, e.g., several datasets. @jmazanec15 publicizing results won't harm either, b/c we don't want people to do extra work when extra work is not needed. Moreover, if some clients do want to apply the transform for some reason, it can be easily done outside of the retrieval system, there's no need for the retrieval system to support it directlry.
Thanks @searchivarius sounds good.
Additonally, @benwtrent in regards to https://github.com/apache/lucene/issues/12342#issuecomment-1650293826, I ran the non-transformed reverse test when using np.flip(np_flat_ds_sorted, axis=0)
and got pretty high recall numbers:
0.710 0.86 400000 0 48 200 10 2135520 1.00 post-filter
0.973 4.03 400000 90 48 200 100 2150331 1.00 post-filter
0.991 6.72 400000 190 48 200 200 2151494 1.00 post-filter
0.998 14.23 400000 490 48 200 500 2169778 1.00 post-filter
Did you change this for your tests? I can repeat the set of tests again, but want to just confirm everything looks good
@benwtrent Sure, I think flip will reverse the dimensions as well as the order:
LOL, let me re-run 🤦.
I think we might want to do due diligence and to run more extensive tests, e.g., several datasets
Do we have any additional data sets to test against? Its important to use datasets that encode information in the magnitude and cohere multi-lingual is the only one I know of offhand.
@benwtrent one can use some semi-synthetic data as well. We can take any embedding data set and scale each vector randomly. Scaling coefficient can be from a shifted normal: 1 + N(0, sigma). Anyways, let's re-run the existing benchmark first.
I fixed my bug (thanks @jmazanec15!). I will never forget how numpy#flip
works now...
Here is my new run.
ef_search | transformed_recall | transform_latency | baseline_recall | baseline_latency |
---|---|---|---|---|
10 | 0.757 | 0.45 | 0.710 | 0.35 |
20 | 0.868 | 0.71 | 0.828 | 0.49 |
60 | 0.962 | 1.59 | 0.948 | 1.10 |
100 | 0.981 | 2.45 | 0.973 | 1.61 |
200 | 0.993 | 4.23 | 0.991 | 2.81 |
500 | 0.998 | 8.75 | 0.998 | 5.82 |
600 | 0.998 | 10.10 | 0.998 | 6.71 |
1000 | 0.999 | 14.80 | 1.00 | 9.71 |
Cool. Now thr nontransformed run is better.
OK ran on two more datasets, I only ran over the first 100k documents in my data sets.
EDIT: found a minor bug, the 'baseline' queries were being overwritten by the 'transform' ones (the ones with just a 0
at the 769th index). I reran to verify if results were effected; the weren't. Mainly because to measure recall, we take the brute-force kNN from the query file and compare to the approx-kNN. The data sets and the index built were not effected by this bug. If any more are found, I will happily re-run. Putting my M1 macbook to work :).
I think this exhausts our testing for Cohere, we need to find additional data if this is not considered enough.
Reversed | Ordered | Random |
---|---|---|
Reversed | Ordered | Random |
---|---|---|
@benwtrent many thanks! This is even more convincing since now we have two cases where original is much better than transformed and one where it is roughly at par.
OK, here (unless we have done these incorrectly), is the final Cohere test (IMO). This is mixing English and Japanese embeddings (just in case the cohere model encodes info for language in magnitude). To mix the language embeddings, I appended the embedded rows together, shuffled them and then ran the previous testing procedure.
Reversed | Ordered | Random |
---|---|---|
! |
Q: just want to make sure I understand what the transformation is that you are testing here. Is it that you are taking non-unit vectors and making them into unit vectors by dividing by the max length in the corpus and then adding another dimension to force the vector to be unit length? And then comparing this to max-inner-product on vectors of varying length? If so, what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors?
@msokolov the transformation also preserves the inner product up to a query-specific constant.
what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors
Yes, the training process often uses the inner/doct product and not the cosine similarity, so some information is being encoded by the vector length/magnitude. Typically, however, the variance isn't that huge.
@msokolov In our BEIR paper we talked about this: https://arxiv.org/abs/2104.08663
The issue with cosine similarity is that it just encodes the topic. For the query What is Pytorch
, and you have two docs:
A:
Pytorch is a framework
B:
PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface
A cosine similarity model would retrieve A (it is more on-topic). A dotproduct model would retrieve B, as it is on-topic as well and contains more information on the topic (represented as a longer vector).
I.e. cossim(query, A) > cossim(query, B) but dotprod(query, A) < dotprod(query, B).
So dotproduct models can encode topic + quantity/quality of information, while cosine similarity models just encode topic match (which isn't perfect for retrieval).
I found another dataset, Yandex Text-to-image: https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search
I tested against the first 500_000 values in the 1M dataset.
It utilizes inner-product, looking at magnitudes, they are all < 1. So this dataset might not be that useful :/. The magnitudes range from 0.79 - 0.99.
I am just looking for more realistic inner-product search data sets and found this one.
Reversed | Ordered | Random |
---|---|---|
@benwtrent thank you! despite this dataset is almost normalized, the non-transformed approach is still better for at least two scenarios (ordered and random).
@searchivarius && @msokolov do we think any additional testing should be done? Or are we comfortable with https://github.com/apache/lucene/pull/12479 going through as is with the testing done?
FYI, I reran my tests locally. I re-read the requirements and it seemed that in the "transformed" case, we SHOULD be using euclidean
as the similarity comparator. My tests used "dot-product" (angular).
So, non-transformed == max-inner product and trainsformed == euclidean.
In my local reruns, there is almost no difference. The random case looks very similar, the ordered & reverse cases are more equal between the two runs, but the pareto graphs are effectively the same :/
Hi @benwtrent first question:
do you have your test code publicly available?
we SHOULD be using euclidean as the similarity comparator. My tests used "dot-product" (angular).
Second, great you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm, I think any constant would suffice:
L2=(a-b)^2 = |a|^2 - ab + |b|^2 = const1 - cosine_similarity(a,b) * const2
What do you think? cc @jmazanec15
Additional experiments are always good. I have no immediate suggestion for realistic embeddings, though.
However, one can try to stress test the method by synthetically changing the data: Multiply or divide vector elements by a random uniform number in the range [1, M]. For large enough M, the transform might become beneficial.
Here are my results for the en-ja mixed dataset.
max inner product in baseline, euclidean in transformed
Reversed | Ordered | Random |
---|---|---|
@benwtrent thank you for re-running, much appreciated! The results look the same, but do we agree that after the transform is done, all three similarities / distances should be equivalent for the purpose of k-NN search? This is due to all documents being normalized to the same length (and query length is the same during the search, simply because it's the very same query). The equivalence is more obvious when queries and documents all have unit length. However, I think it is still true assuming consistent normalization of documents.
PS: @benwtrent could you publish benchmark code? We could possibly extend them, e.g., add new tests.
@searchivarius my latest comment has the updated knnPerfTest.py
that I have been using via Lucene Util.
I use a build of Lucene that utilizes the dot-product scaling suggested by @jmazanec15 .
I take the console output and copy/paste into the extraction & plotting code given in this comment: https://github.com/apache/lucene/issues/12342#issuecomment-1656112434
All very manual.
Second, great you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm, I think any constant would suffice:
L2=(a-b)^2 = |a|^2 - ab + |b|^2 = const1 - cosine_similarity(a,b) * const2
What do you think? cc @jmazanec15
@searchivarius yes I believe you are correct. Attached is a small proof that ordering is the same between euclidean distance and negative dot product when norm of all vectors is the same (please double check that I did not make any mistakes)
Seems to be correct (after checking what's exactly a parallelogram law is). However, as I mentioned above a simpler expansion of the dot product gives us that L2 is a monotonic transformation of the cosine similarity (and dot product as well, b/c norms are constant). In that the monotonicity preserves the order. I just wanted so make sue my simple derivation of:
L2=(a-b)^2 = |a|^2 - 2*(a,b) + |b|^2 = const1 - cosine_similarity(a,b) * const2
looked plausible. PS: forgot 2 before (a,b), but an extra constant doesn't change the outcome.
@searchivarius I see, yes thats correct.
Here are some results on synthetic datasets. I utilized max-inner-product (scaled to prevent negatives) for baseline, euclidean & and the typical transformation for transform.
Here is the code for generating the scaled data: https://gist.github.com/benwtrent/be084f4737f79dab7204eca3e6ab98fe
Magnitude stats:
mean median var max min
5.768894195556641 5.769467830657959 0.054713960736989975 6.783757209777832 4.727940082550049
Reversed | Ordered | Random |
---|---|---|
mean median var max min
10.196664810180664 10.195568084716797 0.03208443149924278 11.093703269958496 9.358368873596191
Reversed | Ordered | Random |
---|---|---|
mean median var max min
4.014744281768799 3.8864400386810303 0.5584204792976379 33.431556701660156 2.2141001224517822
Reversed | Ordered | Random |
---|---|---|
mean median var max min
5.768894195556641 5.769467830657959 0.054713960736989975 6.783757209777832 4.727940082550049
Reversed | Ordered | Random |
---|---|---|
Two normal distributions with loc=1
and loc=3
, each multiplied by 0.5
mean median var max min
20.04954719543457 20.048864364624023 0.016332395374774933 20.65839195251465 19.469064712524414
Reversed | Ordered | Random |
---|---|---|
Two normal distributions with loc=1
and loc=3
, each multiplied by 0.9 & 0.1 respectively
mean median var max min
12.134689331054688 12.133764266967773 0.02648037299513817 12.946907997131348 11.34481430053711
Reversed | Ordered | Random |
---|---|---|
mean median var max min
14.072248458862305 13.95256233215332 1.9276633262634277 30.004880905151367 9.380194664001465
Reversed | Ordered | Random |
---|---|---|
mean median var max min
48.86893081665039 48.64986038208008 11.134326934814453 75.66817474365234 36.40324401855469
Reversed | Ordered | Random |
---|---|---|
Looking great, many thanks! Could you remind me what is ordered and reversed? This is something related to insertion order?
Yes, insertion order. So which docs are indexed first.
Ordered is sorted by magnitude in ascending order. So smallest magnitudes are indexed first.
Reverse is ordered reverse. So descending order by magnitudes (largest first)
Just to confirm the gamma numbers for the reversed case (as its the only one where transformed does better). I ran it again, but up to 2000 neighbors explored. The latency & recall graphs stay steady.
It makes sense that the transformation works better with the gamma distributions (but only when indexing the docs are ordered by magnitude) as it has by far the highest variance of all the others.
It is an exceptional case for data to not only have such a high variance, but the user purposefully indexes the documents sorted by their magnitude.
All our real datasets had a much lower variance. And in the random order case, baseline is still better than transformed. Even in the ordered & reverse cases, baseline was able to achieve 95% recall with just 1ms more latency.
Are we comfortable with moving forward and merging https://github.com/apache/lucene/pull/12479 bringing Max-inner-product into Lucene?
@benwtrent I am comfortable with it.
For future, if we ever decide to switch up the edge selection strategy where a node not only has to be smaller than a threshold distance, but smaller by a factor of X (I believe Vamana has a strategy like this), then we may need to take extra care that the scaling will work for the transformation - but this is not a problem now - and I believe there would be a similar problem with raw dot product anyway.
Currently, VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. For ref, this is the method:
While in the method javadoc there is a warning to normalize the vectors before using, I am wondering if we can get rid of this by mapping negative scores between 0 and 1 and positive scores between 1 and Float.Max with:
and let the user worry about normalization
Related issue: https://github.com/opensearch-project/k-NN/issues/865