apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores #12342

Closed jmazanec15 closed 1 year ago

jmazanec15 commented 1 year ago

Currently, the VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. For reference, this is the method:

public float compare(float[] v1, float[] v2) {
  return (1 + dotProduct(v1, v2)) / 2;
}

While the method's javadoc warns to normalize the vectors before use, I am wondering if we can get rid of this requirement by mapping negative scores into (0, 1) and non-negative scores into [1, Float.MAX_VALUE) with:

float dotProd = dotProduct(v1, v2);

if (dotProd < 0) {
  // negative dot products map into (0, 1)
  return 1 / (1 + -1 * dotProd);
}
// non-negative dot products map into [1, Float.MAX_VALUE)
return dotProd + 1;

and let the user worry about normalization
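For intuition, here is a small standalone sketch (Python; `scaled_dot_product_score` is just a hypothetical helper mirroring the snippet above, not an existing Lucene API) showing that the proposed mapping never produces a negative score and preserves the ordering of the raw dot products:

```python
import itertools
import numpy as np

def scaled_dot_product_score(v1, v2):
    # Hypothetical helper mirroring the proposal: negative dot products map
    # into (0, 1), non-negative ones into [1, Float.MAX_VALUE).
    dot_prod = float(np.dot(v1, v2))
    if dot_prod < 0:
        return 1 / (1 + -1 * dot_prod)
    return dot_prod + 1

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))  # deliberately NOT normalized

scores = [scaled_dot_product_score(a, b) for a, b in itertools.combinations(vectors, 2)]
assert min(scores) > 0  # no negative scores, whatever the magnitudes

# The mapping is monotonic in the raw dot product, so relative ranking is unchanged.
dots = sorted(float(np.dot(a, b)) for a, b in itertools.combinations(vectors, 2))
mapped = [1 / (1 - d) if d < 0 else d + 1 for d in dots]
assert mapped == sorted(mapped)
```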

Related issue: https://github.com/opensearch-project/k-NN/issues/865

jmazanec15 commented 1 year ago

@benwtrent Sure, I think flip will reverse the dimensions as well as the order:

np_flat_ds_sorted = np_flat_ds[indices]
np_flat_ds_reverse = np.flip(np_flat_ds_sorted)  # no axis given: flips every axis, not just the row order
np_flat_ds_reverse[-1][0]
-0.18350956
np_flat_ds_sorted[0][0]
0.2331585
np_flat_ds_sorted[0][-1]  # the first row's *last* element is what matches, so the dimension axis was flipped too
-0.18350956
np_flat_ds_reverse = np.flip(np_flat_ds_sorted, axis=0)  # axis=0 flips only the row order

That being said, I thought I used your code for this - I'm not sure how I got different numbers. I will retry the reversed case for both the transformed and non-transformed runs.

jmazanec15 commented 1 year ago

@searchivarius yes, a Pareto curve would be good. I think this is possible in luceneutil. I want to first just compare with the results Vespa got.

benwtrent commented 1 year ago

I ran some more ef_search values and here is a plot; these are the reversed-order runs, transformed vs. non-transformed.

It seems eventually, without transformation, even in the adversarial case, we can reach similar recall. I am not sure what other proof we need.

ef_search transformed_recall transform_latency baseline_recall baseline_latency
10 0.211 0.51 0.145 0.41
20 0.34 0.84 0.237 0.64
60 0.574 2.18 0.45 1.4
100 0.691 2.8 0.553 2.4
200 0.814 5.04 0.709 3.9
500 0.926 12.51 0.878 8.31
600 0.941 14.04 0.902 9.91
1000 0.967 19.84 0.956 14.85
2000 0.986 35.51 0.987 24.47
image image
searchivarius commented 1 year ago

@benwtrent thank you for the plot; however, the X axis is the ef value, not latency. Could you plot recall vs. latency?

benwtrent commented 1 year ago

@searchivarius here is a google drive sheet so you can copy and mutate how you wish: https://docs.google.com/spreadsheets/d/1YvLWb3bYCgsBWqvwR1gJ3EMITzSGq20Z0nqy6so4ITA/edit?usp=sharing

Here are both recalls over latency; I confess I am not sure how best to compare the two result sets.

Here is baseline:

image

Here is the transformed:

image
searchivarius commented 1 year ago

@benwtrent @jmazanec15 after fiddling for 10 minutes with Google Docs, I made a plot with ChatGPT (maybe I should have asked Bard instead?). Anyways, see the attached notebook and the attached picture. There seems to be virtually no difference between the original and the transformed runs.

That said, in the high-recall area (latency >= 15), the original run is better (compare the two rightmost data points).

In terms of additional evidence required: I think we might want to do due diligence and run more extensive tests, e.g., on several datasets. @jmazanec15 publicizing results won't harm either, b/c we don't want people to do extra work when extra work is not needed. Moreover, if some clients do want to apply the transform for some reason, it can easily be done outside of the retrieval system; there's no need for the retrieval system to support it directly.

MIPS MIPS_baseline_vs_transformed.ipynb.gz

jmazanec15 commented 1 year ago

Thanks @searchivarius sounds good.

Additionally, @benwtrent, in regard to https://github.com/apache/lucene/issues/12342#issuecomment-1650293826, I ran the non-transformed reversed test using np.flip(np_flat_ds_sorted, axis=0) and got pretty high recall numbers:

0.710    0.86   400000  0       48      200     10      2135520 1.00    post-filter
0.973    4.03   400000  90      48      200     100     2150331 1.00    post-filter
0.991    6.72   400000  190     48      200     200     2151494 1.00    post-filter
0.998   14.23   400000  490     48      200     500     2169778 1.00    post-filter

Did you change this for your tests? I can repeat the set of tests again, but I want to confirm everything looks good first.

benwtrent commented 1 year ago

@benwtrent Sure, I think flip will reverse the dimensions as well as the order:

LOL, let me re-run 🤦.

I think we might want to do due diligence and to run more extensive tests, e.g., several datasets

Do we have any additional data sets to test against? It's important to use datasets that encode information in the magnitude, and Cohere multilingual is the only one I know of offhand.

searchivarius commented 1 year ago

@benwtrent one can use some semi-synthetic data as well. We can take any embedding data set and scale each vector randomly. The scaling coefficient can be drawn from a shifted normal: 1 + N(0, sigma). Anyway, let's re-run the existing benchmark first.
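A minimal sketch of that idea (my own helper name; I also clip the coefficients to stay positive, which is an extra assumption on my part):

```python
import numpy as np

def scale_randomly(embeddings: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Scale each vector by a coefficient drawn from a shifted normal, 1 + N(0, sigma)."""
    rng = np.random.default_rng(seed)
    coeffs = 1.0 + rng.normal(0.0, sigma, size=(embeddings.shape[0], 1))
    coeffs = np.clip(coeffs, 0.05, None)  # assumption: keep every coefficient positive
    return (embeddings * coeffs).astype(np.float32)

# e.g. scaled = scale_randomly(np_flat_ds, sigma=0.2) before writing the *.train files
```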

benwtrent commented 1 year ago

I fixed my bug (thanks @jmazanec15!). I will never forget how numpy#flip works now...

Here is my new run.

ef_search transformed_recall transform_latency baseline_recall baseline_latency
10 0.757 0.45 0.710 0.35
20 0.868 0.71 0.828 0.49
60 0.962 1.59 0.948 1.10
100 0.981 2.45 0.973 1.61
200 0.993 4.23 0.991 2.81
500 0.998 8.75 0.998 5.82
600 0.998 10.10 0.998 6.71
1000 0.999 14.80 1.00 9.71
image
searchivarius commented 1 year ago

Cool. Now the non-transformed run is better.

benwtrent commented 1 year ago

OK, I ran on two more datasets; I only used the first 100k documents of each data set.

EDIT: I found a minor bug; the 'baseline' queries were being overwritten by the 'transform' ones (the ones with just a 0 at the 769th index). I reran to verify whether results were affected; they weren't. Mainly because, to measure recall, we take the brute-force kNN from the query file and compare to the approx-kNN. The data sets and the index built were not affected by this bug. If any more are found, I will happily re-run. Putting my M1 MacBook to work :).

I think this exhausts our testing for Cohere; we need to find additional data if this is not considered enough.

EN

Reversed Ordered Random
image image image

JA

Reversed Ordered Random
image image image
updating data creation (after download)

```python
import numpy as np
import pyarrow.parquet as pq

DATA_SETS = [
    {"name": "wiki768", "files": [
        "train-00000-of-00004-1a1932c9ca1c7152.parquet",
        "train-00001-of-00004-f4a4f5540ade14b4.parquet",
        "train-00002-of-00004-ff770df3ab420d14.parquet",
        "train-00003-of-00004-85b3dbbc960e92ec.parquet",
    ]},
    {"name": "wiki768en", "files": [
        "0-en.parquet",
        "1-en.parquet",
        "2-en.parquet",
        "3-en.parquet",
    ]},
    {"name": "wiki768ja", "files": [
        "0-ja.parquet",
        "1-ja.parquet",
        "2-ja.parquet",
        "3-ja.parquet",
    ]},
]

def transform_queries(Q):
    n, _ = Q.shape
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)

def transform_docs(D, norms):
    n, d = D.shape
    max_norm = magnitudes.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data

def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"

def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, f"Shape does not map [{arr1.shape}] vs [{arr2.shape}]"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)

for ds in DATA_SETS:
    name = ds["name"]
    tb1 = pq.read_table(ds["files"][0], columns=['emb'])
    tb2 = pq.read_table(ds["files"][1], columns=['emb'])
    tb3 = pq.read_table(ds["files"][2], columns=['emb'])
    tb4 = pq.read_table(ds["files"][3], columns=['emb'])
    np1 = tb1[0].to_numpy()
    np2 = tb2[0].to_numpy()
    np3 = tb3[0].to_numpy()
    np4 = tb4[0].to_numpy()
    np_total = np.concatenate((np1, np2, np3, np4))

    # Have to convert to a list here to get
    # the numpy ndarray's shape correct later
    # There's probably a better way...
    flat_ds = list()
    for vec in np_total:
        flat_ds.append(vec)

    np_flat_ds = np.array(flat_ds)
    row_count = np_flat_ds.shape[0]
    query_count = 10_000
    training_rows = row_count - query_count
    print(f"{name} num rows: {training_rows}")

    transformed_queries = transform_queries(np_flat_ds[training_rows:-1])
    validate_dataset_match_upto_dim(transformed_queries, np_flat_ds[training_rows:-1], 768)
    with open(f"{name}-transform.test", "w") as out_f:
        transformed_queries.tofile(out_f)
    with open(f"{name}.test", "w") as out_f:
        np_flat_ds[training_rows:-1].tofile(out_f)

    magnitudes = np.linalg.norm(np_flat_ds[0:training_rows], axis=1)
    indices = np.argsort(magnitudes)
    transformed_np_flat_ds = transform_docs(np_flat_ds[0:training_rows], magnitudes)
    validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds[0:training_rows], 768)
    transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]
    np_flat_ds_sorted = np_flat_ds[indices]
    with open(f"{name}.random-transform.train", "w") as out_f:
        transformed_np_flat_ds.tofile(out_f)
    with open(f"{name}.ordered-transform.train", "w") as out_f:
        transformed_np_flat_ds_sorted.tofile(out_f)
    with open(f"{name}.reversed-transform.train", "w") as out_f:
        np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)
    with open(f"{name}.random.train", "w") as out_f:
        np.flip(np_flat_ds[0:training_rows], axis=0).tofile(out_f)
    with open(f"{name}.reversed.train", "w") as out_f:
        np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
    with open(f"{name}.ordered.train", "w") as out_f:
        np_flat_ds_sorted.tofile(out_f)
```
Useful parsing & plotting functions

```python
import re
import matplotlib.pyplot as plt

def parse_console_output(terminal_output):
    # Regular expression patterns to extract recall and latency values
    recall_pattern = r"(?:\n\d+\.\d+)"
    latency_pattern = r"([\t, ]\d+\.\d+\t\d)"
    recall_values = [float(match.strip()) for match in re.findall(recall_pattern, terminal_output)]
    latency_values = [float(match.split()[0]) for match in re.findall(latency_pattern, terminal_output)]
    return (recall_values, latency_values)

def plot_things(name, baseline_recall, baseline_latency, transformed_recall, transform_latency):
    # Plotting series one: transformed_recall vs transform_latency
    plt.plot(transform_latency, transformed_recall, marker='o', label='transformed')
    # Plotting series two: baseline_recall vs baseline_latency
    plt.plot(baseline_latency, baseline_recall, marker='o', label='original (baseline)')
    # Add labels and title
    plt.xlabel('Latency')
    plt.ylabel('Recall')
    plt.title(f"{name} Transformed vs Baseline recall & latency")
    plt.legend()
    # Show the plot
    plt.grid(True)
    plt.show()
```

To use them:

```python
transformed_terminal_output = """
WARNING: Gnuplot module not present; will not make charts
recall latency nDoc fanout maxConn beamWidth visited index ms
WARNING: Using incubator modules: jdk.incubator.vector
Jul 28, 2023 12:01:58 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider
INFO: Java vector incubator API enabled; uses preferredBitSize=128
0.863 0.32 100000 0 48 200 10 0 1.00 post-filter
...
"""

baseline_terminal_output = """
WARNING: Gnuplot module not present; will not make charts
recall latency nDoc fanout maxConn beamWidth visited index ms
WARNING: Using incubator modules: jdk.incubator.vector
Jul 28, 2023 12:34:59 PM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider
INFO: Java vector incubator API enabled; uses preferredBitSize=128
0.816 0.23 100000 0 48 200 10 0 1.00 post-filter
...
"""

name = "WikiEN-Reversed"
transformed_recall, transform_latency = parse_console_output(transformed_terminal_output)
baseline_recall, baseline_latency = parse_console_output(baseline_terminal_output)
plot_things(name, baseline_recall, baseline_latency, transformed_recall, transform_latency)
```
data downloading script

```sh
#!/bin/sh
# Japanese
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/0.parquet -o 0-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/1.parquet -o 1-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/33.parquet -o 2-ja.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-ja-embeddings/parquet/Cohere--wikipedia-22-12-ja-embeddings/train/34.parquet -o 3-ja.parquet
# English
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/0.parquet -o 0-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/1.parquet -o 1-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/251.parquet -o 2-en.parquet
curl -L https://huggingface.co/api/datasets/Cohere/wikipedia-22-12-en-embeddings/parquet/Cohere--wikipedia-22-12-en-embeddings/train/252.parquet -o 3-en.parquet
```
searchivarius commented 1 year ago

@benwtrent many thanks! This is even more convincing, since we now have two cases where the original is much better than the transformed one and one where it is roughly on par.

benwtrent commented 1 year ago

OK, here (unless we have done these incorrectly) is the final Cohere test (IMO). This mixes English and Japanese embeddings (just in case the Cohere model encodes language info in the magnitude). To mix the language embeddings, I appended the embedding rows together, shuffled them, and then ran the previous testing procedure (sketched below).
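Roughly, the mixing step looks like this (a sketch; the two arrays stand in for the per-language embedding matrices produced by the data-creation script earlier in this thread):

```python
import numpy as np

# Stand-ins for the EN and JA embedding matrices loaded by the data-creation script.
en_vectors = np.random.default_rng(1).normal(size=(1000, 768)).astype(np.float32)
ja_vectors = np.random.default_rng(2).normal(size=(1000, 768)).astype(np.float32)

mixed = np.concatenate((en_vectors, ja_vectors), axis=0)
rng = np.random.default_rng(42)
rng.shuffle(mixed)  # shuffle rows in place so the two languages are interleaved
# ...then split into train/test and build the ordered/reversed/random (+transform) files as before.
```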

EN-JA

Reversed Ordered Random
image image image
msokolov commented 1 year ago

Q: just want to make sure I understand what the transformation is that you are testing here. Is it that you are taking non-unit vectors and making them into unit vectors by dividing by the max length in the corpus and then adding another dimension to force the vector to be unit length? And then comparing this to max-inner-product on vectors of varying length? If so, what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors?

searchivarius commented 1 year ago

@msokolov the transformation also preserves the inner product up to a query-specific constant.

what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors

Yes, the training process often uses the inner/dot product and not the cosine similarity, so some information is being encoded by the vector length/magnitude. Typically, however, the variance isn't that huge.
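To make the transformation concrete, here is a small numpy sketch of the augmentation being discussed in this thread (append sqrt(max_norm^2 - ||d||^2) to each document and a 0 to each query); nearest-by-Euclidean on the augmented vectors then matches largest-inner-product on the originals:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 16)) * rng.uniform(0.5, 3.0, size=(1000, 1))  # varying magnitudes
query = rng.normal(size=16)

norms = np.linalg.norm(docs, axis=1, keepdims=True)
max_norm = norms.max()
docs_aug = np.concatenate([docs, np.sqrt(max_norm**2 - norms**2)], axis=1)  # every row now has norm max_norm
query_aug = np.concatenate([query, [0.0]])

by_inner_product = np.argsort(-(docs @ query))                           # max inner product, original vectors
by_euclidean = np.argsort(np.linalg.norm(docs_aug - query_aug, axis=1))  # min L2, augmented vectors
assert np.array_equal(by_inner_product, by_euclidean)
```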

nreimers commented 1 year ago

@msokolov In our BEIR paper we talked about this: https://arxiv.org/abs/2104.08663

The issue with cosine similarity is that it just encodes the topic. Take the query "What is PyTorch?" and two docs: A: PyTorch is a framework

B: PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the modified BSD license. Although the Python interface is more polished and the primary focus of development, PyTorch also has a C++ interface

A cosine similarity model would retrieve A (it is more on-topic). A dot-product model would retrieve B, as it is on-topic as well and contains more information on the topic (represented as a longer vector).

I.e. cossim(query, A) > cossim(query, B) but dotprod(query, A) < dotprod(query, B).

So dot-product models can encode topic + quantity/quality of information, while cosine similarity models just encode topic match (which isn't perfect for retrieval).
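A toy numeric illustration of that inequality (made-up 2-d vectors, not real embeddings):

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1.0, 0.0])
doc_a = np.array([1.0, 0.0])  # short vector: on-topic only
doc_b = np.array([2.0, 1.0])  # longer vector: on-topic plus "more information"

assert cos_sim(query, doc_a) > cos_sim(query, doc_b)  # cosine prefers A
assert np.dot(query, doc_a) < np.dot(query, doc_b)    # dot product prefers B
```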

benwtrent commented 1 year ago

I found another dataset, Yandex Text-to-image: https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search

I tested against the first 500_000 values in the 1M dataset.

It utilizes inner product; looking at the magnitudes, they are all < 1, ranging from 0.79 to 0.99. So this dataset might not be that useful :/.

I am just looking for more realistic inner-product search data sets and found this one.

Yandex Image-to-Text

Reversed Ordered Random
image image image
code for transforming yandex fbin

```python
import numpy as np

def read_fbin(filename, start_idx=0, chunk_size=None):
    """ Read *.fbin file that contains float32 vectors
    Args:
        :param filename (str): path to *.fbin file
        :param start_idx (int): start reading vectors from this index
        :param chunk_size (int): number of vectors to read. If None, read all vectors
    Returns:
        Array of float32 vectors (numpy.ndarray)
    """
    with open(filename, "rb") as f:
        nvecs, dim = np.fromfile(f, count=2, dtype=np.int32)
        nvecs = (nvecs - start_idx) if chunk_size is None else chunk_size
        arr = np.fromfile(f, count=nvecs * dim, dtype=np.float32, offset=start_idx * 4 * dim)
    return arr.reshape(nvecs, dim)

def transform_queries(Q):
    n, _ = Q.shape
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)

def transform_docs(D, norms):
    n, d = D.shape
    max_norm = magnitudes.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data

def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"

def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, f"Shape does not map [{arr1.shape}] vs [{arr2.shape}]"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)

name = "yandex"
np_flat_ds = read_fbin("base.1M.fbin")[0:500_000]
queries = read_fbin("query.public.100k.fbin")[0:10_000]

transformed_queries = transform_queries(queries)
validate_dataset_match_upto_dim(transformed_queries, queries, 200)
with open(f"{name}-transform.test", "w") as out_f:
    transformed_queries.tofile(out_f)
with open(f"{name}.test", "w") as out_f:
    queries.tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds, axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds, magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds, 200)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]
np_flat_ds_sorted = np_flat_ds[indices]
with open(f"{name}.random-transform.train", "w") as out_f:
    transformed_np_flat_ds.tofile(out_f)
with open(f"{name}.ordered-transform.train", "w") as out_f:
    transformed_np_flat_ds_sorted.tofile(out_f)
with open(f"{name}.reversed-transform.train", "w") as out_f:
    np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)
with open(f"{name}.random.train", "w") as out_f:
    np_flat_ds.tofile(out_f)
with open(f"{name}.reversed.train", "w") as out_f:
    np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
with open(f"{name}.ordered.train", "w") as out_f:
    np_flat_ds_sorted.tofile(out_f)
```
searchivarius commented 1 year ago

@benwtrent thank you! Even though this dataset is almost normalized, the non-transformed approach is still better in at least two scenarios (ordered and random).

benwtrent commented 1 year ago

@searchivarius && @msokolov do we think any additional testing should be done? Or are we comfortable with https://github.com/apache/lucene/pull/12479 going through as is with the testing done?

FYI, I reran my tests locally. I re-read the requirements and it seemed that in the "transformed" case, we SHOULD be using euclidean as the similarity comparator. My tests used "dot-product" (angular).

So, non-transformed == max-inner-product and transformed == euclidean.

In my local reruns, there is almost no difference. The random case looks very similar, and the ordered & reversed cases are more equal between the two runs, but the Pareto graphs are effectively the same :/

searchivarius commented 1 year ago

Hi @benwtrent first question:

do you have your test code publicly available?

we SHOULD be using euclidean as the similarity comparator. My tests used "dot-product" (angular).

Second, great that you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm; I think any constant would suffice:

L2=(a-b)^2 = |a|^2 - ab + |b|^2 = const1 - cosine_similarity(a,b) * const2

What do you think? cc @jmazanec15

Additional experiments are always good. I have no immediate suggestion for realistic embeddings, though.

However, one can try to stress test the method by synthetically changing the data: Multiply or divide vector elements by a random uniform number in the range [1, M]. For large enough M, the transform might become beneficial.
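A quick sketch of that stress test (my own helper; a coin flip decides multiply vs. divide):

```python
import numpy as np

def stress_scale(embeddings: np.ndarray, m: float, seed: int = 0) -> np.ndarray:
    """Multiply or divide each vector by a random uniform coefficient in [1, M]."""
    rng = np.random.default_rng(seed)
    coeffs = rng.uniform(1.0, m, size=(embeddings.shape[0], 1))
    divide = rng.random((embeddings.shape[0], 1)) < 0.5
    coeffs = np.where(divide, 1.0 / coeffs, coeffs)
    return (embeddings * coeffs).astype(np.float32)

# Larger M spreads the magnitudes further apart, which is where the transform might start to win.
```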

benwtrent commented 1 year ago

Here are my results for the en-ja mixed dataset.

max inner product in baseline, euclidean in transformed

EN-Ja rerun

Reversed Ordered Random
image image image
Updated knnPerfTest.py

For my testing runs I do the following:

- Run all files & tests with just `fanout: 0` but with `-reindex` passed as an arg. This builds the indices.
- Then I remove the `-reindex` param and run with all the fanout parameters.

I found this to be the quickest way to test.

```python
#!/usr/bin/env python

import subprocess
import benchUtil
import constants

LUCENE_CHECKOUT = 'lucene_candidate'

# test parameters. This script will run KnnGraphTester on every combination of these parameters
VALUES = {
    'ndoc': (100000,),
    'maxConn': (48,),
    'beamWidthIndex': (200,),
    'fanout': (0, 10, 50, 90, 190, 490, 590, 990),
    'topK': (10,),
}

def advance(ix, values):
    for i in reversed(range(len(ix))):
        param = list(values.keys())[i]
        #print("advance " + param)
        if ix[i] == len(values[param]) - 1:
            ix[i] = 0
        else:
            ix[i] += 1
            return True
    return False

def run_knn_benchmark(checkout, values, training_file, testing_file, dims, metric):
    indexes = [0] * len(values.keys())
    indexes[-1] = -1
    args = []
    print(f"\n\n\nNow running{training_file}\n\n\n")
    dim = dims #768
    doc_vectors = training_file
    query_vectors = testing_file
    cp = benchUtil.classPathToString(benchUtil.getClassPath(checkout))
    JAVA_EXE = '/Users/benjamintrent/Library/Java/JavaVirtualMachines/jdk-20.0.1.jdk/Contents/Home/bin/java'
    cmd = [JAVA_EXE, '-cp', cp, '--add-modules', 'jdk.incubator.vector',
           '-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false', 'KnnGraphTester']
    print("recall\tlatency\tnDoc\tfanout\tmaxConn\tbeamWidth\tvisited\tindex ms")
    while advance(indexes, values):
        pv = {}
        args = []
        for (i, p) in enumerate(list(values.keys())):
            if p in values:
                if values[p]:
                    value = values[p][indexes[i]]
                    pv[p] = value
                else:
                    args += ['-' + p]
        args += [a for (k, v) in pv.items() for a in ('-' + k, str(v)) if a]
        this_cmd = cmd + args + [
            '-dim', str(dim),
            '-docs', doc_vectors,
            #'-reindex',
            '-metric', metric,
            '-search', query_vectors,
            '-forceMerge',
            '-quiet',
        ]
        subprocess.run(this_cmd)

tests = [
    ('%s/util/en_ja.random.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
    ('%s/util/en_ja.ordered.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
    ('%s/util/en_ja.reversed.train' % constants.BASE_DIR, '%s/util/en_ja.test' % constants.BASE_DIR, 768, "angular"),
    ('%s/util/en_ja.random-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
    ('%s/util/en_ja.ordered-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
    ('%s/util/en_ja.reversed-transform.train' % constants.BASE_DIR, '%s/util/en_ja-transform.test' % constants.BASE_DIR, 769, "euclidean"),
]

for (training_file, testing_file, dims, metric) in tests:
    run_knn_benchmark(LUCENE_CHECKOUT, VALUES, training_file, testing_file, dims, metric)
```
searchivarius commented 1 year ago

@benwtrent thank you for re-running, much appreciated! The results look the same, but do we agree that after the transform is done, all three similarities / distances should be equivalent for the purpose of k-NN search? This is due to all documents being normalized to the same length (and query length is the same during the search, simply because it's the very same query). The equivalence is more obvious when queries and documents all have unit length. However, I think it is still true assuming consistent normalization of documents.

searchivarius commented 1 year ago

PS: @benwtrent could you publish your benchmark code? We could possibly extend it, e.g., add new tests.

benwtrent commented 1 year ago

@searchivarius my latest comment has the updated knnPerfTest.py that I have been using via Lucene Util.

I use a build of Lucene that utilizes the dot-product scaling suggested by @jmazanec15.

I take the console output and copy/paste into the extraction & plotting code given in this comment: https://github.com/apache/lucene/issues/12342#issuecomment-1656112434

All very manual.

jmazanec15 commented 1 year ago

Second, great you remembered it, but I think there's no difference between cosine and L2 (i.e., search results are the same) if queries and documents have constant norms. They don't have to be normalized to the unit norm, I think any constant would suffice:

L2=(a-b)^2 = |a|^2 - ab + |b|^2 = const1 - cosine_similarity(a,b) * const2

What do you think? cc @jmazanec15

@searchivarius yes, I believe you are correct. Attached is a small proof that ordering is the same between Euclidean distance and negative dot product when the norm of all vectors is the same (please double-check that I did not make any mistakes).

Proof

Prove: Given a set of points $S \subset \mathbb{R}^d$ where $\exists x \in \mathbb{R}\ \forall v \in S\ \|v\|^2 = x$. Then, $\forall q \in \mathbb{R}^d$, the ordering produced by $\|q - v_1\|^2 \le \|q - v_2\|^2 \iff v_1 \le v_2$ is the same as the ordering produced by $-\langle q, v_1\rangle \le -\langle q, v_2\rangle \iff v_1 \le v_2$.

Starting with:

$$\|q - v_1\|^2 \le \|q - v_2\|^2$$

By the parallelogram law:

$$\Rightarrow 2(\|q\|^2 + \|v_1\|^2) - \|q + v_1\|^2 \le 2(\|q\|^2 + \|v_2\|^2) - \|q + v_2\|^2$$

Removing equal terms:

$$\Rightarrow -\|q + v_1\|^2 \le -\|q + v_2\|^2$$

Flipping the sign:

$$\Rightarrow \|q + v_1\|^2 \ge \|q + v_2\|^2$$

Definition of the norm:

$$\Rightarrow \langle q + v_1, q + v_1\rangle \ge \langle q + v_2, q + v_2\rangle$$

Expanding:

$$\Rightarrow \langle q, q \rangle + \langle v_1, v_1 \rangle + 2\langle q, v_1 \rangle \ge \langle q, q \rangle + \langle v_2, v_2 \rangle + 2\langle q, v_2 \rangle$$

Removing equal terms:

$$\Rightarrow 2\langle q, v_1 \rangle \ge 2\langle q, v_2 \rangle$$

Dividing and flipping the sign:

$$\Rightarrow -\langle q, v_1 \rangle \le -\langle q, v_2 \rangle$$
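A quick numeric sanity check of the same claim (numpy; every document gets the same norm):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(500, 32))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit norm...
docs *= 3.7                                          # ...then any shared constant norm
query = rng.normal(size=32)

order_by_l2 = np.argsort(np.linalg.norm(docs - query, axis=1))  # ascending distance
order_by_neg_dot = np.argsort(-(docs @ query))                  # ascending negative dot product
assert np.array_equal(order_by_l2, order_by_neg_dot)
```
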
searchivarius commented 1 year ago

Seems to be correct (after checking what exactly the parallelogram law is). However, as I mentioned above, a simpler expansion of the dot product gives us that L2 is a monotonic transformation of the cosine similarity (and of the dot product as well, b/c norms are constant), and monotonicity preserves the order. I just wanted to make sure my simple derivation of:

L2=(a-b)^2 = |a|^2 - 2*(a,b) + |b|^2 = const1 - cosine_similarity(a,b) * const2

looked plausible. PS: I forgot the 2 before (a,b) earlier, but an extra constant doesn't change the outcome.

jmazanec15 commented 1 year ago

@searchivarius I see, yes, that's correct.

benwtrent commented 1 year ago

Here are some results on synthetic datasets. I utilized max-inner-product (scaled to prevent negatives) for the baseline, and euclidean plus the typical transformation for the transformed runs.

Here is the code for generating the scaled data: https://gist.github.com/benwtrent/be084f4737f79dab7204eca3e6ab98fe

Normal(loc=1,scale=0.1)

Magnitude stats:

mean median var max min
5.768894195556641 5.769467830657959 0.054713960736989975 6.783757209777832 4.727940082550049
Reversed Ordered Random
image image image

Normal(loc=1,scale=0.2)

mean median var max min
10.196664810180664 10.195568084716797 0.03208443149924278 11.093703269958496 9.358368873596191
Reversed Ordered Random
image image image

Pareto(5)

mean median var max min
4.014744281768799 3.8864400386810303 0.5584204792976379 33.431556701660156 2.2141001224517822
Reversed Ordered Random
image image image

Uniform

mean median var max min
5.768894195556641 5.769467830657959 0.054713960736989975 6.783757209777832 4.727940082550049
Reversed Ordered Random
image image image

BiModal 0.5

Two normal distributions with loc=1 and loc=3, each multiplied by 0.5

mean median var max min
20.04954719543457 20.048864364624023 0.016332395374774933 20.65839195251465 19.469064712524414
Reversed Ordered Random
image image image

BiModal 0.9

Two normal distributions with loc=1 and loc=3, each multiplied by 0.9 & 0.1 respectively

mean median var max min
12.134689331054688 12.133764266967773 0.02648037299513817 12.946907997131348 11.34481430053711
Reversed Ordered Random
image image image

Gamma(shape=1,scale=1)

mean median var max min
14.072248458862305 13.95256233215332 1.9276633262634277 30.004880905151367 9.380194664001465
Reversed Ordered Random
image image image

Gamma(shape=2,scale=2)

mean median var max min
48.86893081665039 48.64986038208008 11.134326934814453 75.66817474365234 36.40324401855469
Reversed Ordered Random
image image image
searchivarius commented 1 year ago

Looking great, many thanks! Could you remind me what ordered and reversed are? Is this something related to insertion order?

benwtrent commented 1 year ago

Yes, insertion order. So which docs are indexed first.

Ordered is sorted by magnitude in ascending order. So smallest magnitudes are indexed first.

Reversed is the opposite: descending order by magnitude, so the largest magnitudes are indexed first.
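In code (mirroring the data-creation scripts above), the three insertion orders are just:

```python
import numpy as np

docs = np.random.default_rng(0).normal(size=(100, 8))  # stand-in for the document vectors

order = np.argsort(np.linalg.norm(docs, axis=1))  # indices sorted ascending by magnitude
ordered = docs[order]                  # smallest magnitudes indexed first
reversed_ = np.flip(ordered, axis=0)   # largest magnitudes indexed first
random_order = docs                    # "random" is just the original, unsorted order
```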

benwtrent commented 1 year ago

Just to confirm the gamma numbers for the reversed case (as it's the only one where the transformed run does better), I ran it again, but with up to 2000 neighbors explored. The latency & recall graphs stay steady.

image

image

benwtrent commented 1 year ago

It makes sense that the transformation works better with the gamma distributions (but only when the indexed docs are ordered by magnitude), as they have by far the highest variance of all the distributions.

It is an exceptional case for data to not only have such a high variance, but also for the user to purposefully index the documents sorted by their magnitude.

All our real datasets had a much lower variance. And in the random order case, baseline is still better than transformed. Even in the ordered & reverse cases, baseline was able to achieve 95% recall with just 1ms more latency.

Are we comfortable with moving forward and merging https://github.com/apache/lucene/pull/12479 bringing Max-inner-product into Lucene?

jmazanec15 commented 1 year ago

@benwtrent I am comfortable with it.

For the future: if we ever decide to switch up the edge-selection strategy so that a node not only has to be closer than a threshold distance, but closer by a factor of X (I believe Vamana has a strategy like this), then we may need to take extra care that the scaling works for the transformation. But this is not a problem now, and I believe there would be a similar problem with the raw dot product anyway.