apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores #12342

Closed jmazanec15 closed 1 year ago

jmazanec15 commented 1 year ago

Currently, the VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. For reference, this is the method:

public float compare(float[] v1, float[] v2) {
  return (1 + dotProduct(v1, v2)) / 2;
}
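For example, with non-normalized inputs such as v1 = (2, 0) and v2 = (-3, 0), dotProduct(v1, v2) = -6 and the score comes out to (1 + -6) / 2 = -2.5, i.e. negative.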

While the method's javadoc warns that vectors should be normalized before use, I am wondering if we can get rid of this restriction by mapping negative scores into (0, 1) and positive scores into [1, Float.MAX_VALUE) with:

float dotProd = dotProduct(v1, v2);

if (dotProd < 0) {
  return 1 / (1 + -1 * dotProd);
}
return dotProd + 1;

and let the user worry about normalization
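To make the mapping concrete with a few sample values: dotProd = -3 would score 1 / (1 + 3) = 0.25, dotProd = 0 would score 1, and dotProd = 2 would score 3, so ordering by dot product is preserved while all scores stay positive.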

Related issue: https://github.com/opensearch-project/k-NN/issues/865

benwtrent commented 1 year ago

I honestly would much prefer us removing the normalization restriction and not doing anything to the output. IMO, plain ol' dot_product is perfectly fine for scoring, as higher values mean more relevant (just like Lucene scoring). I am not sure why this was scaled at all.

But, since this changes how things are scored (and that would be considered a breaking change, correct?), why not remove the scaling altogether and allow the score to be the result of dot_product directly?

benwtrent commented 1 year ago

Ok, did a bunch of reading on MAX Block WAND and MAXScore :/. I guess MBW is the reason for us bounding these vector scores somehow.

Technically, max inner product is unbounded. Encoding information in the vector magnitude is useful (especially for multi-lingual search from what I gather).

My questions would be:

jmazanec15 commented 1 year ago

@benwtrent Thanks for taking a look! Interesting, I am not too familiar with MBW. I'll take a look.

The main reason I wanted to avoid returning the dot product was to avoid negative scores, as referenced by https://github.com/apache/lucene/issues/9044.

Also, what do you mean by scaling continuously? The above formula maps negative and positive dot products to the same number of possible scores, reducing overall precision by a factor of 2 (please correct me if I am wrong).

benwtrent commented 1 year ago

Also, what do you mean by scaling continuously?

Your algorithm is piecewise vs. continuous. But, I am not sure how we could do a continuous transformation (everything is on the same scale). EDIT: Thinking more, I am not sure we would want to and your equation is OK. More thought here is required.

I am not too familiar with MBW.

Yeah, MBW and MAXSCORE get all mixed up in my brain. But, yes MAXSCORE is why we disallow negative scoring. Forgive my misdirection.

The above formula maps negative and positive dot products to the same number of possible scores, reducing overall precision by a factor of 2

My main concern overall is this: we are changing the scoring methodology, period, for positive scores (which are thus considered "valid"). I think this means that it cannot go into a Lucene 9.x release (correct @jpountz ?).

What do you think @msokolov ? Maybe we have a new MAX_INNER_PRODUCT scoring that uses @jmazanec15's suggestion?

msokolov commented 1 year ago

hmm, my view is dot_product is only usable if your vectors are normalized, as documented. I also don't think we can change the scoring formula in a minor release. As for producing a D.P. that is scaled for use with arbitrary vectors I don't see the point really. If what you want is to handle arbitrary scaled vectors, EUCLIDEAN is a better choice. It will produce the same rank as DOT_PRODUCT for normalized vectors and has the meaning of an actual metric (satisfies the triangle inequality). What does D.P. even mean for random vectors? What if one of the vectors is zero? Then it is equidistant to every other vector?
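(For reference, with unit vectors u and v we have ||u - v||^2 = 2 - 2 * (u . v), so ranking by EUCLIDEAN distance and by dot product is indeed identical when vectors are normalized.)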

I guess I'd want to see a real use case before working to support this use case that seems weird to me. And honestly there are other distances that seem more useful (L1 norm, for example)

jpountz commented 1 year ago

Historically Lucene did not have restrictions on scores, but approaches for dynamic pruning like MAXSCORE and WAND assume that matching another clause would make the score higher, not lower. In hindsight, it made sense that scores should be non-negative, so we updated the contract of Scorer.score() and our built-in similarities to only produce non-negative scores.

I'm not too worried about having a vector similarity that produces unbounded scores from the perspective of MAXSCORE/WAND. The way things work today, the vector query is first rewritten into a query that maps a small set of doc IDs to scores, so we can easily get access to the maximum score over a range of doc IDs, which is what WAND/MAXSCORE need. To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it's hard to know ahead of time whether the vector may completely dominate scores. And also bw compat as MikeS raised.

benwtrent commented 1 year ago

To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it's hard to know ahead of time whether the vector may completely dominate scores.

I say we have that problem now. Vector scores and BM25 are nowhere near on the same scale. Folks need to adjust their boosts accordingly, regardless.

And also bw compat as MikeS raised.

I agree, BWC is a big deal here. And I suggest we create a new similarity that just uses dot product under the hood. Call it maximum-inner-product.

As for producing a D.P. that is scaled for use with arbitrary vectors I don't see the point really. If what you want is to handle arbitrary scaled vectors, EUCLIDEAN is a better choice.

Quoting a SLACK conversation with @nreimers:

Wow, what a bad implementation by Elastic. Models with unnormalized vectors and dot product work better for search than models with normalized vectors / cosine similarity. Models with cosine similarity have the issue that they often retrieve noise when your dataset gets noisier ... The best match for a query (e.g. What is the capital of the US ) with cosine similarity is the query itself, as cossim(query, query)=1. So when your corpus gets bigger and is not carefully cleaned, it contains many short documents that look like queries. These are preferably retrieved by the model, so the user asks a question and gets as a response a doc that is a paraphrase of the query (e.g. query="What is the capital of the US", top-1 hit: Capital of the US). Dot product has the tendency to work better when your corpus gets larger / noisy.

msokolov commented 1 year ago

thanks for the reference @benwtrent, that's an interesting perspective. I wouldn't be opposed to adding a new distance if people find it useful

benwtrent commented 1 year ago

The way I read this @jpountz

I'm not too worried about having a vector similarity that produces unbounded scores from the perspective of MAXSCORE/WAND. The way things work today, the vector query is first rewritten into a query that maps a small set of doc IDs to scores, so we can easily get access to the maximum score over a range of doc IDs, which is what WAND/MAXSCORE need.

Is that negative scores here are OK as the optimization constraints traditionally required do not apply.

If that is the case, I would suggest us adding a new scoring methodology that is simply the dot product and call it maximum inner product.

The current scaling for dot_product only makes sense for normalized vectors and it should only be treated as an optimization.

uschindler commented 1 year ago

For byte vectors we already have some guard built in (at least they can't get < 0). See VectorUtil#dotProductScore. In the other issue #12281 I have also seen problems with float vectors that produced infinite floats as dotProduct or NaN as cosine (due to Infinity / Infinity => NaN). We wanted to open a new issue already, so this one fits.

So this also relates to this discussion: Should we have some constraints on vectors while they are indexed. In the other PR we added the requirement to make all their components finite.
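For context, the byte-vector guard works because every component lies in [-128, 127], so the dot product is bounded by the dimension. A rough sketch of that style of scaling (illustrative only; the actual VectorUtil#dotProductScore implementation may differ in details):

```java
// Illustrative sketch only; not necessarily the exact VectorUtil#dotProductScore code.
// Each term a[i] * b[i] is at most 128 * 128 = 2^14 in magnitude, so
// |dotProduct| <= a.length * 2^14, and dividing by a.length * 2^15 keeps the
// shifted score inside [0, 1].
static float byteDotProductScore(byte[] a, byte[] b) {
  int dotProd = 0;
  for (int i = 0; i < a.length; i++) {
    dotProd += a[i] * b[i];
  }
  float denom = a.length * (1 << 15);
  return 0.5f + dotProd / denom;
}
```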

jpountz commented 1 year ago

Is that negative scores here are OK as the optimization constraints traditionally required do not apply.

I meant to convey that it should be ok not to have upper bounds on the produced scores. I still think scores should always be non-negative.

uschindler commented 1 year ago

Is that negative scores here are OK as the optimization constraints traditionally required do not apply.

I meant to convey that it should be ok not to have upper bounds on the produced scores. I still think scores should always be non-negative.

I think for floats it is not as easy as in the byte case with dotProductScore(). I also referenced this from the other function query issue about what the "default score" should be if one of the documents has no vector. 0 works fine for classical scoring if you only have positive scores.

msokolov commented 1 year ago

would it make sense to truncate negative scores to zero? Since we think this is an abuse/misconfiguration, loss of information seems OK, and at least we would be able to guarantee not to violate the "no negative scores" contract. Then if we want to have a separate score that scales in an information-preserving way, we can add it.

nreimers commented 1 year ago

@msokolov The index / vector DB should return the dot product score as is. No scaling, no truncation.

Using dot product is tremendously useful for embedding models, they perform in asymmetric settings where you want to map a short search query to a longer relevant document (which is the most common case in search) much better than cosine similarity or euclidean distance.

But here the index should return the values as is and it should then be up to the user to truncate negative scores or to normalize these scores to pre-defined ranges.

uschindler commented 1 year ago

@msokolov The index / vector DB should return the dot product score as is. No scaling, no truncation.

Using dot product is tremendously useful for embedding models, they perform in asymmetric settings where you want to map a short search query to a longer relevant document (which is the most common case in search) much better than cosine similarity or euclidean distance.

But here the index should return the values as is and it should then be up to the user to truncate negative scores or to normalize these scores to pre-defined ranges.

The problem is that this is not compatible with Lucene.

benwtrent commented 1 year ago

I would think as long as more negative values are scored lower, we will retrieve documents in a sane manner.

Scaling negatives to restrict them and then not scaling positive values at all could work. The _score wouldn't always be the dot-product exactly, but it allows KNN search to find the most relevant information, even if all of the dot-products are negative when comparing with the query vector.

This brings us back to @jmazanec15 suggestion on scaling scores.

msokolov commented 1 year ago

Yeah, after consideration, I think we could maybe argue for changing the scaling of negative values given that they were documented as unsupported, even though it would be breaking back-compat in the sense that scores would change. But I think we ought to preserve the scaling of non-negative values in case people have scaling factors they use for combining scores with other queries' scores. So we could go with @jmazanec15's suggestion, except leaving in place the scaling by 1/2?

benwtrent commented 1 year ago

@msokolov Ah, so negative values would live between (0, 0.5) and positive values would still be between [0.5,...)?

msokolov commented 1 year ago

Yeah. Another thing we could consider is doing this scaling in KnnVectorQuery and/or its Scorer. These have the ultimate responsibility of complying with the Scorer contract. If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.

jmazanec15 commented 1 year ago

I think the scores would have to be preserved not only for positive dot products amongst normalized vectors, but also for negative ones, to avoid breaking bwc. I think the current range of valid dot products is [-1, 1] and scores map to [0, 1]. So I don't think we could map all negative values into [0, 0.5].

uschindler commented 1 year ago

Yeah. Another thing we could consider is doing this scaling in KnnVectorQuery and/or its Scorer. These have the ultimate responsibility of complying with the Scorer contract. If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.

Wasn't there the possibility to return one score for indexing and another for search? Basically the VectorSimilarity enum could have a separate method called queryScore(v1, v2) that is enforced to be positive. Actually for cosine it's not a problem as it's normalized, so we can add 1 (and, for safety, use Math.max(0, result) to prevent rounding errors). The absolute values of scores are not important (unless you want to combine them with other query scores, but for that you have query boosts).
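A minimal sketch of that idea, assuming a hypothetical queryScore method (VectorSimilarityFunction has no such method today):

```java
// Hypothetical queryScore as suggested above; not an existing Lucene API.
// Cosine lies in [-1, 1], so adding 1 keeps the score non-negative, and the
// max() guards against tiny negative values caused by floating-point rounding.
public float queryScore(float[] v1, float[] v2) {
  return Math.max(0f, VectorUtil.cosine(v1, v2) + 1);
}
```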

benwtrent commented 1 year ago

If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.

I am not sure why we care about separating VectorSimilarity and scoring. VectorSimilarity is only ever used for KNN search and indexing, and as long as less-similar vectors score lower, it's fine.

If we start thinking about separating out scoring and similarity, we should do it for all the current similarities. This would be significant work and it would be tricky. Think of EUCLIDEAN: we invert its calculation so that a higher score means more similar. So, we would still need to use queryScore as the indexing similarity without significant changes to the underlying assumptions of the graph builder, etc.

If folks want to use the raw vector distances, they should use VectorUtil.

I think the current range of valid dot products is [-1, 1] and scores map to [0, 1]. So I don't think we could map all negative values into [0, 0.5].

I think you are correct @jmazanec15, since normalized vectors lie on the unit sphere. It's possible to have negative values (and thus fall into the [0, 0.5] range) when they point in opposite directions within the sphere. Your scaling method + a new MAX_INNER_PRODUCT similarity (which just uses dotProduct and scales it differently) covers both the requirement of disallowing negative scores and the non-normalized vector use case.

This may complicate things (which 'dotProduct' should I use?!?!?!), but we should not change the existing VectorSimilarityFunction#DOT_PRODUCT. Maybe we can deprecate VectorSimilarityFunction#DOT_PRODUCT usage for new fields in 9x to encourage switching to MAX_INNER_PRODUCT and remove VectorSimilarityFunction#DOT_PRODUCT in 10.
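As a rough sketch of what such a constant could look like in VectorSimilarityFunction (the name and exact scaling are illustrative, not a final API):

```java
// Illustrative sketch of a possible new similarity; not a final Lucene implementation.
MAXIMUM_INNER_PRODUCT {
  @Override
  public float compare(float[] v1, float[] v2) {
    float dotProd = VectorUtil.dotProduct(v1, v2);
    // Piecewise scaling from the issue description: negative dot products map
    // into (0, 1), non-negative ones into [1, ...), so scores stay non-negative
    // and ordering by dot product is preserved.
    return dotProd < 0 ? 1 / (1 - dotProd) : dotProd + 1;
  }
}
```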

jmazanec15 commented 1 year ago

@benwtrent I think that makes sense, but it would add a little confusion.

How common is it to use vector results with MAX_SCORE/WAND? I am wondering if it would be better to just leave it as is in 9.x and change the javadoc warning to say that non-normalized vectors are supported, but that they should not be used with WAND/MAX_SCORE and can return negative scores. And then switch the score to the scaled version in 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?

benwtrent commented 1 year ago

And then switch the score to the scaled version in 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?

If you are utilizing hybrid search, disabling WAND/MAX_SCORE will slow things down significantly.

We should protect folks from shooting themselves in the foot.

but would add a little confusion.

I agree, there will be confusion. What do you think @uschindler & @msokolov ?

Being able to do non-normalized dot-product is an important aspect of recommendation engines and vector search as a whole. My imagination is too poor to come up with a better solution than adding a new similarity function that uses dot-product under the hood and scales differently.

benwtrent commented 1 year ago

@jmazanec15 have you done any digging into the dot-product scaling and whether it provides good recall in the MAX-INNER-PRODUCT search use case?

https://blog.vespa.ai/announcing-maximum-inner-product-search/ && https://towardsdatascience.com/maximum-inner-product-search-using-nearest-neighbor-search-algorithms-c125d24777ef

Implies there might be some weirdness with HNSW and raw MIP. I am honestly not 100% sure if Lucene has this issue specifically with HNSW.

A key observation in MIP is that a vector is no longer closest to itself, but instead it would be much closer to 2*vector than just vector.
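(Concretely: for v = (1, 1), v . v = 2 while v . (2v) = 4, so under inner product a vector is not its own nearest neighbor.)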

jmazanec15 commented 1 year ago

@benwtrent I have been thinking about this and am still not completely sure of the implications. It seems like the construction of the graphs may rely on some assumption about the underlying space supporting the triangle inequality. Thus, with inner product space where this does not hold, the graph construction might have problems.

However, graphs aside, with brute force search, utilizing the scaled negative dot product would preserve the ordering of MIPs search.

I will try to think more about this this week.

benwtrent commented 1 year ago

@jmazanec15 storing the largest magnitude per segment for scoring isn't that bad a change to the codec if it means we can truly support maximum inner product. Plus it would be a change that would help other vector indexing codecs in the future besides HNSW.

jmazanec15 commented 1 year ago

@benwtrent Interesting, I'm still not sure if this approach is necessary. I spoke with @searchivarius, who is the maintainer of nmslib, and he mentioned that there is some research suggesting this is not required (https://proceedings.neurips.cc/paper_files/paper/2018/hash/229754d7799160502a143a72f6789927-Abstract.html, https://arxiv.org/pdf/1506.03163.pdf).

Let me try re-running the Vespa experiments with Lucene without the reduction and see what numbers we get. I don't think the blog post included any comparison to the negative dot product approach (please correct me if I am missing something).

searchivarius commented 1 year ago

thank you @jmazanec15 : there's also an unpublished paper (I can share the preprint privately) where we benchmarked HNSW for maximum inner product search on 3 datasets and it was just fine (for this paper I did try the reduction to the cosine similarity and I also got poorer outcomes). In my thesis, I benchmarked SW-graph (which is pretty much HNSW when it comes to peculiarities of handling the inner product search) using an inner-product like similarity (fusion of BM25 and MODEL1 scores) and it was fine. See the black asterisk run in Figure 3.2.

Moreover, HNSW and SW-graph were tested with non-metric similarities (see again my thesis and references therein) as well as in Yury Malkov's HNSW paper. These methods established SOTA results as well. There is also an extract from the thesis (published separately) that focuses specifically on search with non-metric similarities. Again, things just work.

One may wonder why, right? I think for real datasets the quirky distances don't deviate from the Euclidean distances all that much so the minimal set of geometric properties required for graph based retrieval is preserved (and no I don't think the triangle inequality is required).

Specifically, for the inner product search the outcomes are pretty close (in many cases) to the outcomes where the inner product search is replaced with cosine similarity (which is equivalent to L2 search). Why? Because with real embeddings the magnitude of vectors doesn't change all that much.

That said, there are of course degenerate cases (I know one, but embedding models don't produce such weirdness) where HNSW won't work with MIPS (or rather recall will be low). However, I am not aware of any realistic one. If you have some interesting examples of real datasets where direct application of HNSW/SW-graph fails, I would love to have a look.

REGARDING THE SCORE sign: dot-product scores need not be normalized, but the sign can be changed when the result is returned to the user.

benwtrent commented 1 year ago

Thank you for the deep information @searchivarius .

eagerly awaiting your results @jmazanec15 :)

jmazanec15 commented 1 year ago

I ran an initial experiment. It appears that recall without the pre-processing is very high (99.1) compared to with the pre-processing (87.4), when mimicking one of the experiments from https://blog.vespa.ai/announcing-maximum-inner-product-search/.

That being said, @benwtrent would you be able to double check my experiment setup to ensure I didn't overlook something?

Experiment

Their experiment used the following data:

And used the following config:

For this, they reported a recall@10 of 87.4

I used luceneutil and set the following parameters:

I got a recall@10 of 99.1:

$ time python src/python/knnPerfTest.py
WARNING: Gnuplot module not present; will not make charts
lucene
{'ndoc': (400000,), 'maxConn': (48,), 'beamWidthIndex': (200,), 'fanout': (200,), 'topK': (10,)}
/home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
['java', '-cp', '/home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main', '--add-modules', 'jdk.incubator.vector', '-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false', 'KnnGraphTester', '-ndoc', '400000', '-maxConn', '48', '-beamWidthIndex', '200', '-fanout', '200', '-topK', '10', '-dim', '768', '-docs', '/home/ec2-user/data-prep/wiki768.train', '-reindex', '-search', '/home/ec2-user/data-prep/wiki768.test', '-metric', 'angular', '-quiet']
WARNING: Using incubator modules: jdk.incubator.vector

0.991    6.98   400000  200     48      200     210     1913700 1.00    post-filter

real    45m3.266s
user    40m8.451s
sys     4m54.290s

Dataset setup details

I pulled the data sets as parquet files from https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/tree/main/data:

```
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00000-of-00004-1a1932c9ca1c7152.parquet
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00001-of-00004-f4a4f5540ade14b4.parquet
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00002-of-00004-ff770df3ab420d14.parquet
curl -LO https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/main/data/train-00003-of-00004-85b3dbbc960e92ec.parquet
```

I ran the following to translate it into the data set that could be used by luceneutil (pip install numpy pyarrow):

```python
import numpy as np
import pyarrow.parquet as pq

tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

np_flat_ds = np.array(flat_ds)

# Shape is (485859, 768) and dtype is float32
np_flat_ds

with open("wiki768.train", "w") as out_f:
    np_flat_ds[0:400000].tofile(out_f)

with open("wiki768.test", "w") as out_f:
    np_flat_ds[475858:-1].tofile(out_f)
```

I then modified the KnnPerfTool.py to use this data set, set the above parameters, and ran the test.
searchivarius commented 1 year ago

@jmazanec15 thank you for running the experiments: what are the speed ups?

jmazanec15 commented 1 year ago

@searchivarius For this, I didn't track latency. I was just checking whether there is a change in recall when using the transformation vs. when not using it, based on the experiment Vespa ran.

benwtrent commented 1 year ago

@jmazanec15 I will try to replicate later today. Quick question, did you merge to a single segment? This will have a dramatic change in recall as searching multiple segments gives you much higher recall with higher latency.

jmazanec15 commented 1 year ago

@benwtrent I uncommented these 2 lines: https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java#L699-L702 and set max buffer to 1000000.

This is the index as well:

$ ls -la wiki768.train-48-200.index
total 1232744
drwxr-xr-x.  2 ec2-user ec2-user      16384 Jul 20 05:50 .
drwxr-xr-x. 12 ec2-user ec2-user      16384 Jul 19 23:26 ..
-rw-r--r--.  1 ec2-user ec2-user        159 Jul 20 05:50 _6.fdm
-rw-r--r--.  1 ec2-user ec2-user    1619007 Jul 20 05:50 _6.fdt
-rw-r--r--.  1 ec2-user ec2-user       1437 Jul 20 05:50 _6.fdx
-rw-r--r--.  1 ec2-user ec2-user        195 Jul 20 05:50 _6.fnm
-rw-r--r--.  1 ec2-user ec2-user        464 Jul 20 05:50 _6.si
-rw-r--r--.  1 ec2-user ec2-user 1228800100 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vec
-rw-r--r--.  1 ec2-user ec2-user       9625 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vem
-rw-r--r--.  1 ec2-user ec2-user   31838005 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vex
-rw-r--r--.  1 ec2-user ec2-user        154 Jul 20 05:50 segments_6
-rw-r--r--.  1 ec2-user ec2-user          0 Jul 19 22:29 write.lock
jmazanec15 commented 1 year ago

@benwtrent I uncommented these 2 lines: https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java#L699-L702 and set max buffer to 1000000.

Edit: I take that back. I don't think I compiled with these changes. But I did see one segment produced in the end (_6_), suggesting that the merge to 1 segment did happen. Regardless, I will re-run with the changes.

jmazanec15 commented 1 year ago

Update: I passed -forceMerge to KnnGraphTester and confirmed recall was again 0.991, confirming results above.

benwtrent commented 1 year ago

@jmazanec15 I followed your steps with the same data (forcemerging as well)

Instead of using dot_product as is, I focused on the non-negative case (which is what it would be if we supported this). So I used your piecewise transformation (negatives are between 0-1 and positives are unscaled scores of 1+).

This is what I got:

recall  latency nDoc    fanout  maxConn beamWidth   visited   index ms
0.989    2.74   400000  200 32  200         210   683712    1.00    post-filter

So, 0.989 recall at 2.7ms per query, taking 683712ms to build the index. Not too shabby. It's interesting how the scaling slightly changes the recall number.

We should verify this is ok by feeding the docs in a random order. We might be getting lucky in the graph building.

benwtrent commented 1 year ago

I updated the script for gathering the data to handle adversarial cases of magnitudes in order and reverse order.

I have run the in-order version so far; testing the rest now.

ORDERED

WARNING: Gnuplot module not present; will not make charts
recall  latency nDoc    fanout  maxConn beamWidth   visited index ms
0.741    0.33   400000  0   32  200 10  0   1.00    post-filter
0.979    1.67   400000  90  32  200 100 0   1.00    post-filter
0.992    2.89   400000  190 32  200 200 0   1.00    post-filter

Updated script

```python
import numpy as np
import pyarrow.parquet as pq

tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np4 = tb4[0].to_numpy()
np3 = tb3[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

np_flat_ds = np.array(flat_ds)

# Shape is (485859, 768) and dtype is float32
np_flat_ds

with open("wiki768.test", "w") as out_f:
    np_flat_ds[475858:-1].tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
np_flat_ds_sorted = np_flat_ds[indices]

with open("wiki768.ordered.train", "w") as out_f:
    np_flat_ds_sorted.tofile(out_f)

with open("wiki768.reversed.train", "w") as out_f:
    np.flip(np_flat_ds_sorted).tofile(out_f)

with open("wiki768.random.train", "w") as out_f:
    np.random.shuffle(np_flat_ds_sorted)
    np_flat_ds_sorted.tofile(out_f)
```
searchivarius commented 1 year ago

thank you @benwtrent, you didn't try the transform yet, did you? You can easily convert vectors using, e.g., numpy; it's along the lines of adding one extra dimension that is zero for the query, and the document vector D becomes:

Old dimensions are normalized by the max document norm: D/max_doc_norm. One "fake" dimension is added: sqrt(1 - |D|^2/max_doc_norm^2).
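(With that transform every document vector ends up with the same norm, while the query gets a 0 in the extra coordinate, so the transformed inner product equals the original one up to a constant factor; cosine or Euclidean ranking over the transformed vectors therefore reproduces the original inner-product ranking.)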

benwtrent commented 1 year ago

@searchivarius I haven't. Here are the "reversed" numbers, obviously, this is where there is an issue in the adversarial case:

recall  latency nDoc    fanout  maxConn beamWidth   visited index ms
0.147    0.31   400000  0   32  200 10  0   1.00    post-filter
0.526    1.78   400000  90  32  200 100 0   1.00    post-filter
0.679    3.16   400000  190 32  200 200 0   1.00    post-filter
0.859    6.76   400000  490 32  200 500 0   1.00    post-filter

I can see about testing with a transformed set of vectors soonish.

benwtrent commented 1 year ago

Unless @jmazanec15 gets to testing the transformed vectors in reverse order before I do ;)

jmazanec15 commented 1 year ago

@benwtrent make sure to set maxConn to 48.

Also, I see I made a mistake setting fanout to 200 - should be 190 as you did.

Unless @jmazanec15 gets to testing the transformed vectors in reverse order before I do ;)

Yes, I can run this - if I cannot get to it today, I will get to it tomorrow.

One last thing: From these results, we are trying to decide if transformation is required now, correct?

benwtrent commented 1 year ago

@benwtrent make sure to set maxConn to 48.

🤦 yep! Here is with the higher max conn. Sort of better.

recall  latency nDoc    fanout  maxConn beamWidth   visited index ms
0.145    0.35   400000  0   48  200 10  0   1.00    post-filter
0.553    1.94   400000  90  48  200 100 0   1.00    post-filter
0.709    3.47   400000  190 48  200 200 0   1.00    post-filter
0.878    7.92   400000  490 48  200 500 0   1.00    post-filter

One last thing: From these results, we are trying to decide if transformation is required now, correct?

I think so. I honestly don't know if we want to worry about this purposefully adversarial case :/. If things are random, Lucene does perfectly well as is.

searchivarius commented 1 year ago

@benwtrent I don't think there's any truly adversarially robust ML algorithm. With PGD (projected gradient descent) I can drive the accuracy of any unprotected DL model to zero. Protected models have low clean accuracy, so you can't use them in production.

jmazanec15 commented 1 year ago

🤦 yep! Here is with the higher max conn. Sort of better.

Right, I was thinking this might explain the recall discrepancy for the dot product score change (0.989 vs 0.991).

I ran the tests for non-transformed and the numbers seem pretty similar across the board:

### Random (default order)
recall  latency nDoc  fanout  maxConn beamWidth visited index ms
0.715    0.79   400000  0       48      200     10      1910428 1.00    post-filter
0.973    3.87   400000  90      48      200     100     1923226 1.00    post-filter
0.990    6.76   400000  190     48      200     200     1927580 1.00    post-filter
0.998   13.78   400000  490     48      200     500     1917602 1.00    post-filter

### Ascend
recall  latency nDoc  fanout  maxConn beamWidth visited index ms
0.771    0.89   400000  0       48      200     10      2093236 1.00    post-filter
0.983    4.45   400000  90      48      200     100     2095450 1.00    post-filter
0.993    7.88   400000  190     48      200     200     2094090 1.00    post-filter
0.998   16.08   400000  490     48      200     500     2112938 1.00    post-filter

### Descend
recall  latency nDoc  fanout  maxConn beamWidth visited index ms
0.710    0.79   400000  0       48      200     10      1915806 1.00    post-filter
0.973    3.73   400000  90      48      200     100     1910817 1.00    post-filter
0.991    6.55   400000  190     48      200     200     1898517 1.00    post-filter
0.998   13.25   400000  490     48      200     500     1912997 1.00    post-filter

@benwtrent For your results, I see that visited was 0 which might mean there is some kind of bug.

I transformed the data (thanks @searchivarius for help), and I got results that had overall lower recall, but were a little bit faster:

### Random (default order)
recall  latency nDoc  fanout  maxConn beamWidth visited index ms
0.359    0.36   400000  0       48      200     10      1464332 1.00    post-filter
0.728    1.39   400000  90      48      200     100     1457250 1.00    post-filter
0.801    2.43   400000  190     48      200     200     1471881 1.00    post-filter
0.874    5.28   400000  490     48      200     500     1458984 1.00    post-filter

### Ascend
recall  latency nDoc  fanout  maxConn beamWidth visited index ms
0.289    0.31   400000  0       48      200     10      1315149 1.00    post-filter
0.705    1.17   400000  90      48      200     100     1312877 1.00    post-filter
0.794    2.00   400000  190     48      200     200     1316609 1.00    post-filter
0.877    4.32   400000  490     48      200     500     1303967 1.00    post-filter

### Descend
recall  latency nDoc  fanout  maxConn beamWidth visited index ms
0.211    1.20   400000  0       48      200     10      2321339 1.00    post-filter
0.691    6.57   400000  90      48      200     100     2312672 1.00    post-filter
0.814   11.75   400000  190     48      200     200     2313213 1.00    post-filter
0.926   26.31   400000  490     48      200     500     2307567 1.00    post-filter

Based on these results and the papers @searchivarius shared, I think it's probably okay to not add this transform now.

Here is the script I used for transforming the data set:

```python
import numpy as np
import pyarrow.parquet as pq

def transform_queries(Q):
    n, _ = Q.shape
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)

def transform_docs(D, norms):
    n, d = D.shape
    max_norm = magnitudes.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    transformed_data = np.concatenate([D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32)
    return transformed_data

def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"

def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, "Shape does not map"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)

tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

transformed_queries = transform_queries(np_flat_ds[475858:-1])
validate_dataset_match_upto_dim(transformed_queries, np_flat_ds[475858:-1], 768)

with open("wiki768.test", "w") as out_f:
    transformed_queries.tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds[0:400000], magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds[0:400000], 768)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]

with open("wiki768.random.train", "w") as out_f:
    transformed_np_flat_ds.tofile(out_f)

with open("wiki768.ordered.train", "w") as out_f:
    transformed_np_flat_ds_sorted.tofile(out_f)

with open("wiki768.reversed.train", "w") as out_f:
    np.flip(transformed_np_flat_ds_sorted).tofile(out_f)
```
searchivarius commented 1 year ago

Hi @jmazanec15 and @benwtrent: thanks a lot for testing. For higher recalls (somewhat above or below 0.8) the transformation seems to lead to a substantial increase in latency, not only for random, but also for ascend and descend modes.

benwtrent commented 1 year ago

@benwtrent For your results, I see that visited was 0 which might mean there is some kind of bug.

No, visited was correct, that 0 was for index build time. I only build the index once and then run the queries multiple times with different fanOut parameters. This way I don't pay the cost of reindex on every run unnecessarily :).

Thank you both for all this testing. I will verify the "reversed" numbers as those have the biggest discrepancy between @jmazanec15 results and mine.

The only difference I know of is that I did not allow negative scores and instead used the piecewise transformation in the original issue comment.

benwtrent commented 1 year ago

OK, I reran my experiments. I ran two: one with the reversed, non-transformed data (so the dimension within knnPerf is 768) and one with the reversed, transformed data (769 dimensions).

Reverse not transformed (768 dims)

recall  latency nDoc    fanout  maxConn beamWidth   visited index ms
0.145    0.38   400000  0   48  200 10  0   1.00    post-filter
0.553    2.05   400000  90  48  200 100 0   1.00    post-filter
0.709    3.66   400000  190 48  200 200 0   1.00    post-filter
0.878    8.05   400000  490 48  200 500 0   1.00    post-filter

Reversed transformed (769 dims)

recall  latency nDoc    fanout  maxConn beamWidth   visited index ms
0.211   0.49    400000  0   48  200 10  0   1.00    post-filter
0.691   2.80    400000  90  48  200 100 0   1.00    post-filter
0.814   5.14    400000  190 48  200 200 0   1.00    post-filter
0.926   11.31   400000  490 48  200 500 0   1.00    post-filter

Recall seems improved for me. Latency increases on the transformed data. I bet part of this is also the overhead of dealing with CPU execution lanes in Panama, as it's no longer a "nice" number of dimensions.

So, my transformed numbers match @jmazanec15's results exactly. However, I am getting a large discrepancy on my non-transformed numbers.

@jmazanec15 here is the code I used to generate my "reverse" non-transformed data. Could you double check and make sure your descending-case data does the same?

There is something significant here that we are missing.

import numpy as np
import pyarrow.parquet as pq

tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
np_flat_ds_sorted = np_flat_ds[indices]

with open("wiki768.reversed.train", "w") as out_f:
    np.flip(np_flat_ds_sorted).tofile(out_f)
searchivarius commented 1 year ago

@benwtrent @jmazanec15

Recall seems improved for me. Latency increases in the transformed data. I bet part of this is also the overhead of dealing with CPU execution lanes in Panama as its no longer a "nice" number of dimensions.

You need to look at the curves. The transform changes both recall and latency. The key question is: are we still on the same Pareto curve or not? Because if we are, getting a higher recall is merely a matter of choosing a larger M or ef, and you do not need to support the transform in Lucene.