Closed: jmazanec15 closed this issue 1 year ago
I honestly would much prefer removing the normalization restriction and not doing anything to the output. IMO, plain ol' dot_product is perfectly fine at scoring, as higher values mean more relevant (just like Lucene scoring). I am not sure why this was scaled at all.
But since this changes how things are scored (and that would be considered a breaking change, correct?), why not remove the scaling altogether and let the score be the result of dot_product directly?
Ok, did a bunch of reading on MAX Block WAND and MAXScore :/. I guess MBW is the reason for us bounding these vector scores somehow.
Technically, max inner product is unbounded. Encoding information in the vector magnitude is useful (especially for multi-lingual search from what I gather).
My questions would be:
@benwtrent Thanks for taking a look! Interesting, I am not too familiar with MBW. I'll take a look.
The main reason I wanted to avoid returning the dot product was to avoid negative scores, as referenced by https://github.com/apache/lucene/issues/9044.
Also, what do you mean by scaling continuously? The above formula gives negative and positive scores the same number of possible scores, reducing overall precision by 2 (please correct me if I am wrong).
Also, what do you mean by scaling continuously?
Your algorithm is piecewise vs. continuous. But I am not sure how we could do a continuous transformation (everything is on the same scale). EDIT: Thinking more, I am not sure we would want to, and your equation is OK. More thought here is required.
I am not too familiar with MBW.
Yeah, MBW and MAXSCORE get all mixed up in my brain. But, yes MAXSCORE is why we disallow negative scoring. Forgive my misdirection.
The above formula gives negative and positive scores the same number of possible scores, reducing overall precision by 2
My main concern overall is this: we are changing the scoring methodology for positive scores (which are currently considered "valid"). I think this means that it cannot go into a Lucene 9.x release (correct, @jpountz?).
What do you think @msokolov? Maybe we add a new MAX_INNER_PRODUCT scoring that uses @jmazanec15's suggestion?
hmm, my view is dot_product is only usable if your vectors are normalized, as documented. I also don't think we can change the scoring formula in a minor release. As for producing a D.P. that is scaled for use with arbitrary vectors I don't see the point really. If what you want is to handle arbitrary scaled vectors, EUCLIDEAN is a better choice. It will produce the same rank as DOT_PRODUCT for normalized vectors and has the meaning of an actual metric (satisfies the triangle inequality). What does D.P. even mean for random vectors? What if one of the vectors is zero? Then it is equidistant to every other vector?
I guess I'd want to see a real use case before working to support this use case that seems weird to me. And honestly there are other distances that seem more useful (L1 norm, for example)
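msokolov's observation that EUCLIDEAN produces the same rank as DOT_PRODUCT for normalized vectors follows from the identity |a - b|^2 = 2 - 2(a·b) for unit vectors. A quick numpy sketch of that equivalence (illustrative only, not Lucene code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Unit-normalize a query and a few candidate vectors.
q = rng.normal(size=8)
q /= np.linalg.norm(q)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# For unit vectors, |q - d|^2 = 2 - 2 * (q . d), so ranking by increasing
# Euclidean distance is identical to ranking by decreasing dot product.
by_dot = np.argsort(-(docs @ q))
by_euclidean = np.argsort(np.linalg.norm(docs - q, axis=1))
assert (by_dot == by_euclidean).all()
```

For non-normalized vectors the two orderings diverge, which is exactly the use case under discussion.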
Historically Lucene did not have restrictions on scores, but approaches for dynamic pruning like MAXSCORE and WAND assume that matching another clause would make the score higher, not lower. In hindsight, it made sense that scores should be non-negative, so we updated the contract of Scorer.score() and our built-in similarities to only produce non-negative scores.
I'm not too worried about having a vector similarity that produces unbounded scores from the perspective of MAXSCORE/WAND. The way things work today, the vector query is first rewritten into a query that maps a small set of doc IDs to scores, so we can easily get access to the maximum score over a range of doc IDs, which is what WAND/MAXSCORE need. To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it's hard to know ahead of time whether the vector may completely dominate scores. And also bw compat as MikeS raised.
To me the main concern is more that unbounded scores make it hard to combine scores with another query via a disjunction as it's hard to know ahead of time whether the vector may completely dominate scores.
I say we have that problem now. Vector scores and BM25 are nowhere near on the same scale. Folks need to adjust their boosts accordingly, regardless.
And also bw compat as MikeS raised.
I agree, BWC is a big deal here. And I suggest we create a new similarity that just uses dot product under the hood. Call it maximum-inner-product.
As for producing a D.P. that is scaled for use with arbitrary vectors I don't see the point really. If what you want is to handle arbitrary scaled vectors, EUCLIDEAN is a better choice.
Quoting a Slack conversation with @nreimers:
Wow, what a bad implementation by Elastic. Models with unnormalized vectors and dot product work better for search than models with normalized vectors / cosine similarity. Models with cosine similarity have the issue that they often retrieve noise when your dataset gets noisier... The best match for a query (e.g. What is the capital of the US) with cosine similarity is the query itself, as cossim(query, query)=1. So when your corpus gets bigger and is not carefully cleaned, it contains many short documents that look like queries. These are preferentially retrieved by the model, so the user asks a question and gets as a response a doc that is a paraphrase of the query (e.g. query="What is the capital of the US", top-1 hit: Capital of the US). Dot product has the tendency to work better when your corpus gets larger / noisier.
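The failure mode Nils describes can be reproduced with a toy example (the vectors below are made up purely for illustration): a short document that paraphrases the query wins under cosine similarity, while a larger-magnitude informative document can win under dot product:

```python
import numpy as np

q = np.array([1.0, 1.0, 0.0])            # query embedding (toy values)
paraphrase = np.array([1.0, 1.0, 0.0])   # short doc that restates the query
informative = np.array([2.0, 2.0, 1.0])  # longer doc, larger magnitude

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine prefers the paraphrase: cos(q, q) = 1 is unbeatable...
assert cosine(q, paraphrase) > cosine(q, informative)
# ...while dot product can prefer the larger-magnitude informative doc.
assert q @ informative > q @ paraphrase
```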
thanks for the reference @benwtrent, that's an interesting perspective. I wouldn't be opposed to adding a new distance if people find it useful
The way I read this, @jpountz:
I'm not too worried about having a vector similarity that produces unbounded scores from the perspective of MAXSCORE/WAND. The way things work today, the vector query is first rewritten into a query that maps a small set of doc IDs to scores, so we can easily get access to the maximum score over a range of doc IDs, which is what WAND/MAXSCORE need.
is that negative scores are OK here, as the optimization constraints traditionally required do not apply.
If that is the case, I would suggest us adding a new scoring methodology that is simply the dot product and call it maximum inner product.
The current scaling for dot_product only makes sense for normalized vectors and it should only be treated as an optimization.
For byte vectors we already have some guard built in (at least they can't get < 0). See VectorUtil#dotProductScore. In the other issue #12281 I have also seen issues with float vectors that produced infinite floats as dotProduct or NaN as cosine (due to Infinity / Infinity => NaN). We wanted to open a new issue already, so this one fits.
So this also relates to this discussion: should we have some constraints on vectors while they are indexed? In the other PR we added the requirement that all their components be finite.
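The byte-vector guard Uwe mentions can be sketched in Python. Each component product of two signed byte vectors lies in [-2^14, 2^14], so |dot| <= dim * 2^14; dividing by dim * 2^15 keeps the scaled value in [-0.5, 0.5], and shifting by 0.5 keeps the score non-negative. The constants mirror my reading of VectorUtil#dotProductScore and are an assumption, not a verbatim port:

```python
# Sketch of the byte-vector score guard (assumed constants, see lead-in).
def dot_product_score(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 0.5 + dot / (len(a) * (1 << 15))

# Even maximally opposed byte vectors stay non-negative:
assert dot_product_score([-128] * 4, [127] * 4) >= 0.0
# Orthogonal vectors score exactly 0.5:
assert dot_product_score([1, 0], [0, 1]) == 0.5
```

No such fixed bound exists for floats, which is Uwe's point below.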
Is that negative scores here are OK as the optimization constraints traditionally required do not apply.
I was trying to convey that it should be OK not to have upper bounds on the produced scores. I still think scores should always be non-negative.
I think for floats it is not as easy as in the byte case with dotProductScore(). I also referenced this from the other function query issue about what the "default score" should be if one of the documents has no vector. 0 works fine for classical scoring if you only have positive scores.
would it make sense to truncate negative scores to zero? Since we think this is an abuse/misconfiguration, loss of information seems OK, and at least we would be able to guarantee not to violate the "no negative scores" contract. Then, if we want a separate score that scales in an information-preserving way, we can add it.
@msokolov The index / vector DB should return the dot product score as is. No scaling, no truncation.
Using dot product is tremendously useful for embedding models, they perform in asymmetric settings where you want to map a short search query to a longer relevant document (which is the most common case in search) much better than cosine similarity or euclidean distance.
But here the index should return the values as is and it should then be up to the user to truncate negative scores or to normalize these scores to pre-defined ranges.
The problem is that this is not compatible with Lucene.
I would think as long as more negative values are scored lower, we will retrieve documents in a sane manner.
Scaling negatives to restrict them and then not scaling positive values at all could work. The _score wouldn't always be the dot-product exactly, but it allows KNN search to find the most relevant information, even if all of the dot-products are negative when compared with the query vector.
This brings us back to @jmazanec15 suggestion on scaling scores.
Yeah, after consideration, I think we could maybe argue for changing the scaling of negative values given that they were documented as unsupported, even though it would be breaking back-compat in the sense that scores would be changed. But I think we ought to preserve the scaling of non-negative values in case people have scaling factors they use for combining scores with other queries' scores. So we could go with @jmazanec15 suggestion except leaving in place the scale by 1/2?
@msokolov Ah, so negative values would live between (0, 0.5) and positive values would still be between [0.5, ...)?
Yeah. Another thing we could consider is doing this scaling in KnnVectorQuery and/or its Scorer. These have the ultimate responsibility of complying with the Scorer contract. If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.
I think the scores would have to be preserved not only for positive dot products amongst normalized vectors, but also for negative ones, to avoid breaking bwc. I think the current range of valid dot products is [-1, 1] and scores map to [0, 1]. So I don't think we could map all negative values between [0, 0.5].
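jmazanec15's point follows from the existing scaling: with normalized vectors the dot product lies in [-1, 1], and the current DOT_PRODUCT similarity (as I read the Lucene implementation; treat the formula as an assumption) maps it linearly onto [0, 1], so negative dot products already occupy [0, 0.5):

```python
# Existing DOT_PRODUCT scaling for normalized vectors (assumed formula).
def dot_product_similarity(dot):
    return (1.0 + dot) / 2.0

assert dot_product_similarity(-1.0) == 0.0   # opposite directions
assert dot_product_similarity(0.0) == 0.5    # orthogonal
assert dot_product_similarity(1.0) == 1.0    # identical direction
```

Repurposing [0, 0.5) for unbounded negative dot products would therefore change scores that today's normalized-vector users already rely on.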
Yeah. Another thing we could consider is doing this scaling in KnnVectorQuery and/or its Scorer. These have the ultimate responsibility of complying with the Scorer contract. If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.
Wasn't there the possibility to return one score for indexing and another for search? Basically, the VectorSimilarity enum could have a separate method queryScore(v1, v2) that is enforced to be positive. Actually for cosine it's not a problem as it's normalized, so we can add 1 (and, for safety against rounding errors, apply Math.max(0, result)). The absolute values of scores are not important (unless you want to combine them with other query scores, but for that you have query boosts).
If we did it there we wouldn't have to change the output of VectorSimilarity. However it's messy to do it there since this is specific to a particular similarity implementation, so on balance doing it in the similarity makes more sense to me.
I am not sure why we care about separating VectorSimilarity and scoring. VectorSimilarity is only ever used for KNN search and indexing, and as long as vectors that are less similar score lower, it's fine.
If we start thinking about separating out scoring and similarity, we should do it for all the current similarities. This would be significant work and it would be tricky. Think of EUCLIDEAN: we invert its calculation so that a higher score means more similar. So, we would still need to use queryScore as the indexing similarity without significant changes to the underlying assumptions of the graph builder, etc.
If folks want the raw vector distances, they should use VectorUtil.
I think the current range of dot products that are valid is [-1, 1] and scores map to [0, 1]. So I dont think we could map all negative values between [0, 0.5]
I think you are correct @jmazanec15, since normalized vectors lie on the unit sphere. It's possible to have negative values (and thus fall into the [0, 0.5] range) when they point in opposite directions within the sphere. Your scaling method + a new MAX_INNER_PRODUCT similarity (which just uses dotProduct and scales it differently) covers the requirement of disallowing negative scores & non-normalized vectors.
This may complicate things (which 'dotProduct' should I use?!?!?!), but we should not change the existing VectorSimilarityFunction#DOT_PRODUCT. Maybe we can deprecate VectorSimilarityFunction#DOT_PRODUCT usage for new fields in 9.x to encourage switching to MAX_INNER_PRODUCT, and remove VectorSimilarityFunction#DOT_PRODUCT in 10.
@benwtrent I think that makes sense, but would add a little confusion.
How common is it to use vector results with MAX_SCORE/WAND? I am wondering if it would be better to just leave things as they are in 9.x and change the warning in the javadoc to say that non-normalized vectors are supported but should not be used with WAND/MAX_SCORE and can return negative scores. Then switch the score scaling in 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?
And then switch the score to scale in 10 as a breaking change. Or is condoning negative scores under any circumstances a non-starter?
If you are utilizing hybrid search, disabling WAND/MAX_SCORE will slow things down significantly.
We should protect folks from shooting themselves in the foot.
but would add a little confusion.
I agree, there will be confusion. What do you think @uschindler & @msokolov ?
Being able to do non-normalized dot-product is an important aspect of recommendation engines and vector search as a whole. My imagination is too poor to come up with a better solution than adding a new similarity function that uses dot-product under the hood and scales differently.
@jmazanec15 have you done any digging into the dot-product scaling and whether it provides good recall in the MAX-INNER-PRODUCT search use-case?
https://blog.vespa.ai/announcing-maximum-inner-product-search/ and https://towardsdatascience.com/maximum-inner-product-search-using-nearest-neighbor-search-algorithms-c125d24777ef imply there might be some weirdness with HNSW and raw MIP. I am honestly not 100% sure if Lucene has this issue specifically with HNSW.
A key observation in MIP is that a vector is no longer closest to itself; instead, it would be much closer to 2*vector than to vector.
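This self-similarity quirk is easy to check numerically, since <v, 2v> = 2|v|^2 strictly exceeds <v, v> = |v|^2 for any non-zero v:

```python
import numpy as np

v = np.array([0.5, 1.0, -0.25])  # any non-zero vector

# Under inner-product "similarity", v is not its own nearest neighbor:
# <v, 2v> = 2 * |v|^2  >  <v, v> = |v|^2.
assert v @ (2 * v) > v @ v
```

This is the property that breaks the triangle-inequality intuition behind metric-space indexing.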
@benwtrent I have been thinking about this and am still not completely sure of the implications. It seems like the construction of the graphs may rely on some assumption that the underlying space satisfies the triangle inequality. Thus, with an inner-product space, where this does not hold, the graph construction might have problems.
However, graphs aside, with brute force search, utilizing the scaled negative dot product would preserve the ordering of MIPs search.
I will try to think more about this this week.
@jmazanec15 adding the largest magnitude for scoring per segment isn't that bad a change in the codec if it means we can truly support maximum inner product. Plus it would be a change that would help other vector indexing codecs in the future besides HNSW.
@benwtrent Interesting, I'm still not sure this approach is necessary. I spoke with @searchivarius, the maintainer of nmslib, and he mentioned that there is some research suggesting it is not required (https://proceedings.neurips.cc/paper_files/paper/2018/hash/229754d7799160502a143a72f6789927-Abstract.html, https://arxiv.org/pdf/1506.03163.pdf).
Let me try re-running the Vespa experiments with Lucene without the reduction and see what numbers we get. I don't think the blog post included any comparison against the negative dot product approach (please correct me if I am missing something).
thank you @jmazanec15 : there's also an unpublished paper (I can share the preprint privately) where we benchmarked HNSW for maximum inner product search on 3 datasets and it was just fine (for this paper I did try the reduction to the cosine similarity and I also got poorer outcomes). In my thesis, I benchmarked SW-graph (which is pretty much HNSW when it comes to peculiarities of handling the inner product search) using an inner-product like similarity (fusion of BM25 and MODEL1 scores) and it was fine. See the black asterisk run in Figure 3.2.
Moreover, HNSW and SW-graph were tested with non-metric similarities (see again my thesis and references therein), as well as in Yury Malkov's HNSW paper. These methods established SOTA results as well. There is also an extract from the thesis (published separately) that focuses specifically on search with non-metric similarities. Again, things just work.
One may wonder why, right? I think for real datasets the quirky distances don't deviate from the Euclidean distances all that much so the minimal set of geometric properties required for graph based retrieval is preserved (and no I don't think the triangle inequality is required).
Specifically, for inner product search the outcomes are pretty close (in many cases) to those where inner product search is replaced with cosine similarity (which is equivalent to L2 search). Why? Because with real embeddings the magnitude of vectors doesn't change all that much.
That said, there are of course degenerate cases (I know one, but embedding models don't produce such weirdness) where HNSW won't work with MIPS (or rather recall will be low). However, I am not aware of any realistic one. If you have some interesting examples of real datasets where direct application of HNSW/SW-graph fails, I would love to have a look.
REGARDING THE SCORE sign: dot-product scores need not be normalized, but the sign can be changed when the result is returned to the user.
Thank you for the deep information @searchivarius .
eagerly waiting your results @jmazanec15 :)
I ran an initial experiment. It appears that recall without the pre-processing is very high (99.1), compared to 87.4 with the pre-processing, when mimicking one of the experiments from https://blog.vespa.ai/announcing-maximum-inner-product-search/.
That being said, @benwtrent would you be able to double check my experiment setup to ensure I didn't overlook something?
Their experiment used the following data:
And used the following config:
For this, they reported a recall@10 of 87.4
I used luceneutil and set the following parameters:
I got a recall@10 of 99.1:
$ time python src/python/knnPerfTest.py
WARNING: Gnuplot module not present; will not make charts
lucene
{'ndoc': (400000,), 'maxConn': (48,), 'beamWidthIndex': (200,), 'fanout': (200,), 'topK': (10,)}
/home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main
recall latency nDoc fanout maxConn beamWidth visited index ms
['java', '-cp', '/home/ec2-user/candidate/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/ec2-user/candidate/lucene/lucene/sandbox/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/misc/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/facet/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/common/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/analysis/icu/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queryparser/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/grouping/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/suggest/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/highlighter/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/codecs/build/classes/java/main:/home/ec2-user/candidate/lucene/lucene/queries/build/classes/java/main:/home/ec2-user/.gradle/caches/modules-2/files-2.1/com.carrotsearch/hppc/0.9.1/4bf4c51e06aec600894d841c4c004566b20dd357/hppc-0.9.1.jar:/home/ec2-user/candidate/luceneutil/lib/HdrHistogram.jar:/home/ec2-user/candidate/luceneutil/build:/home/ec2-user/candidate/luceneutil/src/main', '--add-modules', 'jdk.incubator.vector', '-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false', 'KnnGraphTester', '-ndoc', '400000', '-maxConn', '48', '-beamWidthIndex', '200', '-fanout', '200', '-topK', '10', '-dim', '768', '-docs', '/home/ec2-user/data-prep/wiki768.train', '-reindex', '-search', '/home/ec2-user/data-prep/wiki768.test', '-metric', 'angular', '-quiet']
WARNING: Using incubator modules: jdk.incubator.vector
0.991 6.98 400000 200 48 200 210 1913700 1.00 post-filter
real 45m3.266s
user 40m8.451s
sys 4m54.290s
@jmazanec15 thank you for running the experiments: what are the speed-ups?
@searchivarius For this, I didn't track latency. I was just checking whether recall changes when using the transformation vs. not using it, based on the experiment Vespa ran.
@jmazanec15 I will try to replicate later today. Quick question, did you merge to a single segment? This will have a dramatic change in recall as searching multiple segments gives you much higher recall with higher latency.
@benwtrent I uncommented these 2 lines: https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java#L699-L702 and set max buffer to 1000000.
This is the index as well:
$ ls -la wiki768.train-48-200.index
total 1232744
drwxr-xr-x. 2 ec2-user ec2-user 16384 Jul 20 05:50 .
drwxr-xr-x. 12 ec2-user ec2-user 16384 Jul 19 23:26 ..
-rw-r--r--. 1 ec2-user ec2-user 159 Jul 20 05:50 _6.fdm
-rw-r--r--. 1 ec2-user ec2-user 1619007 Jul 20 05:50 _6.fdt
-rw-r--r--. 1 ec2-user ec2-user 1437 Jul 20 05:50 _6.fdx
-rw-r--r--. 1 ec2-user ec2-user 195 Jul 20 05:50 _6.fnm
-rw-r--r--. 1 ec2-user ec2-user 464 Jul 20 05:50 _6.si
-rw-r--r--. 1 ec2-user ec2-user 1228800100 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vec
-rw-r--r--. 1 ec2-user ec2-user 9625 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vem
-rw-r--r--. 1 ec2-user ec2-user 31838005 Jul 20 05:50 _6_Lucene95HnswVectorsFormat_0.vex
-rw-r--r--. 1 ec2-user ec2-user 154 Jul 20 05:50 segments_6
-rw-r--r--. 1 ec2-user ec2-user 0 Jul 19 22:29 write.lock
@benwtrent I uncommented these 2 lines: https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java#L699-L702 and set max buffer to 1000000.
Edit: I take that back. I don't think I compiled with these changes. But I did see one segment produced in the end (_6_), suggesting that the merge to 1 segment did happen. Regardless, I will re-run with the changes.
Update: I passed -forceMerge to KnnGraphTester and confirmed recall was again 0.991, confirming results above.
@jmazanec15 I followed your steps with the same data (forcemerging as well)
Instead of using dot_product as is, I focused on the non-negative case (which is what it would be if we supported this). So I used your piecewise transformation (negatives are between 0 and 1, and positives are unscaled scores of 1+).
This is what I got:
recall latency nDoc fanout maxConn beamWidth visited index ms
0.989 2.74 400000 200 32 200 210 683712 1.00 post-filter
So, 0.989 recall at 2.7ms per query, taking 683712ms to build the index. Not too shabby. It's interesting how the scaling slightly changes the recall number.
We should verify this is OK by feeding the docs in a random order. We might be getting lucky in the graph building.
I updated the script for gathering the data to handle adversarial cases of magnitudes in order and reverse order.
I have run the in-order version so far; testing the rest now.
ORDERED
WARNING: Gnuplot module not present; will not make charts
recall latency nDoc fanout maxConn beamWidth visited index ms
0.741 0.33 400000 0 32 200 10 0 1.00 post-filter
0.979 1.67 400000 90 32 200 100 0 1.00 post-filter
0.992 2.89 400000 190 32 200 200 0 1.00 post-filter
thank you @benwtrent: you didn't try the transform yet, did you? You can easily convert vectors using, e.g., numpy. It's along the lines of adding one extra dimension that is zero for the query, while each document vector D becomes:
Old dimensions are normalized by the max document norm: D/max_doc_norm. One "fake" dimension is added: sqrt(1 - |D|^2/max_doc_norm^2)
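A numpy sketch of this MIPS-to-cosine reduction (function and variable names are mine; it appends sqrt(1 - |D|^2/max_norm^2) so every transformed document is a unit vector while preserving inner-product order):

```python
import numpy as np

def transform_docs(docs):
    """Reduce MIPS to cosine/L2 search by appending one extra dimension."""
    max_norm = np.linalg.norm(docs, axis=1).max()
    scaled = docs / max_norm
    # Extra component sqrt(1 - |D|^2 / max_norm^2) makes every transformed
    # document a unit vector (clip guards against tiny negative rounding).
    extra = np.sqrt(np.clip(1.0 - (scaled ** 2).sum(axis=1), 0.0, None))
    return np.hstack([scaled, extra[:, None]])

def transform_query(q):
    # The query gets 0 in the extra dimension, so the transformed dot
    # product is proportional to the original inner product.
    return np.append(q, 0.0)

docs = np.random.default_rng(0).normal(size=(100, 8))
tdocs = transform_docs(docs)
assert np.allclose(np.linalg.norm(tdocs, axis=1), 1.0)

q = np.random.default_rng(1).normal(size=8)
# Ranking by transformed dot product matches ranking by raw inner product.
assert (np.argsort(tdocs @ transform_query(q)) == np.argsort(docs @ q)).all()
```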
@searchivarius I haven't. Here are the "reversed" numbers, obviously, this is where there is an issue in the adversarial case:
recall latency nDoc fanout maxConn beamWidth visited index ms
0.147 0.31 400000 0 32 200 10 0 1.00 post-filter
0.526 1.78 400000 90 32 200 100 0 1.00 post-filter
0.679 3.16 400000 190 32 200 200 0 1.00 post-filter
0.859 6.76 400000 490 32 200 500 0 1.00 post-filter
I can see about testing with a transformed set of vectors soonish.
Unless @jmazanec15 gets to testing the transformed vectors in reverse order before I do ;)
@benwtrent make sure to set maxConn to 48.
Also, I see I made a mistake setting fanout to 200 - should be 190 as you did.
Unless @jmazanec15 gets to testing the transformed vectors in reverse order before I do ;)
Yes, I can run this - if I cannot get to it today, I will get to it tomorrow.
One last thing: from these results, we are trying to decide whether the transformation is required, correct?
@benwtrent make sure to set maxConn to 48.
🤦 yep! Here is with the higher max conn. Sort of better.
recall latency nDoc fanout maxConn beamWidth visited index ms
0.145 0.35 400000 0 48 200 10 0 1.00 post-filter
0.553 1.94 400000 90 48 200 100 0 1.00 post-filter
0.709 3.47 400000 190 48 200 200 0 1.00 post-filter
0.878 7.92 400000 490 48 200 500 0 1.00 post-filter
One last thing: From these results, we are trying to decide if transformation is required now, correct?
I think so. I honestly don't know if we want to worry about this purposefully adversarial case :/. If things are random, Lucene does perfectly well as is.
@benwtrent I don't think there's any truly adversarially robust ML algorithm. With PGD I can drive accuracy of any unprotected DL model to zero. Protected models have low clean accuracy so you can't use them in production
🤦 yep! Here is with the higher max conn. Sort of better.
Right, I was thinking this might explain the recall discrepancy for the dot-product score change (0.989 vs 0.991).
I ran the tests for non-transformed and the numbers seem pretty similar across the board:
### Random (default order)
recall latency nDoc fanout maxConn beamWidth visited index ms
0.715 0.79 400000 0 48 200 10 1910428 1.00 post-filter
0.973 3.87 400000 90 48 200 100 1923226 1.00 post-filter
0.990 6.76 400000 190 48 200 200 1927580 1.00 post-filter
0.998 13.78 400000 490 48 200 500 1917602 1.00 post-filter
### Ascend
recall latency nDoc fanout maxConn beamWidth visited index ms
0.771 0.89 400000 0 48 200 10 2093236 1.00 post-filter
0.983 4.45 400000 90 48 200 100 2095450 1.00 post-filter
0.993 7.88 400000 190 48 200 200 2094090 1.00 post-filter
0.998 16.08 400000 490 48 200 500 2112938 1.00 post-filter
### Descend
recall latency nDoc fanout maxConn beamWidth visited index ms
0.710 0.79 400000 0 48 200 10 1915806 1.00 post-filter
0.973 3.73 400000 90 48 200 100 1910817 1.00 post-filter
0.991 6.55 400000 190 48 200 200 1898517 1.00 post-filter
0.998 13.25 400000 490 48 200 500 1912997 1.00 post-filter
@benwtrent For your results, I see that visited was 0 which might mean there is some kind of bug.
I transformed the data (thanks @searchivarius for help), and I got results that had overall lower recall, but were a little bit faster:
### Random (default order)
recall latency nDoc fanout maxConn beamWidth visited index ms
0.359 0.36 400000 0 48 200 10 1464332 1.00 post-filter
0.728 1.39 400000 90 48 200 100 1457250 1.00 post-filter
0.801 2.43 400000 190 48 200 200 1471881 1.00 post-filter
0.874 5.28 400000 490 48 200 500 1458984 1.00 post-filter
### Ascend
recall latency nDoc fanout maxConn beamWidth visited index ms
0.289 0.31 400000 0 48 200 10 1315149 1.00 post-filter
0.705 1.17 400000 90 48 200 100 1312877 1.00 post-filter
0.794 2.00 400000 190 48 200 200 1316609 1.00 post-filter
0.877 4.32 400000 490 48 200 500 1303967 1.00 post-filter
### Descend
recall latency nDoc fanout maxConn beamWidth visited index ms
0.211 1.20 400000 0 48 200 10 2321339 1.00 post-filter
0.691 6.57 400000 90 48 200 100 2312672 1.00 post-filter
0.814 11.75 400000 190 48 200 200 2313213 1.00 post-filter
0.926 26.31 400000 490 48 200 500 2307567 1.00 post-filter
Based on these results and the papers @searchivarius shared, I think it's probably okay not to add this transform now.
Hi @jmazanec15 and @benwtrent: thanks a lot for testing. For higher recalls (somewhat above or below 0.8), the transformation seems to lead to a substantial increase in latency. Not only for random, but also for ascend and descend modes.
@benwtrent For your results, I see that visited was 0 which might mean there is some kind of bug.
No, visited was correct; that 0 was for index build time. I only build the index once and then run the queries multiple times with different fanout parameters. This way I don't pay the cost of a reindex on every run unnecessarily :).
Thank you both for all this testing. I will verify the "reversed" numbers as those have the biggest discrepancy between @jmazanec15 results and mine.
The only difference I know of is that I did not allow negative scores and instead used the piecewise transformation in the original issue comment.
OK, I reran my experiments. I ran two: one with reverse non-transformed data (so the dimension within knnPerf is 768) and one with reverse transformed data (dimensions are 769).
recall latency nDoc fanout maxConn beamWidth visited index ms
0.145 0.38 400000 0 48 200 10 0 1.00 post-filter
0.553 2.05 400000 90 48 200 100 0 1.00 post-filter
0.709 3.66 400000 190 48 200 200 0 1.00 post-filter
0.878 8.05 400000 490 48 200 500 0 1.00 post-filter
recall latency nDoc fanout maxConn beamWidth visited index ms
0.211 0.49 400000 0 48 200 10 0 1.00 post-filter
0.691 2.80 400000 90 48 200 100 0 1.00 post-filter
0.814 5.14 400000 190 48 200 200 0 1.00 post-filter
0.926 11.31 400000 490 48 200 500 0 1.00 post-filter
Recall seems improved for me. Latency increases with the transformed data. I bet part of this is the overhead of dealing with CPU execution lanes in Panama, as 769 is no longer a "nice" number of dimensions.
So, my transformed numbers match @jmazanec15's results exactly. However, I am getting some extreme discrepancy on my non-transformed data.
@jmazanec15 here is the code I used to generate my "reverse" non-transformed data. Could you double check and make sure your descending case data does the same?
There is something significant here that we are missing.
import numpy as np
import pyarrow.parquet as pq
tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])
np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()
np_total = np.concatenate((np1, np2, np3, np4))
#Have to convert to a list here to get
#the numpy ndarray's shape correct later
#There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)
#Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)
magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
np_flat_ds_sorted = np_flat_ds[indices]
with open("wiki768.reversed.train", "wb") as out_f:
    # NOTE: np.flip with no axis argument reverses every axis, so this also
    # reverses each vector's components, not just the document order
    np.flip(np_flat_ds_sorted).tofile(out_f)
@benwtrent @jmazanec15
Recall seems improved for me. Latency increases in the transformed data. I bet part of this is also the overhead of dealing with CPU execution lanes in Panama as its no longer a "nice" number of dimensions.
You need to look at the curves. The transform changes both recall and latency. The key question is: are we still on the same Pareto curve or not? Because if we are, getting higher recall is merely a matter of choosing a larger M or ef, and you would not need to support the transform in Lucene.
Currently, the VectorSimilarityFunction.DOT_PRODUCT function can return negative scores if the input vectors are not normalized. (For reference, see that method's implementation.)
While the method's javadoc warns to normalize the vectors before use, I am wondering if we can get rid of this restriction by mapping negative scores between 0 and 1 and positive scores between 1 and Float.MAX_VALUE with a piecewise transformation, and let the user worry about normalization.
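One piecewise mapping consistent with this description (a sketch only; not necessarily the exact formula from the proposal) compresses negatives into (0, 1) and shifts non-negatives into [1, inf), preserving order across the boundary:

```python
# Illustrative piecewise scaling: negatives -> (0, 1), non-negatives -> [1, inf).
def scaled_dot_product_score(dot):
    if dot < 0:
        return 1.0 / (1.0 - dot)   # (-inf, 0) maps to (0, 1)
    return dot + 1.0               # [0, inf) maps to [1, inf)

assert 0.0 < scaled_dot_product_score(-1000.0) < 1.0
assert scaled_dot_product_score(0.0) == 1.0
assert scaled_dot_product_score(5.0) == 6.0
# Ordering is preserved across the negative/positive boundary.
assert scaled_dot_product_score(-0.1) < scaled_dot_product_score(0.1)
```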
Related issue: https://github.com/opensearch-project/k-NN/issues/865