Robert Muir (@rmuir) (migrated from JIRA)
I don't "strongly object" but I question the approach of just raising the limit to satisfy whatever shitty models people come up with. At some point we should have a limit, and people should do dimensionality reduction.
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
I also don't have an objection to increasing it a bit. But along the same lines as Robert's point, it'd be good to think about our decision making process – otherwise we'd be tempted to continuously increase it. I've already heard users requesting 12288 dims (to handle OpenAI DaVinci embeddings).
Two possible approaches I could see:
I feel a bit better about approach 2 because I'm not confident I could come up with a statement about a "reasonable max dimension", especially given the fast-moving research.
Robert Muir (@rmuir) (migrated from JIRA)
I think the major problem is still no Vector API in the java APIs. It changes this entire conversation completely when we think about this limit.
If OpenJDK would release this low-level vector API, or barring that, maybe some way to MR-JAR for it, or barring that, maybe some intrinsics such as SloppyMath.dotProduct and SloppyMath.matrixMultiply, maybe Java wouldn't become the next COBOL.
Stanislav Stolpovskiy (migrated from JIRA)
I don't think there is a trend to increase dimensionality. Only a few models have feature dimensions of more than 2048.
Most modern neural networks (ViT and the whole BERT family) have dimensions less than 1k.
However, there are still many models, like ms-resnet or EfficientNet, that operate in the range from 1k to 2048.
And they are the most common models for image embedding and vector search.
The current limit forces dimensionality reduction for pretty standard shapes.
Michael Sokolov (@msokolov) (migrated from JIRA)
We should not be imposing an arbitrary limit that prevents people with CNNs (image-processing models) from using this feature. It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable - that's a question for users to decide for themselves. Of course we don't know where that place is that we might want to optimize in the future (Rob and I discussed an idea using all-integer math that would suffer from overflow), but still we should not just allow MAX_INT dimensions, I think. To me a limit like 16K makes sense – well beyond any stated use case, but not effectively infinite?
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
@sstolpovskiy
@msokolov Thanks for providing your suggestions. It looks like we clearly see the need for up to 2048 dims for images, so I will be merging the linked PR.
Robert Muir (@rmuir) (migrated from JIRA)
My questions are still unanswered. Please don't merge the PR when there are standing objections!
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
Sorry, maybe I should have provided more explanation.
Robert Muir (@rmuir) (migrated from JIRA)
The problem is that nobody will ever want to reduce the limit in the future. Let's be honest, once we support a limit of N, nobody will want to ever make it smaller because of the potential users who wouldn't be able to use it anymore.
So because this is a "one-way" decision, it needs serious justification, benchmarks, etc. Regardless of how the picture looks, it's definitely not something we should be "rushing" into 9.3.
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
Got it, thanks, I will not rush, and will try to provide benchmarks.
Michael Wechner (@michaelwechner) (migrated from JIRA)
Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048 and that one can set a different limit programmatically on the IndexWriter/Reader/Searcher?
Marcus Eagan (@marcussorealheis) (migrated from JIRA)
@michaelwechner You are free to increase the dimension limit as it is a static variable and Lucene is your oyster. However, @erikhatcher has seared in my mind that a long-term fork of Lucene is a bad idea for many reasons.
My argument for increasing the limit of dimensions is not to suggest that there is a better fulcrum in the performance tradeoff, but that more users testing Lucene is good for improving the feature.
OpenAI's DaVinci is one such model, but not the only one.
I've had customers ask for 4096 based on the performance they observe with question answering. I'm waiting on the model and will share when I know. If customers want to introduce rampant numerical errors in their systems, there is little we can do for them. Don't take my word on any of this yet. I need to bring data and complete evidence. I'm asking my customers why they cannot do dimensionality reduction.
Michael Sokolov (@msokolov) (migrated from JIRA)
> Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048 and that one can set a different limit programmatically on the IndexWriter/Reader/Searcher?
I think the idea is to protect ourselves from accidental booboos; this could eventually get exposed in some shared configuration file, and then if somebody passes MAX_INT it could lead to allocating huge buffers somewhere and taking down a service shared by many people/groups? Hypothetical, but it's basically following the principle that we should be strict to help stop people shooting themselves and others in the feet. We may also want to preserve our ability to introduce optimizations that rely on some limits to the size, which would become difficult if usage of larger sizes became entrenched. (We can't so easily take it back once it's out there.) Having said that, I still feel a 16K limit, while allowing for models that are beyond reasonable, wouldn't cause any of these sorts of issues, so that's the number I'm advocating.
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
> It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable - that's a question for users to decide for themselves.
Mike's perspective makes sense to me too. I'd be supportive of increasing the limit to an upper bound. Maybe we could run a test with ~1 million synthetic vectors with the proposed max dimension (~16K) to check there are no failures or unexpected behavior?
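For concreteness, here is a rough sketch (not an agreed-upon benchmark) of what such a synthetic stress test could look like. It assumes a build where the max-dimension check has already been raised; the field name, index path, and sizes are illustrative only.

```java
import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class SyntheticVectorStressTest {
  public static void main(String[] args) throws Exception {
    int dims = 16_384;          // the proposed upper bound under discussion
    int numDocs = 1_000_000;    // ~1M synthetic vectors
    Random random = new Random(42);
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/vector-stress")), new IndexWriterConfig())) {
      for (int i = 0; i < numDocs; i++) {
        float[] vector = new float[dims];
        for (int j = 0; j < dims; j++) {
          vector[j] = random.nextFloat();   // uniformly random synthetic values
        }
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);            // watch for failures or pathological slowdowns
      }
      writer.forceMerge(1);                 // merging exercises HNSW graph building again
    }
  }
}
```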
Robert Muir (@rmuir) (migrated from JIRA)
My main concern is that it can't be undone, as I mentioned. Nobody will be willing to go backwards. It impacts more than the current implementation; it impacts future implementations as well (different algorithms and data structures). If something like 16k dimensions is allowed, it may prevent even simple optimizations (such as 8-bit width). So it's important to be very conservative.
This is why I make a big deal about it, because of the "one-way" nature of the backwards compatibility associated with this change. It seems this is still not yet understood or appreciated.
Historically, users fight against every limit we have in Lucene, so when people complain about this one, it doesn't bother me (especially when it seems related to one or two bad models/bad decisions unrelated to this project). But these limits are important, especially when features are in their infancy; without them, there is less flexibility and you can find yourself easily "locked in" to a particular implementation.
Robert Muir (@rmuir) (migrated from JIRA)
It is also terrible that this issue says 2048 but somehow that already blew up to 16k here.
-1 to 16K. It's unnecessarily large and puts the project at risk in the future. We can debate 2048.
Lots of things happened since Aug, like the arrival of ChatGPT, and people's increased desire to use OpenAI's state of the art embeddings which are of size 1536. Can you at least please increase it to 1536 for now, while you discuss upper limits?
> Lots of things happened since Aug, like the arrival of ChatGPT, and people's increased desire to use OpenAI's state of the art embeddings which are of size 1536. Can you at least please increase it to 1536 for now, while you discuss upper limits?
Actually it is a one-line change (without any guarantees), see https://github.com/apache/lucene/pull/874/files
If you really want to shoot yourself in the foot: download the source code of Lucene in the version you need for your Elasticsearch instance (I assume you're coming from https://github.com/elastic/elasticsearch/issues/92458), patch it with #874, and then run './gradlew distribution'. Copy the JAR files into your ES distribution. Done.
But it is not certain whether this will blow up, or whether indexes created that way will still be readable with standard Lucene.
Why I made that suggestion: if you are interested, try it out with your dataset and your Elasticsearch server and report back! Maybe you will figure out that performance does not work or memory usage is too high.
I'll preface this by saying I am also skeptical that going beyond 1024 makes sense for most use cases, and scaling is a concern. However, amidst the current excitement to try and use OpenAI embeddings, the first cut at choosing a system to store and use those embeddings was Elasticsearch. Then the 1024 limit was run into, and so various folks are looking at other alternatives largely because of this limit.
The use cases tend to be Q/A, summarization, and recommendation systems for WordPress and Tumblr. There are multiple proof-of-concept systems people have built (typically on top of various TypeScript, JavaScript, or Python libs) which use the OpenAI embeddings directly (and give quite impressive results). Even though I am pretty certain that reducing the dimensions will be a better idea for many of these, the ability to build and prototype on higher dimensions would be extremely useful.
@uschindler @rmuir FWIW We are interested in using Lucene's kNN with 1536 dimensions in order to use OpenAI's embeddings API. We benchmarked a patched Lucene/Solr. We fully understand (we measured it :-P) that there is an increase in memory consumption and latency. Sure thing.
We have applications where dev teams have chosen to work with OpenAI embeddings and where the number of records involved and requests per second make the trade offs of memory and latency perfectly acceptable.
There is a great deal of enthusiasm around OpenAI and releasing a working application ASAP. For many of these the resource cost of 1536 dimensions is perfectly acceptable against the alternative of delaying a pilot to optimize further.
Our work would be a lot easier if Lucene's kNN implementation supported 1536 dimensions without need for a patch.
I'm reminded of the great maxBooleanClauses debate. At least that limit is user configurable (for the system deployer; not the end user doing a query) whereas this new one for kNN is not.
I can understand how we got to this point -- limits often start as hard limits. The current limit even seems high based on what has been said. But users have spoken here on a need to configure Lucene for their use case (such as experimentation within a system they are familiar with) and accept the performance consequences. I would like this to be possible with a System property. This hasn't been expressly asked yet, I think. Why should Lucene, just a library that doesn't know what's best for the user, prevent a user from being able to do that?
This isn't an inquiry about why limits exist; of course systems need limits.
Hi @dsmiley I updated the dev discussion on the mailing list: [Proposal] Remove max number of dimensions for KNN vectors
And proceeded with a pragmatic new mail thread, where we just collect proposals with a motivation (no discussion there): Dimensions Limit for KNN vectors - Next Steps
Feel free to participate! My intention is to act relatively fast (and then also operate on the Solr side). It's a train we don't need/want to miss!
The rabbit hole that is trying to store OpenAI embeddings in Elasticsearch eventually leads here. I read the entire thread, and unless I am missing something, the obvious move is to make the limit configurable (up to a point) or, at a minimum, increase the limit to 1536 to support the text-embedding-ada-002 model. In other words, there should be a compelling reason not to increase the limit, beyond the fact that it will be hard to reduce in the future.
Cross posting here because I responded to the PR instead of this issue.
...why is it then that GPT-4, which internally represents each token with a vector of more than 8192, still inaccurately recalls information about entities?
I think this comment actually supports @MarcusSorealheis argument? e.g., What's the point in indexing 8K dimensions if it isn't much better at recall than 768?
If the real issue is the use of HNSW, which isn't suitable for this, rather than whether high-dimensionality embeddings have value, then the solution isn't to not provide the feature, but to switch technologies to something more suitable for the type of applications that people use Lucene for.
I may be wrong but it seems like this is where most of the lucene committers here are settling?
Over a decade ago I wanted a high dimension index for some facial recognition and surveillance applications I was working on. I rejected Lucene at first only because it was written in Java and I personally felt something like C++ was a better fit for the high dimension job (no garbage collection to worry about). So I wrote a high dimension indexer for MongoDB inspired by RTree (for the record, its implementation is based on XTree) and wrote it using C++14 preview features (lambda functions were the new hotness on the block and Java didn't even have them yet). Even in C++ back then SIMD wasn't very well supported by the compiler natively, so I had to add all sorts of compiler tricks to squeeze every ounce of vector parallelization to make it performant. C++ has gotten better since then, but I think Java still lags in this area? Even JEP 426 is a ways off (maybe because OpenJDK is holding these things hostage)? So maybe Java is still not the right fit here? I wonder though, does that mean Lucene shouldn't provide dimensionality higher than an arbitrary 1024? Maybe not. I agree dimensionality reduction techniques like PCA should be considered to reduce the storage volume. The problem with that argument is that dimensionality reduction fails when features are weakly correlated. You can't capture the majority of the signal in the first N components and therefore need higher dimensionality. But does that mean that 1024 is still too low to make Lucene a viable option?
Aside from conjecture does anyone have empirical examples where 1024 is too low and what specific Lucene capabilities (e.g., scoring?) would make adding support for dimensions higher than 1024 really worth considering over using dimensionality reduction? If Lucene doesn't do this does it really risk the project becoming irrelevant? That sounds a bit like sensationalism. Even if higher dimensionality is added to the current vector implementation (I'd actually argue we should explore converting BKD to support higher dimensions instead) are we convinced it will reasonably perform without JEP 426 or better SIMD support that's only available in newer JDKs? Can anyone smart here post their benchmarks to substantiate their claims? I know Pinecone (and others) have blogged about their love for RUST for these kinds of applications. Should Lucene just leave this to the job of alternative Search APIs? Maybe something like Tantivy or Rucene? Or is it time we explore a new optional Lucene Vector module that supports cutting edge JDK features through gradle tooling for optimizing the vector use case?
Interested what others think.
While this is not a critique on Lucene's attempt to utilize SIMD via OpenJDK, or any proposed ideas here, it's challenging to envision Lucene emerging as the leading solution for large-scale vector similarity search. This doesn't necessarily imply whether Lucene should or should not integrate such a feature. However, if one were to suggest that this is a critical issue for Lucene's survival, I would question the likelihood of a Java-based engine, laden with keyword search complexities, rising to the top as a vector search solution, regardless of SIMD integration. I would hesitate to wager on its ability to compete with systems exclusively focused on vector similarity search, equipped with first-rate GPU support, and designed to work with existing and future AI-oriented hardware. None of these systems would be developed in Java, nor would they compete with Lucene in the realm of traditional search.
Though it might be beneficial and convenient for Lucene to accommodate this feature, unless the project undergoes a complete overhaul, its survival will likely hinge on the success or failure of its keyword search and faceting capabilities, along with other related features. It appears to be a significant jump to discard all these features into the 'COBOL pile' due to the integration of embeddings. A more plausible scenario is that they will coexist harmoniously, complementing each other's strengths.
Nice little rewrite ChatGPT did there.
[...] it's challenging to envision Lucene emerging as the leading solution for large-scale vector similarity search [...]
For anyone out there who knows that IR's SOTA is still either BM25 or a combination of BM25 and HNSW similarity search, no it isn't. For most of us, actually, Lucene is ideally positioned to remain the IR leader, especially if BM25+HNSW is supported on the same level as BM25-only use cases. Having Java Vector API BM25 and HNSW implementations certainly won't hurt. But letting Lucene users decide and benchmark their own dense vector size (as all Lucene competitors do) is a must.
And I hereby volunteer to help with either or all of the Java Vector API implementations, BM25+HNSW combination workflow helpers, and HNSW support for arbitrarily-sized vectors. I'm not limiting my help to just design and programming, but also benchmarking, documenting, and whatnot. Who and where should I talk to about that?
As for Rust, it's just not on par with Java as far as concurrency goes. Rust is no one's language of choice to build IR distributed web services.
> Nice little rewrite ChatGPT did there.
❗ ❗ ❗
I also vote for the increase of dimensions. I use GPT Ada2 for embeddings and I have to recompile Lucene just to have the constant value increased. Having a system property for changing it would be nice.
Just adding a perspective, we're evaluating options for indexing vectors and need to support OpenAI's second-generation text embedding that outputs 1536 dimensions.
Raising Lucene's max to something like 2048 makes a lot of sense to me. Things are moving very quickly in the LLM space at the moment. I can see there are concerns about performance and scale, but today people don't choose Lucene to index vectors when they need to leverage GPU performance and have massive amounts of vector data. Anyway, just my two cents. :-)
Here's an example of why making the vector dimensions configurable is a bad idea: #12281. That issue shows that each added dimension makes the floating-point errors larger and sometimes also returns NaN. Do we have tests for when multiplying vectors produces NaN?
Copying and pasting here, just for visibility:
Here's an example of why making the vector dimensions configurable is a bad idea: #12281. That issue shows that each added dimension makes the floating-point errors larger and sometimes also returns NaN. Do we have tests for when multiplying vectors produces NaN?
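An illustrative, self-contained demo of the failure mode referenced above (not a Lucene test, and the numbers are contrived): each per-element product stays finite, but accumulating them across enough dimensions overflows float, and the subsequent normalization then yields NaN.

```java
public class FloatOverflowDemo {
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];   // float accumulator, like a naive similarity loop
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] v = new float[2048];
    java.util.Arrays.fill(v, 1.3e19f);      // each product is ~1.7e38, still a finite float

    float selfDot = dot(v, v);              // the running sum overflows to Infinity
    float cosine = selfDot / ((float) Math.sqrt(selfDot) * (float) Math.sqrt(selfDot));

    System.out.println("dot = " + selfDot + ", cosine = " + cosine);  // Infinity, NaN
  }
}
```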
I may sound like somebody who contradicts others just for the sake of doing so, but I do genuinely believe these kinds of discoveries support the fact that making it configurable is actually a good idea: we are not changing a production system here, we are changing a library. Enabling more users to experiment with higher dimensions increases the probability of finding (and then solving) these sorts of issues. I suspect we are not recommending anywhere here to go to prod with untested and unbenchmarked vector sizes anyway.
> Enabling more users to experiment with higher dimensions increases the probability of finding (and then solving) these sorts of issues.
It also shows that this causes a long-tail of issues:
In addition, if we raise the number of dimensions, people will then start asking for higher precision in calculations, completely forgetting that Lucene is a full-text search engine meant to bring results in milliseconds, not 2 hours. Score calculations introduce rounding anyway, and making them exact is (a) not needed for Lucene (we just sort on those values) and (b) would slow down the whole thing so much.
So keep the current limit and do NOT make it configurable. I agree to raise the maximum to 2048 (while recommending that people use Java 20 for running Lucene and enable incubator vectors).
At the same time, close any issues about calculation precision and, on the other hand, get the JDK people to support half-float calculations.
I think a library should empower a user to discover what works (and doesn't) for them, rather than playing big brother and insisting it knows best that there's no way some high setting could ever work for any user. Right? By making it a system property that does not need to be configured for <= 1024, it should raise a red flag to users that they are venturing into unusual territory, i.e. they've been warned. They'd have to go looking for such a setting and see warnings; it's not something a user would do accidentally either.
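A purely hypothetical sketch of that suggestion (this is not an existing Lucene API; the property name and constant are invented for illustration): an opt-in system property that leaves everyone at or below the current default untouched.

```java
public final class MaxDimensions {
  private static final int DEFAULT_MAX_DIMENSIONS = 1024;   // today's limit stays the default

  public static int resolve() {
    String value = System.getProperty("lucene.hnsw.maxDimensions");
    if (value == null) {
      return DEFAULT_MAX_DIMENSIONS;   // nothing to configure for users at <= 1024 dims
    }
    return Integer.parseInt(value);    // setting this property is the explicit "I've been warned" opt-in
  }

  private MaxDimensions() {}
}
```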
> if we raise the number of dimensions, people will then start asking for higher precision in calculations,
LOL People may ask for whatever they want :-) including using/abusing a system beyond its intended scope. So what? BTW I've thoroughly enjoyed seeing several use cases of my code in Lucene/Solr that I had never considered yet worked really well for a user :-D. Pure joy. Of course not every request makes sense to us. I'd rather endure such than turn users away from Lucene that we can support trivially today.
> In addition, if we raise the number of dimensions, people will then start asking for higher precision in calculations, completely forgetting that Lucene is a full-text search engine meant to bring results in milliseconds, not 2 hours. Score calculations introduce rounding anyway, and making them exact is (a) not needed for Lucene (we just sort on those values) and (b) would slow down the whole thing so much.
> So keep the current limit and do NOT make it configurable. I agree to raise the maximum to 2048 (while recommending that people use Java 20 for running Lucene and enable incubator vectors).
> At the same time, close any issues about calculation precision and, on the other hand, get the JDK people to support half-float calculations.
@uschindler, I am not convinced, but it's fine to have different opinions! I do agree we should improve everything that can be improved and, at the same time, in parallel, give users the flexibility to experiment:
We may have different opinions here and that's fine, but my intent as a committer is to build the best solution for the community rather than the best solution according to my ideas.
You know, if we wanted sub-ms responses all the time we could set a hard limit to 1024 chars per textual field and allow a very low number of fields, but then would Lucene attract any user at all?
I would like to renew the issue in light of the recent integration of the incubating Panama Vector API, as indexing of vectors with it is much faster.
We ran a benchmarking test, and indexing a dataset of vectors of 1536 dims was slightly faster than indexing of 1024 dims. This gives us enough confidence to extend max dims to 2048 (at least when vectorization is enabled).
Dataset: text field embedded with the OpenAI text-embedding-ada-002 model, 1536 dims
maxConn: 16, beamWidthIndex: 100
Apple M1 laptop
> We ran a benchmarking test, and indexing a dataset of vectors of 1536 dims was slightly faster than indexing of 1024 dims. This gives us enough confidence to extend max dims to 2048 (at least when vectorization is enabled).
I found this very strange at first :)
But then I read more closely, and I think what you meant is that indexing 1024 dims without Panama (SIMD vector instructions) is slower than indexing 1536 dims with Panama enabled? Which is really quite impressive.
Do we know what gains we see at search time going from 1024 -> 1536?
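For readers following along, here is a rough sketch of the kind of SIMD dot product the incubating Panama Vector API makes possible; it is not Lucene's actual implementation and requires running with `--add-modules jdk.incubator.vector` on a recent JDK.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class PanamaDotProduct {
  // Picks the widest vector width the CPU supports (e.g. 128 bits on an M1, 512 bits with AVX-512).
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  static float dot(float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    int bound = SPECIES.loopBound(a.length);
    for (; i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc);                 // lane-wise multiply-add: acc += va * vb
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) {
      sum += a[i] * b[i];                    // scalar tail for leftover elements
    }
    return sum;
  }
}
```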
Interestingly, it was only an Apple M1. That one only has a 128-bit vector size and only 2 PUs (the 128 bits is in the spec of the CPU, but Robert told me about the number of PUs; I found no info on that in WikiChip). So I would like to also see the difference on a really cool AVX512 machine with 4 PUs.
So unfortunately the Apple M1 is a bit limited but it is still good enough to outperform the scalar impl. Cool. Now please test on a real Intel Server CPU. 😍
In general I am fine with raising vectors to 2048 dims. But apply that limit only to the HNSW codec. So the check should be not in the field type but in the codec.
@mikemccand Indeed, exactly as said, sorry for being unclear. We have not checked search, will work on that.
@uschindler Thanks, indeed, we need tests on other machines. +1 for raising dims to 2048 in HNSW codec.
I ran @mayya-sharipova's exact same benchmark/test on my machine. Here are the results.
Dataset: text field embedded with the OpenAI text-embedding-ada-002 model, 1536 dims
maxConn: 16, beamWidthIndex: 100
Linux, x86_64, 11th Gen Intel Core i5-11400 @ 2.60GHz, AVX-512
JDK 20.0.1
| Panama (bits) | dims | time (secs) |
|---|---|---|
| No | 1024 | 3136 |
| Yes (512) | 1536 | 2633 |
So the test run with 1536 dims and Panama enabled at AVX 512 was 503 secs (or ~16%) faster than the run with 1024 dims and No Panama.
Full output from the test runs can be seen here: https://gist.github.com/ChrisHegarty/ef008da196624c1a3fe46578ee3a0a6c
Can we run this test with Lucene's defaults (e.g. not a 2GB rambuffer)? We are still talking about an hour to index < 3M docs, so I think the performance is not good. As I've said before, I never thought 1024 was a good situation either. 768 is also excruciating. The purpose of the vectorization is just to alleviate some of the pain. It is like giving the patient an aspirin; it doesn't really fix the problem.
I am extremely curious: what should we consider good performance for indexing < 3M docs? I mean, I agree we should always try to improve things and aim for the stars, but as maintainers of a library, who are we to decide what's acceptable and what's not for the users? Is it because of a comparison with other libraries or solutions? They may have many reasons for being faster (and we should definitely take inspiration). If we look at https://home.apache.org/~mikemccand/lucenebench/indexing.html, we clearly improved the indexing throughput substantially over the years; does this mean that Lucene back in 2011 should not have committed additional features/improvements because for some people (people from the future) "it was slow"?
@rmuir
> Can we run this test with Lucene's defaults (e.g. not a 2GB rambuffer)?
I've done the test and, surprisingly, indexing time decreased substantially. It is almost 2 times faster to index with Lucene's defaults than with a 2GB RAM buffer, at the expense of ending up with a bigger number of segments.
| RamBuffer Size | Indexing time | Num of segments |
|---|---|---|
| 16 MB | 1877 s | 19 |
| 1994 MB | 3141 s | 9 |
Leaving a higher number of segments dodges the merge costs, I think.
This benchmark really only measures the flushing cost, as ConcurrentMergeScheduler is used, so merges run in background threads. So the improvement makes sense to me, as the cost of adding vectors into an HNSW graph increases as the size of the HNSW graph increases. If we want to get a sense of the number of docs per second per core that we support with a 2GB RAM buffer vs. the 16MB default, using SerialMergeScheduler would be a better choice.
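A minimal sketch of that suggested setup (not the actual benchmark code): SerialMergeScheduler runs merges on the indexing thread, so the elapsed indexing time includes merge cost instead of hiding it in background threads.

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;

public class MergeCostAwareConfig {
  public static IndexWriterConfig newConfig(double ramBufferMB) {
    return new IndexWriterConfig()                        // default analyzer is fine for a vector-only benchmark
        .setMergeScheduler(new SerialMergeScheduler())    // instead of the default ConcurrentMergeScheduler
        .setRAMBufferSizeMB(ramBufferMB);                 // e.g. 16 (the default) vs. ~2000
  }
}
```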
The last comment is already a couple of months old, so please let me clarify the status of this initiative. Is there a chance it's going to be merged? Is there any blocker or action item that prevents it from being merged?
The context of my inquiry is that Lucene-based solutions (e.g. OpenSearch) are commonly deployed within enterprises, which makes them good candidates for experimenting with vector search and commercial LLM offerings without deploying and maintaining specialized technologies. The max dimensionality of 1024, however, imposes certain restrictions (similar thoughts are here: https://arxiv.org/abs/2308.14963).
Hi, actually this issue is already resolved, although the DEFAULT did not change (and won't change due to performance risks); see here: https://github.com/apache/lucene/pull/12436 - this PR allows users of Lucene to raise the limit (at least for the HNSW codec) at the codec level.
To implement this (at your own risk), create your own KnnVectorsFormat and let it return a different number from getMaxDimensions(). Then construct your own codec from it and index your data.
You can do this with Lucene 9.8+
OpenSearch, Elasticsearch, and Solr will have custom limits in their code (based on this approach).
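A sketch of that approach, assuming Lucene 9.8: a delegating KnnVectorsFormat that only overrides getMaxDimensions(). The class name and the 4096 limit are illustrative, and the format name must also be registered via SPI (META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat) so that segments written with it can be read back.

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public class HighDimHnswVectorsFormat extends KnnVectorsFormat {
  private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();

  public HighDimHnswVectorsFormat() {
    super("HighDimHnswVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return delegate.fieldsWriter(state);    // delegate the actual HNSW writing
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return delegate.fieldsReader(state);    // delegate the actual HNSW reading
  }

  @Override
  public int getMaxDimensions(String fieldName) {
    return 4096;                            // raise the per-field limit above the 1024 default
  }
}
```

To wire it in, one option is to subclass the default codec, override getKnnVectorsFormatForField to return this format, and set that codec on IndexWriterConfig via setCodec.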
@mayya-sharipova: Should we close this issue or are there any plans to also change the default maximum? I don't think so.
I think we should close it for sure.
The current maximum allowed number of dimensions is equal to 1024. But we see in practice a couple of well-known models that produce vectors with > 1024 dimensions (e.g. mobilenet_v2 uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing max dims to 2048 will satisfy these use cases. I am wondering if anybody has strong objections against this.
Migrated from LUCENE-10471 by Mayya Sharipova (@mayya-sharipova), 6 votes, updated Aug 15 2022 Pull requests: https://github.com/apache/lucene/pull/874