Robert Muir (@rmuir) (migrated from JIRA)
I don't "strongly object" but I question the approach of just raising the limit to satisfy whatever shitty models people come up with. At some point we should have a limit, and people should do dimensionality reduction.
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
I also don't have an objection to increasing it a bit. But along the same lines as Robert's point, it'd be good to think about our decision making process – otherwise we'd be tempted to continuously increase it. I've already heard users requesting 12288 dims (to handle OpenAI DaVinci embeddings).
Two possible approaches I could see:
I feel a bit better about approach 2 because I'm not confident I could come up with a statement about a "reasonable max dimension", especially given the fast-moving research.
Robert Muir (@rmuir) (migrated from JIRA)
I think the major problem is still no Vector API in the java APIs. It changes this entire conversation completely when we think about this limit.
If OpenJDK would release this low-level vector API, or barring that, maybe some way to MR-JAR for it, or barring that, maybe some intrinsics such as SloppyMath.dotProduct and SloppyMath.matrixMultiply, maybe Java wouldn't become the next COBOL.
Stanislav Stolpovskiy (migrated from JIRA)
I don't think there is a trend to increase dimensionality. Only a few models have feature dimensions of more than 2048.
Most modern neural networks (ViT and the whole BERT family) have dimensions less than 1k.
However, there are still many models, like ms-resnet or EfficientNet, that operate in the range from 1k to 2048.
And they are the most common models for image embedding and vector search.
The current limit forces dimensionality reduction for pretty standard shapes.
Michael Sokolov (@msokolov) (migrated from JIRA)
We should not be imposing an arbitrary limit that prevents people with CNNs (image-processing models) from using this feature. It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable - that's a question for users to decide for themselves. Of course we don't know where that place is that we might want to optimize in the future (Rob and I discussed an idea using all-integer math that would suffer from overflow), but still we should not just allow MAX_INT dimensions, I think. To me a limit like 16K makes sense – well beyond any stated use case, but not effectively infinite?
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
@sstolpovskiy
@msokolov Thanks for providing your suggestions. It looks like we clearly see the need for up to 2048 dims for images, so I will be merging the linked PR.
Robert Muir (@rmuir) (migrated from JIRA)
My questions are still unanswered. Please don't merge the PR when there are standing objections!
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
Sorry, maybe I should have provided more explanation.
Robert Muir (@rmuir) (migrated from JIRA)
The problem is that nobody will ever want to reduce the limit in the future. Let's be honest, once we support a limit of N, nobody will want to ever make it smaller because of the potential users who wouldn't be able to use it anymore.
So because this is a "one-way" decision, it needs serious justification, benchmarks, etc. Regardless of how the picture looks, it's definitely not something we should be "rushing" into 9.3.
Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)
Got it, thanks, I will not rush, and will try to provide benchmarks.
Michael Wechner (@michaelwechner) (migrated from JIRA)
Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048 and that one can set a different limit programmatically on the IndexWriter/Reader/Searcher?
Marcus Eagan (@marcussorealheis) (migrated from JIRA)
@michaelwechner You are free to increase the dimension limit as it is a static variable and Lucene is your oyster. However, @erikhatcher has seared in my mind that a long-term fork of Lucene is a bad idea for many reasons.
My argument for increasing the limit of dimensions is not to suggest that there is a better fulcrum in the performance tradeoff, but that more users testing Lucene is good for improving the feature.
OpenAI's DaVinci is one such model, but not the only one.
I've had customers ask for 4096 based on the performance they observe with question answering. I'm waiting on the model and will share when I know. If customers want to introduce rampant numerical errors in their systems, there is little we can do for them. Don't take my word on any of this yet. I need to bring data and complete evidence. I'm asking my customers why they cannot do dimensionality reduction.
Michael Sokolov (@msokolov) (migrated from JIRA)
> Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048 and that one can set a different limit programmatically on the IndexWriter/Reader/Searcher?
I think the idea is to protect ourselves from accidental booboos; this could eventually get exposed in some shared configuration file, and then if somebody passes MAX_INT it could lead to allocating huge buffers somewhere and taking down a service shared by many people/groups? Hypothetical, but it's basically following the principle that we should be strict to help stop people shooting themselves and others in the feet. We may also want to preserve our ability to introduce optimizations that rely on some limits to the size, which would become difficult if usage of larger sizes became entrenched. (We can't so easily take it back once it's out there.) Having said that, I still feel a 16K limit, while allowing for models that are beyond reasonable, wouldn't cause any of these sorts of issues, so that's the number I'm advocating.
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
> It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable - that's a question for users to decide for themselves.
Mike's perspective makes sense to me too. I'd be supportive of increasing the limit to an upper bound. Maybe we could run a test with ~1 million synthetic vectors with the proposed max dimension (~16K) to check there are no failures or unexpected behavior?
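For concreteness, here is a rough sketch (not an agreed-upon benchmark) of what such a synthetic stress test could look like. It assumes a build where the max-dimension check has already been raised; the field name, index path, and sizes are illustrative only.

```java
import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class SyntheticVectorStressTest {
  public static void main(String[] args) throws Exception {
    int dims = 16_384;          // the proposed upper bound under discussion
    int numDocs = 1_000_000;    // ~1M synthetic vectors
    Random random = new Random(42);
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/vector-stress")), new IndexWriterConfig())) {
      for (int i = 0; i < numDocs; i++) {
        float[] vector = new float[dims];
        for (int j = 0; j < dims; j++) {
          vector[j] = random.nextFloat();   // uniformly random synthetic values
        }
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);            // watch for failures or pathological slowdowns
      }
      writer.forceMerge(1);                 // merging exercises HNSW graph building again
    }
  }
}
```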
Robert Muir (@rmuir) (migrated from JIRA)
My main concern is that it can't be undone, as I mentioned. Nobody will be willing to go backwards. It impacts more than the current implementation; it impacts future implementations as well (different algorithms and data structures). If something like 16k dimensions is allowed, it may prevent even simple optimizations (such as 8-bit width). So it's important to be very conservative.
This is why I make a big deal about it, because of the "one-way" nature of the backwards compatibility associated with this change. It seems this is still not yet understood or appreciated.
Historically, users fight against every limit we have in Lucene, so when people complain about this one, it doesn't bother me (especially when it seems related to one or two bad models/bad decisions unrelated to this project). But these limits are important, especially when features are in their infancy; without them, there is less flexibility and you can find yourself easily "locked in" to a particular implementation.
Robert Muir (@rmuir) (migrated from JIRA)
It is also terrible that this issue says 2048 but somehow that already blew up to 16k here.
-1 to 16K. It's unnecessarily large and puts the project at risk in the future. We can debate 2048.
Lots of things happened since Aug, like the arrival of ChatGPT, and people's increased desire to use OpenAI's state of the art embeddings which are of size 1536. Can you at least please increase it to 1536 for now, while you discuss upper limits?
> Lots of things happened since Aug, like the arrival of ChatGPT, and people's increased desire to use OpenAI's state of the art embeddings which are of size 1536. Can you at least please increase it to 1536 for now, while you discuss upper limits?
Actually it is a one-line change (without any guarantees), see https://github.com/apache/lucene/pull/874/files
If you really want to shoot yourself in the foot: download the source code of Lucene in the version you need for your Elasticsearch instance (I assume you're coming from https://github.com/elastic/elasticsearch/issues/92458), patch it with #874, and then run './gradlew distribution'. Copy the JAR files into your ES distribution. Done.
But it is not certain whether this will blow up, or whether indexes created that way will still be readable with standard Lucene.
Why I made that suggestion: if you are interested, try it out with your dataset and your Elasticsearch server and report back! Maybe you will figure out that performance does not work or memory usage is too high.
I'll preface this by saying I am also skeptical that going beyond 1024 makes sense for most use cases, and scaling is a concern. However, amidst the current excitement to try and use OpenAI embeddings, the first cut at choosing a system to store and use those embeddings was Elasticsearch. Then the 1024 limit was run into, and so various folks are looking at other alternatives largely because of this limit.
The use cases tend to be Q/A, summarization, and recommendation systems for WordPress and Tumblr. There are multiple proof-of-concept systems people have built (typically on top of various TypeScript, JavaScript, or Python libs) which use the OpenAI embeddings directly (and give quite impressive results). Even though I am pretty certain that reducing the dimensions will be a better idea for many of these, the ability to build and prototype on higher dimensions would be extremely useful.
@uschindler @rmuir FWIW We are interested in using Lucene's kNN with 1536 dimensions in order to use OpenAI's embeddings API. We benchmarked a patched Lucene/Solr. We fully understand (we measured it :-P) that there is an increase in memory consumption and latency. Sure thing.
We have applications where dev teams have chosen to work with OpenAI embeddings and where the number of records involved and requests per second make the trade offs of memory and latency perfectly acceptable.
There is a great deal of enthusiasm around OpenAI and releasing a working application ASAP. For many of these the resource cost of 1536 dimensions is perfectly acceptable against the alternative of delaying a pilot to optimize further.
Our work would be a lot easier if Lucene's kNN implementation supported 1536 dimensions without need for a patch.
I'm reminded of the great maxBooleanClauses debate. At least that limit is user configurable (for the system deployer; not the end user doing a query) whereas this new one for kNN is not.
I can understand how we got to this point -- limits often start as hard limits. The current limit even seems high based on what has been said. But users have spoken here on a need to configure Lucene for their use case (such as experimentation within a system they are familiar with) and accept the performance consequences. I would like this to be possible with a System property. This hasn't been expressly asked yet, I think. Why should Lucene, just a library that doesn't know what's best for the user, prevent a user from being able to do that?
This isn't an inquiry about why limits exist; of course systems need limits.
Hi @dsmiley I updated the dev discussion on the mailing list: [Proposal] Remove max number of dimensions for KNN vectors
And proceeded with a pragmatic new mail thread, where we just collect proposals with a motivation (no discussion there): Dimensions Limit for KNN vectors - Next Steps
Feel free to participate! My intention is to act relatively fast (and then also operate on the Solr side). It's a train we don't need/want to miss!
The rabbit hole that is trying to store OpenAI embeddings in Elasticsearch eventually leads here. I read the entire thread, and unless I am missing something, the obvious move is to make the limit configurable (up to a point) or, at a minimum, increase the limit to 1536 to support the text-embedding-ada-002 model. In other words, there should be a compelling reason not to increase the limit, beyond the fact that it will be hard to reduce in the future.
Cross posting here because I responded to the PR instead of this issue.
...why is it then that GPT-4, which internally represents each token with a vector of more than 8192, still inaccurately recalls information about entities?
I think this comment actually supports @MarcusSorealheis argument? e.g., What's the point in indexing 8K dimensions if it isn't much better at recall than 768?
If the real issue is the use of HNSW, which isn't suitable for this, rather than whether high-dimensionality embeddings have value, then the solution isn't to not provide the feature, but to switch technologies to something more suitable for the type of applications that people use Lucene for.
I may be wrong but it seems like this is where most of the lucene committers here are settling?
Over a decade ago I wanted a high dimension index for some facial recognition and surveillance applications I was working on. I rejected Lucene at first only because it was written in Java and I personally felt something like C++ was a better fit for the high dimension job (no garbage collection to worry about). So I wrote a high dimension indexer for MongoDB inspired by RTree (for the record, its implementation is based on XTree) and wrote it using C++14 preview features (lambda functions were the new hotness on the block and Java didn't even have them yet). Even in C++ back then SIMD wasn't very well supported by the compiler natively, so I had to add all sorts of compiler tricks to squeeze every ounce of vector parallelization to make it performant. C++ has gotten better since then, but I think Java still lags in this area? Even JEP 426 is a ways off (maybe because OpenJDK is holding these things hostage)? So maybe Java is still not the right fit here? I wonder though, does that mean Lucene shouldn't provide dimensionality higher than an arbitrary 1024? Maybe not. I agree dimensionality reduction techniques like PCA should be considered to reduce the storage volume. The problem with that argument is that dimensionality reduction fails when features are weakly correlated. You can't capture the majority of the signal in the first N components and therefore need higher dimensionality. But does that mean that 1024 is still too low to make Lucene a viable option?
Aside from conjecture does anyone have empirical examples where 1024 is too low and what specific Lucene capabilities (e.g., scoring?) would make adding support for dimensions higher than 1024 really worth considering over using dimensionality reduction? If Lucene doesn't do this does it really risk the project becoming irrelevant? That sounds a bit like sensationalism. Even if higher dimensionality is added to the current vector implementation (I'd actually argue we should explore converting BKD to support higher dimensions instead) are we convinced it will reasonably perform without JEP 426 or better SIMD support that's only available in newer JDKs? Can anyone smart here post their benchmarks to substantiate their claims? I know Pinecone (and others) have blogged about their love for RUST for these kinds of applications. Should Lucene just leave this to the job of alternative Search APIs? Maybe something like Tantivy or Rucene? Or is it time we explore a new optional Lucene Vector module that supports cutting edge JDK features through gradle tooling for optimizing the vector use case?
Interested what others think.
While this is not a critique on Lucene's attempt to utilize SIMD via OpenJDK, or any proposed ideas here, it's challenging to envision Lucene emerging as the leading solution for large-scale vector similarity search. This doesn't necessarily imply whether Lucene should or should not integrate such a feature. However, if one were to suggest that this is a critical issue for Lucene's survival, I would question the likelihood of a Java-based engine, laden with keyword search complexities, rising to the top as a vector search solution, regardless of SIMD integration. I would hesitate to wager on its ability to compete with systems exclusively focused on vector similarity search, equipped with first-rate GPU support, and designed to work with existing and future AI-oriented hardware. None of these systems would be developed in Java, nor would they compete with Lucene in the realm of traditional search.
Though it might be beneficial and convenient for Lucene to accommodate this feature, unless the project undergoes a complete overhaul, its survival will likely hinge on the success or failure of its keyword search and faceting capabilities, along with other related features. It appears to be a significant jump to discard all these features into the 'COBOL pile' due to the integration of embeddings. A more plausible scenario is that they will coexist harmoniously, complementing each other's strengths.
Nice little rewrite ChatGPT did there.
[...] it's challenging to envision Lucene emerging as the leading solution for large-scale vector similarity search [...]
For anyone out there who knows that IR's SOTA is still either BM25 or a combination of BM25 and HNSW similarity search, no it isn't. For most of us, actually, Lucene is ideally positioned to remain the IR leader, especially if BM25+HNSW is supported on the same level as BM25-only use cases. Having Java Vector API BM25 and HNSW implementations certainly won't hurt. But letting Lucene users decide and benchmark their own dense vector size (as all Lucene competitors do) is a must.
And I hereby volunteer to help with either or all of the Java Vector API implementations, BM25+HNSW combination workflow helpers, and HNSW support for arbitrarily-sized vectors. I'm not limiting my help to just design and programming, but also benchmarking, documenting, and whatnot. Who and where should I talk to about that?
As for Rust, it's just not on par with Java as far as concurrency goes. Rust is no one's language of choice to build IR distributed web services.
> Nice little rewrite ChatGPT did there.
❗ ❗ ❗
I also vote for the increase of dimensions. I use GPT Ada2 for embeddings and I have to recompile Lucene just to have the constant value increased. Having a system property for changing it would be nice.
Just adding a perspective, we're evaluating options for indexing vectors and need to support OpenAI's second-generation text embedding that outputs 1536 dimensions.
Raising Lucene's max to something like 2048 makes a lot of sense to me. Things are moving very quickly in the LLM space at the moment. I can see there are concerns about performance and scale, but today people don't choose Lucene to index vectors when they need to leverage GPU performance and have massive amounts of vector data. Anyway, just my two cents. :-)
Here's an example of why making the vector dimensions configurable is a bad idea: #12281. That issue shows that each added dimension makes the floating-point errors larger and sometimes also returns NaN. Do we have tests for when multiplying vectors produces NaN?
Copying and pasting here, just for visibility:
Here's an example of why making the vector dimensions configurable is a bad idea: #12281. That issue shows that each added dimension makes the floating-point errors larger and sometimes also returns NaN. Do we have tests for when multiplying vectors produces NaN?
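An illustrative, self-contained demo of the failure mode referenced above (not a Lucene test, and the numbers are contrived): each per-element product stays finite, but accumulating them across enough dimensions overflows float, and the subsequent normalization then yields NaN.

```java
public class FloatOverflowDemo {
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];   // float accumulator, like a naive similarity loop
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] v = new float[2048];
    java.util.Arrays.fill(v, 1.3e19f);      // each product is ~1.7e38, still a finite float

    float selfDot = dot(v, v);              // the running sum overflows to Infinity
    float cosine = selfDot / ((float) Math.sqrt(selfDot) * (float) Math.sqrt(selfDot));

    System.out.println("dot = " + selfDot + ", cosine = " + cosine);  // Infinity, NaN
  }
}
```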
I may sound like somebody who contradicts others just for the sake of doing so, but I do genuinely believe these kinds of discoveries support the fact that making it configurable is actually a good idea: we are not changing a production system here, we are changing a library. Enabling more users to experiment with higher dimensions increases the probability of finding (and then solving) these sorts of issues. I suspect we are not recommending anywhere here to go to prod with untested and unbenchmarked vector sizes anyway.
> Enabling more users to experiment with higher dimensions increases the probability of finding (and then solving) these sorts of issues.
It also shows that this causes a long-tail of issues:
In addition, if we raise the number of dimensions, people will then start asking for higher precision in calculations, completely forgetting that Lucene is a full-text search engine meant to bring results in milliseconds, not 2 hours. Score calculations introduce rounding anyway, and making them exact is (a) not needed for Lucene (we just sort on those values) and (b) would slow down the whole thing so much.
So keep the current limit and do NOT make it configurable. I agree to raise the maximum to 2048 (while recommending that people use Java 20 for running Lucene and enable incubator vectors).
At the same time, close any issues about calculation precision and, on the other hand, get the JDK people to support half-float calculations.
I think a library should empower a user to discover what works (and doesn't) for them, rather than playing big brother and insisting it knows best that there's no way some high setting could ever work for any user. Right? By making it a system property that does not need to be configured for <= 1024, it should raise a red flag to users that they are venturing into unusual territory, i.e. they've been warned. They'd have to go looking for such a setting and see warnings; it's not something a user would do accidentally either.
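A purely hypothetical sketch of that suggestion (this is not an existing Lucene API; the property name and constant are invented for illustration): an opt-in system property that leaves everyone at or below the current default untouched.

```java
public final class MaxDimensions {
  private static final int DEFAULT_MAX_DIMENSIONS = 1024;   // today's limit stays the default

  public static int resolve() {
    String value = System.getProperty("lucene.hnsw.maxDimensions");
    if (value == null) {
      return DEFAULT_MAX_DIMENSIONS;   // nothing to configure for users at <= 1024 dims
    }
    return Integer.parseInt(value);    // setting this property is the explicit "I've been warned" opt-in
  }

  private MaxDimensions() {}
}
```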
> if we raise the number of dimensions, people will then start asking for higher precision in calculations,
LOL People may ask for whatever they want :-) including using/abusing a system beyond its intended scope. So what? BTW I've thoroughly enjoyed seeing several use cases of my code in Lucene/Solr that I had never considered yet worked really well for a user :-D. Pure joy. Of course not every request makes sense to us. I'd rather endure such than turn users away from Lucene that we can support trivially today.
> In addition, if we raise the number of dimensions, people will then start asking for higher precision in calculations, completely forgetting that Lucene is a full-text search engine meant to bring results in milliseconds, not 2 hours. Score calculations introduce rounding anyway, and making them exact is (a) not needed for Lucene (we just sort on those values) and (b) would slow down the whole thing so much.
> So keep the current limit and do NOT make it configurable. I agree to raise the maximum to 2048 (while recommending that people use Java 20 for running Lucene and enable incubator vectors).
> At the same time, close any issues about calculation precision and, on the other hand, get the JDK people to support half-float calculations.
@uschindler, I am not convinced, but it's fine to have different opinions! I do agree we should improve everything that can be improved and, at the same time, in parallel, give users the flexibility to experiment:
We may have different opinions here and that's fine, but my intent as a committer is to build the best solution for the community rather than the best solution according to my ideas.
You know, if we wanted sub-ms responses all the time we could set a hard limit to 1024 chars per textual field and allow a very low number of fields, but then would Lucene attract any user at all?
I would like to renew the issue in light of the recent integration of the incubating Panama Vector API, as indexing of vectors with it is much faster.
We ran a benchmarking test, and indexing a dataset of vectors of 1536 dims was slightly faster than indexing of 1024 dims. This gives us enough confidence to extend max dims to 2048 (at least when vectorization is enabled).
Dataset: text field embedded with the OpenAI text-embedding-ada-002 model, 1536 dims
maxConn: 16, beamWidthIndex: 100
Apple M1 laptop
> We ran a benchmarking test, and indexing a dataset of vectors of 1536 dims was slightly faster than indexing of 1024 dims. This gives us enough confidence to extend max dims to 2048 (at least when vectorization is enabled).
I found this very strange at first :)
But then I read more closely, and I think what you meant is that indexing 1024 dims without Panama (SIMD vector instructions) is slower than indexing 1536 dims with Panama enabled? Which is really quite impressive.
Do we know what gains we see at search time going from 1024 -> 1536?
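For readers following along, here is a rough sketch of the kind of SIMD dot product the incubating Panama Vector API makes possible; it is not Lucene's actual implementation and requires running with `--add-modules jdk.incubator.vector` on a recent JDK.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class PanamaDotProduct {
  // Picks the widest vector width the CPU supports (e.g. 128 bits on an M1, 512 bits with AVX-512).
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  static float dot(float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    int bound = SPECIES.loopBound(a.length);
    for (; i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc);                 // lane-wise multiply-add: acc += va * vb
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) {
      sum += a[i] * b[i];                    // scalar tail for leftover elements
    }
    return sum;
  }
}
```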
Interestingly, it was only an Apple M1. That one only has a 128-bit vector size and only 2 PUs (the 128 bits is in the spec of the CPU, but Robert told me about the number of PUs; I found no info on that in WikiChip). So I would like to also see the difference on a really cool AVX512 machine with 4 PUs.
So unfortunately the Apple M1 is a bit limited but it is still good enough to outperform the scalar impl. Cool. Now please test on a real Intel Server CPU. 😍
In general I am fine with raising vectors to 2048 dims. But apply that limit only to the HNSW codec. So the check should be not in the field type but in the codec.
@mikemccand Indeed, exactly as said, sorry for being unclear. We have not checked search, will work on that.
@uschindler Thanks, indeed, we need tests on other machines. +1 for raising dims to 2048 in HNSW codec.
I ran @mayya-sharipova's exact same benchmark/test on my machine. Here are the results.
Dataset: text field embedded with the OpenAI text-embedding-ada-002 model, 1536 dims
maxConn: 16, beamWidthIndex: 100
Linux, x86_64, 11th Gen Intel Core i5-11400 @ 2.60GHz, AVX-512
JDK 20.0.1
| Panama (bits) | dims | time (secs) |
|---|---|---|
| No | 1024 | 3136 |
| Yes (512) | 1536 | 2633 |
So the test run with 1536 dims and Panama enabled at AVX 512 was 503 secs (or ~16%) faster than the run with 1024 dims and No Panama.
Full output from the test runs can be seen here: https://gist.github.com/ChrisHegarty/ef008da196624c1a3fe46578ee3a0a6c
Can we run this test with Lucene's defaults (e.g. not a 2GB rambuffer)? We are still talking about an hour to index < 3M docs, so I think the performance is not good. As I've said before, I never thought 1024 was a good situation either. 768 is also excruciating. The purpose of the vectorization is just to alleviate some of the pain. It is like giving the patient an aspirin; it doesn't really fix the problem.
I am extremely curious: what should we consider good performance for indexing < 3M docs? I mean, I agree we should always try to improve things and aim for the stars, but as maintainers of a library, who are we to decide what's acceptable and what's not for the users? Is it because of a comparison with other libraries or solutions? They may have many reasons for being faster (and we should definitely take inspiration). If we look at https://home.apache.org/~mikemccand/lucenebench/indexing.html, we clearly improved the indexing throughput substantially over the years; does this mean that Lucene back in 2011 should not have committed additional features/improvements because for some people (people from the future) "it was slow"?
@rmuir
> Can we run this test with Lucene's defaults (e.g. not a 2GB rambuffer)?
I've done the test and, surprisingly, indexing time decreased substantially. It is almost 2 times faster to index with Lucene's defaults than with a 2GB RAM buffer, at the expense of ending up with a bigger number of segments.
| RamBuffer Size | Indexing time | Num of segments |
|---|---|---|
| 16 MB | 1877 s | 19 |
| 1994 MB | 3141 s | 9 |
Leaving a higher number of segments dodges the merge costs, I think.
This benchmark really only measures the flushing cost, as ConcurrentMergeScheduler is used, so merges run in background threads. So the improvement makes sense to me, as the cost of adding vectors into an HNSW graph increases as the size of the HNSW graph increases. If we want to get a sense of the number of docs per second per core that we support with a 2GB RAM buffer vs. the 16MB default, using SerialMergeScheduler would be a better choice.
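A minimal sketch of that suggested setup (not the actual benchmark code): SerialMergeScheduler runs merges on the indexing thread, so the elapsed indexing time includes merge cost instead of hiding it in background threads.

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;

public class MergeCostAwareConfig {
  public static IndexWriterConfig newConfig(double ramBufferMB) {
    return new IndexWriterConfig()                        // default analyzer is fine for a vector-only benchmark
        .setMergeScheduler(new SerialMergeScheduler())    // instead of the default ConcurrentMergeScheduler
        .setRAMBufferSizeMB(ramBufferMB);                 // e.g. 16 (the default) vs. ~2000
  }
}
```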
The last comment is already a couple of months old, so please let me clarify the status of this initiative. Is there a chance it's going to be merged? Is there any blocker or action item that prevents it from being merged?
The context of my inquiry is that Lucene-based solutions (e.g. OpenSearch) are commonly deployed within enterprises, which makes them good candidates for experimenting with vector search and commercial LLM offerings without deploying and maintaining specialized technologies. The max dimensionality of 1024, however, imposes certain restrictions (similar thoughts are here: https://arxiv.org/abs/2308.14963).
Hi, actually this issue is already resolved, although the DEFAULT did not change (and won't change due to performance risks); see here: https://github.com/apache/lucene/pull/12436 - this PR allows users of Lucene to raise the limit (at least for the HNSW codec) at the codec level.
To implement this (at your own risk), create your own KnnVectorsFormat and let it return a different number from getMaxDimensions(). Then construct your own codec from it and index your data.
You can do this with Lucene 9.8+
OpenSearch, Elasticsearch, and Solr will have custom limits in their code (based on this approach).
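A sketch of that approach, assuming Lucene 9.8: a delegating KnnVectorsFormat that only overrides getMaxDimensions(). The class name and the 4096 limit are illustrative, and the format name must also be registered via SPI (META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat) so that segments written with it can be read back.

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public class HighDimHnswVectorsFormat extends KnnVectorsFormat {
  private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();

  public HighDimHnswVectorsFormat() {
    super("HighDimHnswVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return delegate.fieldsWriter(state);    // delegate the actual HNSW writing
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return delegate.fieldsReader(state);    // delegate the actual HNSW reading
  }

  @Override
  public int getMaxDimensions(String fieldName) {
    return 4096;                            // raise the per-field limit above the 1024 default
  }
}
```

To wire it in, one option is to subclass the default codec, override getKnnVectorsFormatForField to return this format, and set that codec on IndexWriterConfig via setCodec.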
@mayya-sharipova: Should we close this issue or are there any plans to also change the default maximum? I don't think so.
I think we should close it for sure.
The current maximum allowed number of dimensions is equal to 1024. But we see in practice a couple of well-known models that produce vectors with > 1024 dimensions (e.g. mobilenet_v2 uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing max dims to 2048 will satisfy these use cases. I am wondering if anybody has strong objections against this.
Migrated from LUCENE-10471 by Mayya Sharipova (@mayya-sharipova), 6 votes, updated Aug 15 2022 Pull requests: https://github.com/apache/lucene/pull/874