
Optional Buffer Pool to Improve Search Performance [LUCENE-1035] #2111

Closed: asfimport closed this issue 13 years ago

asfimport commented 17 years ago

An index in a RAMDirectory provides better search performance than one in an FSDirectory. But many indexes cannot fit in memory, or applications cannot afford to spend that much memory on an index. On the other hand, because of locality, a reasonably sized buffer pool may provide a good improvement over FSDirectory.

This issue aims at providing such an optional buffer pool layer. In cases where it fits, i.e. where a reasonable hit ratio can be achieved, it should provide a good improvement over FSDirectory.
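For illustration, the core of such a layer can be sketched as a bounded, LRU-evicted pool of fixed-size pages keyed by file name and page index. All names below (BufferPool, PAGE_SIZE, hitRatio) are hypothetical, not the patch's actual API; this is a sketch of the idea, not the implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a bounded, LRU-evicted map of fixed-size pages
// keyed by (file name, page index).
class BufferPool {
  static final int PAGE_SIZE = 1024;    // bytes per cached page (assumed)
  private final int maxPages;           // pool size in bytes / PAGE_SIZE
  private long hits, misses;            // counters for hit-ratio reporting
  private final Map<String, byte[]> pages;

  BufferPool(long poolSizeInBytes) {
    this.maxPages = (int) (poolSizeInBytes / PAGE_SIZE);
    // accessOrder=true makes LinkedHashMap iterate least-recently-used
    // first; removeEldestEntry then evicts the coldest page on overflow.
    this.pages = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxPages;
      }
    };
  }

  private static String key(String fileName, long pageIndex) {
    return fileName + "#" + pageIndex;
  }

  // Returns the cached page, or null on a miss (the caller then reads the
  // page from the underlying FSDirectory and installs it with put()).
  synchronized byte[] get(String fileName, long pageIndex) {
    byte[] page = pages.get(key(fileName, pageIndex));
    if (page == null) { misses++; } else { hits++; }
    return page;
  }

  synchronized void put(String fileName, long pageIndex, byte[] page) {
    pages.put(key(fileName, pageIndex), page);
  }

  synchronized double hitRatio() {
    long total = hits + misses;
    return total == 0 ? 0.0 : (double) hits / total;
  }
}
```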


Migrated from LUCENE-1035 by Ning Li, 1 vote, resolved Jan 27 2011 Attachments: LUCENE-1035.contrib.patch, LUCENE-1035.patch

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

Coding Changes


New classes are localized to the store package, as are most of the other changes.

Unit tests

Performance Results


I ran some experiments using the enwiki dataset. The experiments were run on a dual 2.0GHz Intel Xeon server running Linux. The dataset has about 3.5M documents, and the index built from it is more than 3GB. The only stored field is a unique docid, which is retrieved for each query result. All queries are two-term AND queries generated from the dictionary. The first set of queries returns between 1 and 1,000 results, with an average of 40. The second set returns between 1 and 3,000, with an average of 560. All tests were run warm.

1. Query set with average 40 results

Buffer Pool Size   Hit Ratio   Queries per second
0                  N/A         230
16M                55%         250
32M                63%         282
64M                73%         345
128M               85%         476
256M               95%         672
512M               98%         685

2. Query set with average 560 results

Buffer Pool Size   Hit Ratio   Queries per second
0                  N/A         27
16M                56%         29
32M                70%         37
64M                89%         55
128M               97%         67
256M               98%         71
512M               99%         72

Of course, if the tests are run cold, if the queried portion of the index is significantly larger than the file system cache, or if there is a lot of pre-processing of queries and/or post-processing of results, the speedup will be smaller. But where it applies, i.e. where a reasonable hit ratio can be achieved, it should provide a good improvement.

asfimport commented 17 years ago

robert engels (migrated from JIRA)

I don't think this is any better than the NIOFileCache directory I had already submitted.

It was not really approved, because the community felt that it did not offer much over the standard OS file system cache.

My tests showed it was better, but I think this one would run into the same problem.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Were the tests run using the same set of queries they were warmed with? If so, an interesting benchmark might be to, e.g., start with 200 queries, warm things with the first 100, and use the second 100 for the benchmark. Ideally you'd start with a log of real queries, but those are hard to obtain. Over ten years ago I released a 1M query log from Excite, which I still see people reference in papers, so it must be out there somewhere. It would be better than nothing for these kinds of benchmarks. Or perhaps we can obtain a copy of the more recent AOL query log? Otherwise you've only demonstrated an improvement when queries are frequently repeated. There are better ways to optimize for that, e.g., by caching hit lists, no?
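A minimal sketch of the protocol Doug suggests, with runQuery standing in as a hypothetical placeholder for the application's own search call:

```java
import java.util.List;

// Warm the caches with the first half of a query log, then time only the
// second half, so the benchmark does not simply replay queries it was
// warmed on.
class WarmSplitBenchmark {
  // Placeholder for issuing one search against the index (assumed).
  void runQuery(String q) { /* application-specific */ }

  void benchmark(List<String> queryLog) {
    int half = queryLog.size() / 2;
    for (int i = 0; i < half; i++) runQuery(queryLog.get(i));  // warm only
    long start = System.nanoTime();
    for (int i = half; i < queryLog.size(); i++) runQuery(queryLog.get(i));
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.println((queryLog.size() - half) / secs + " queries/sec");
  }
}
```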

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> I don't think this is any better than the NIOFileCache directory I had already submitted.

Are you referring to #1492? I just read it and yes, it's similar to the MemoryLRUCache part. Hopefully this is more general, not just for NioFile.

> It was not really approved, because the community felt that it did not offer much over the standard OS file system cache.

Well, it shows its value in cases where you can achieve a reasonable hit ratio, right? And it's optional: an application can first test with its query log to measure the hit ratio, then decide whether to use a buffer pool and, if so, how large.
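That sizing experiment could look like the sketch below, reusing the hypothetical BufferPool from earlier; replayQueryLog is a placeholder for running the application's own query log through a searcher backed by the pool.

```java
// Try each candidate pool size against the query log and report the hit
// ratio, so the application can pick the smallest pool that is good enough.
class PoolSizing {
  void replayQueryLog(BufferPool pool) { /* application-specific */ }

  void findSmallestUsefulPool() {
    for (long size : new long[] {16L << 20, 32L << 20, 64L << 20, 128L << 20}) {
      BufferPool pool = new BufferPool(size);
      replayQueryLog(pool);
      System.out.println((size >> 20) + "M -> hit ratio " + pool.hitRatio());
    }
  }
}
```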

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

A couple of random thoughts

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> Were the tests run using the same set of queries they were warmed for?

Yes, the same set of queries was used. The warm-up and the real run are two separate runs, which means the file system cache is warmed, but not the buffer pool.

Yes, it'd be much better if a real query log could be obtained. I'll take a look at the AOL query log. I used to have an intranet query log which had a lot of term locality; that's why I think this could provide a good improvement.

> There are better ways to optimize for that, e.g., by caching hit lists, no?

That's useful, but only for exact query matches. If queries share a lot of terms but don't match exactly, caching hit lists won't help. This is more like caching posting lists, but simpler.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Also, in addition to some kind of protection against the LRU cache being busted by a single query, perhaps the ability to not cover parts of the index (like stored fields) would also help.

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> most lucene usecases store much more than just the document id... that would really affect locality.

In the experiments, I was simulating the (Google) paradigm where you retrieve just the docids and go to document servers for everything else. If stored fields almost always negatively affect locality, we can make the buffer pool sit only on the data/files where we expect good locality (say, posting lists), but not on the others.

> It seems like a simple LRU cache could really be blown out of the water by certain types of queries (retrieve a lot of stored fields, or do an expanding term query) that would force out all previously cached hotspots. Most OS level caching has protection against this (multi-level LRU or whatever). But if our user-level LRU cache fails, we've also messed up the OS level cache, since we've been hiding page hits from it.

That's a good point. We can improve the algorithm while hopefully still keeping it simple and general. This buffer pool is not a fit-all solution, but hopefully it will benefit a number of use cases. That's why I say "optional". :)
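One simple scan-resistant refinement of plain LRU is a two-segment LRU (SLRU): new pages enter a small probationary segment, and only pages hit a second time are promoted to a protected segment, so a single large scan cycles through probation without evicting the hot set. The sketch below is illustrative only (the class name and segment split are assumptions, not the patch's policy):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class SegmentedLruPool {
  private final int probationMax, protectedMax;
  private final LinkedHashMap<String, byte[]> probation =
      new LinkedHashMap<String, byte[]>(16, 0.75f, true);
  private final LinkedHashMap<String, byte[]> protectedSeg =
      new LinkedHashMap<String, byte[]>(16, 0.75f, true);

  SegmentedLruPool(int maxPages) {
    this.probationMax = Math.max(1, maxPages / 4);  // small probation segment
    this.protectedMax = maxPages - probationMax;
  }

  synchronized byte[] get(String key) {
    byte[] page = protectedSeg.get(key);
    if (page != null) return page;          // already in the protected set
    page = probation.remove(key);
    if (page != null) {                     // second touch: promote
      protectedSeg.put(key, page);
      evictOldest(protectedSeg, protectedMax);
    }
    return page;
  }

  synchronized void put(String key, byte[] page) {
    probation.put(key, page);               // first touch: probation only
    evictOldest(probation, probationMax);
  }

  // Both maps use accessOrder=true, so iteration starts at the
  // least-recently-used entry; drop entries until the segment fits.
  private static void evictOldest(LinkedHashMap<String, byte[]> seg, int max) {
    Iterator<Map.Entry<String, byte[]>> it = seg.entrySet().iterator();
    while (seg.size() > max && it.hasNext()) {
      it.next();
      it.remove();
    }
  }
}
```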

> I'd like to see single term queries, "OR" queries, and queries across multiple fields (also a common usecase) that match more documents tested also.

I'll change to "OR" queries and see what happens. The dataset is enwiki with four fields: docid, date (optional), title and body. Most terms are from title and body.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Actually, phrase queries would be really interesting too since they hit the term positions.

asfimport commented 17 years ago

Eks Dev (migrated from JIRA)

did you compare it against MMAP?

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> I'll change to "OR" queries and see what happens.

Query set with average 590K results, retrieving docids for the first 5K

Buffer Pool Size   Hit Ratio   Queries per second
0                  N/A         1.9
16M                53%         1.9
32M                68%         2.0
64M                90%         2.3
128M/256M/512M     99%         2.3

As Yonik pointed out, in the previous "AND" tests the bottleneck is the system calls that move data from the file system cache to userspace. Here in the "OR" tests, far fewer such calls are made, so the speedup is less significant. I wish I could get a real query workload for this dataset.

> Actually, phrase queries would be really interesting too since they hit the term positions.

Phrase queries are rare, and term distribution is highly skewed, according to the following study of the Excite query log: Spink, Amanda and Xu, Jack L. (2000) "Selected results from a large study of Web searching: the Excite study". Information Research, 6(1). Available at: http://InformationR.net/ir/6-1/paper90.html

"4. Phrase Searching: Phrases (terms enclosed by quotation marks) were seldom used (only 1 in 16 queries contained a phrase) but correctly used.

5. Search Terms: Distribution: Jansen, et al. (2000) report the distribution of the frequency of use of terms in queries as highly skewed."

I didn't find a good one on the AOL query log. In any case, this buffer pool is not intended to be general purpose. I mentioned RAMDirectory earlier; this is more like an alternative to RAMDirectory (that's why it's per directory): you want persistent storage for the index, yet it's not so big that RAMDirectory-level search performance is out of reach. In addition, the entire index doesn't have to fit into memory, as long as the most queried part does. Hopefully this benefits a subset of Lucene use cases.

> did you compare it against MMAP?

The index I experimented on didn't fit in memory...

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

>> 4. Phrase Searching: Phrases (terms enclosed by quotation marks) were seldom used (only 1 in 16 queries contained a phrase)

quoted phrases in raw user input may be rare, but that doesn't mean PhraseQueries are as rare ... apps may artificially create a sloppy PhraseQuery containing all of the individual words in the user's raw query string to help identify matches where the input words all appear close together (i may be biased in assuming this is common, since it's something i do a lot of personally)

asfimport commented 17 years ago

Mike Klaas (migrated from JIRA)

> Query set with average 590K results, retrieving docids for the first 5K

That seems like quite a few docs to retrieve--any particular reason why? (It would be good to know whether the speedup is occurring in the query phase or in doc retrieval.) This would also explain why VInt decoding is not the bottleneck (it wouldn't be much used for retrieving stored fields).

I echo Hoss' comment--proximity searching is important even if it isn't used much directly by users.

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> That seems like quite a few docs to retrieve--any particular reason why?

I was guessing most applications won't want all 590K results, no? Lucene is used in so many different ways; there's no single use case that represents them all.

> I echo Hoss' comment--proximity searching is important even if it isn't used much directly by users.

Hmm, I agree with you and Hoss, especially in applications where proximity is used to rank results of OR queries.

asfimport commented 16 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Ning, I didn't mean to sound negative about this. Your benchmarks do show that in some situations this can provide significant speedup. The question is whether such situations are common enough to warrant adding this to the core. A way around that might be to layer it on top of FSDirectory and add it to contrib.

asfimport commented 16 years ago

robert engels (migrated from JIRA)

Again, see my previous code in issue 414. That it only works with NioFile is not really a limitation; it can easily work with any underlying "file". This is just an implementation detail.

This code is already implemented as a layer on top of FS directory, so the caller can decide to use an original FS directory or a caching one.

We actually have a multiplexing directory that (depending on file type and size), either opens the file purely in memory, uses a cached file, or lets the OS do the caching. Works really well.
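A policy like the one Robert describes might route each file by extension and size, as in the hypothetical sketch below. The class name, size threshold, and file extensions (pre-4.0 Lucene postings files) are illustrative assumptions; only Directory.fileLength is the real Lucene store API.

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;

// Route each index file to a caching strategy based on its size and type.
class MultiplexingPolicy {
  enum Strategy { RAM, BUFFER_POOL, OS_CACHE }

  Strategy choose(Directory dir, String name) throws IOException {
    long len = dir.fileLength(name);
    if (len < 1L << 20) {
      return Strategy.RAM;           // small file: open purely in memory
    }
    if (name.endsWith(".frq") || name.endsWith(".prx")) {
      return Strategy.BUFFER_POOL;   // hot postings: use the caching layer
    }
    return Strategy.OS_CACHE;        // e.g. stored fields: let the OS cache
  }
}
```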

asfimport commented 16 years ago

Ning Li (migrated from JIRA)

> The question is whether such situations are common enough to warrant adding this to the core.

Agree.

> A way around that might be to layer it on top of FSDirectory and add it to contrib.

I'd be happy to do that. I'll also include the following in the javadoc, which hopefully is a fair assessment:

"When will a buffer pool help:

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

It looks like this was never fully done. I wonder if this should be closed, esp. since Ning might be working on slightly different problems now.

asfimport commented 16 years ago

Eks Dev (migrated from JIRA)

Robert, you said: "...We actually have a multiplexing directory that (depending on file type and size) either opens the file purely in memory, uses a cached file, or lets the OS do the caching. Works really well..."

Did you create a patch somewhere, or is this your internal work?

I have a case where this could come in very handy. I plan to use MMAP for postings & co... but FSDirectory for stored fields, as they could easily blow up the size... The possibility to select on file type/size would bring the MMAP use case much closer for many users... one Directory implementation that allows users to select a strategy (LRU, FSDirectory, MMAP, RAM, or whatnot) would indeed be perfect.

asfimport commented 16 years ago

Ning Li (migrated from JIRA)

> It looks like this was never fully done. I wonder if this should be closed, esp. since Ning might be working on slightly different problems now.

Sorry for the delay. I'll spend some time later this week or early next week to update and make it a contrib patch.

asfimport commented 16 years ago

Ning Li (migrated from JIRA)

Re-done as a contrib package. Creating a BufferPooledDirectory with your own customized file name filter for readers allows you to decide which files you want to use the caching layer for.
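As an example of the kind of file name filter described above, the sketch below caches only postings-related files (pre-4.0 extensions assumed) and leaves stored fields to the OS cache. How BufferPooledDirectory consumes the filter is the patch's detail; only the filter itself is shown.

```java
import java.io.File;
import java.io.FilenameFilter;

// Accept only the postings-related files, so the caching layer skips
// stored fields (.fdt/.fdx) and other low-locality files.
class PostingsOnlyFilter implements FilenameFilter {
  public boolean accept(File dir, String name) {
    return name.endsWith(".frq")   // term frequencies
        || name.endsWith(".prx")   // term positions
        || name.endsWith(".tis")   // term dictionary
        || name.endsWith(".tii");  // term dictionary index
  }
}
```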

The package includes some tests. I also modified the core tests to run with the caching layer in a private setting, and all tests passed.

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

It seems there was never consensus to carry on with this issue, mostly because the gains are not too clear (e.g., with AND queries this directory may not be such a big win, and I think AND is used a lot more by default today), or because the index is bigger than the FS cache, which I think is also becoming more and more relevant these days: we see a lot more biggish indexes than small ones.

Since there has been no activity in the past 2.5 years, I'm closing it. If at some point someone thinks this is worth reopening, please do.