Closed (by asfimport, 13 years ago)
Michael McCandless (@mikemccand) (migrated from JIRA)
if you're indexing 300 documents a second (possibly all of which are delete+re-add), and querying at a thousand queries a second, how many of these BitVectors are you going to end up making?
Hopefully not much more than a few per second?
We should be careful what we measure to ensure that we're targeting the right use cases.
Requirements calling for zero latency updates (all index changes are always visible) are often in error (i.e. it's usually not a true requirement).
Right, I think testing reopening 100s of times per second isn't all that useful (most apps don't really need to do this).
I think seeing results broken out according to reopen frequency is more helpful.
Seems like almost all apps should be well served by roughly once-per-second reopen resolution on average (with the ability to immediately reopen on demand). The only thing that would seem to need lower latency is when an automated client does an add and then immediately does a query and needs to see it. In that case, that client could specify that it needs an immediate reopen (either during the add or the query).
To protect against accidental or intentional denial-of-service from clients that do the add + immediate query, one could also sync such clients up to the reopen frequency.
This would also provide the clean semantics (like GData) of "once the 'update document' request returns, it's in the index", which I agree makes for very convenient API semantics.
Ie, if your reopen rate is 4x per second (once every 250 msec), then you could hold all incoming add requests until the reopen completes, and only then let those requests return.
So the API can still build the well defined semantics on top of Lucene, even if the reopen is rate limited under the hood.
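A minimal sketch of how that could look, assuming a background thread drives the reopen on a fixed schedule; the class here (RateLimitedUpdater) and its simplified handling of old readers are purely illustrative, not Lucene API:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Illustrative sketch: updateDocument() does not return until the next
// scheduled reopen has made the change searchable.
class RateLimitedUpdater {
  private final IndexWriter writer;
  private volatile IndexReader current;
  private final Object reopenLock = new Object();
  private long reopenGeneration = 0;

  RateLimitedUpdater(IndexWriter writer) throws IOException {
    this.writer = writer;
    this.current = writer.getReader();   // NRT reader
  }

  // Called by a background thread, eg once every 250 msec (4x per second).
  void reopen() throws IOException {
    IndexReader newReader = writer.getReader();
    synchronized (reopenLock) {
      current = newReader;               // (closing/refcounting the old reader is omitted)
      reopenGeneration++;
      reopenLock.notifyAll();            // release update calls waiting for visibility
    }
  }

  // Buffers the update, then blocks until a reopen has happened, so
  // "once this returns, searches can see it" holds.
  void updateDocument(Term id, Document doc) throws IOException, InterruptedException {
    writer.updateDocument(id, doc);
    synchronized (reopenLock) {
      long target = reopenGeneration + 1;
      while (reopenGeneration < target) {
        reopenLock.wait();
      }
    }
  }

  IndexReader currentReader() {
    return current;
  }
}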
Michael McCandless (@mikemccand) (migrated from JIRA)
Lucene NRT makes a clone of the BitVector for every reader that has new deletions. Once this is done, searching is "normal" - it's as if the reader were a disk reader. There's no extra checking of deleted docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.
Ok, so if this is copy-on-write, it's done every time there is a new delete for that segment? If the disk index is optimized, that means it would happen on every update: a clone of the full numDocs-sized BitVector? I'm still a little unsure of how this happens.
Right. Actually is the index optimized in your tests? My current correctness testing (for the "lost deletes") isn't optimized... I'll try optimizing it.
- somebody calls getReader() - they've got all the SegmentReaders for the disk segments, and each of them have BitVectors for deletions.
- IW.update() gets called - the BitVector for the segment which now has a deletion is cloned, and set on a new pooled SegmentReader as its deletedSet
Actually, the IW.updateDocument call merely buffers the Term to be deleted. It does not resolve that term to the corresponding docID until getReader (same as reopen) is called. But it would be better if Lucene did the resolution up front (during the updateDocument call); this is what #3122 will fix. That takes the resolution out of reopen, ie, reopen is no longer resolving all deletes in the FG.
But, yes, the clone happens on the first delete to arrive against a SegmentReader after it had been cloned in the NRT reader.
- maybe IW.update() gets called a bunch more - do these modify the pooled but as-yet-unused SegmentReader? New readers in the pool? What?
Just more buffering right now, but after #3122, it will mark further bits in the already cloned vector. Ie, the clone happens only after getReader has returned a reader using that SegmentReader.
- another call to getReader() comes in, and gets an IndexReader wrapping the pooled SegmentReaders.
Each SegmentReader is cloned, and referenced by the reader returned by getReader. And then the next delete to arrive against those segments will force the bit vector to be cloned.
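To make the copy-on-write described above concrete, here is a rough sketch of the pattern (simplified; the real SegmentReader and IndexWriter pooling code differ in detail):

import org.apache.lucene.util.BitVector;

// Simplified sketch of copy-on-write deletes: the bit vector is cloned only on
// the first delete that arrives after the segment was handed out via getReader().
class PooledSegmentDeletes {
  private BitVector deletedDocs;        // possibly shared with the last NRT reader
  private boolean sharedWithReader;     // true once getReader() has published it

  PooledSegmentDeletes(BitVector deletedDocs) {
    this.deletedDocs = deletedDocs;
  }

  void markDeleted(int docID) {
    if (sharedWithReader) {
      deletedDocs = (BitVector) deletedDocs.clone();  // full numDocs-sized copy
      sharedWithReader = false;                       // later deletes reuse the clone
    }
    deletedDocs.set(docID);
  }

  BitVector publishToReader() {
    sharedWithReader = true;            // next delete must clone before writing
    return deletedDocs;
  }
}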
Michael McCandless (@mikemccand) (migrated from JIRA)
So we re-ran some of our tests last night, commenting out our deleted check to measure its cost in the most extreme case possible: a dead easy query (in that it's only one term), but one which, yes, hits the entire index (a MatchAllDocs query is actually special-cased in our code, and is perfectly fast, so not a good worst case to check). As the index grows above a million documents, Zoie could shave somewhere from 22-28% of its time off by not doing the extra check.
OK, thanks for running that test...
So in the worst case (dead easy query, matching many many docs) Zoie's search slowdown is 22-28%. It's presumably quite a bit less (approaching zero) for hard queries that match few docs. So the search slowdown is app dependent.
I think it'd be possible (though, complex!) to do a hybrid approach. Meaning you use Zoie to get the insanely fast reopen, but, to avoid the search slowdown, in the background you convert the buffered UIDs to the docID bit vector, such that once all conversion is done, you stop checking the int set.
I guess you'd have to throttle the conversion so that in the unnatural (100s per sec) reopen test, with many queries in flight at once, you don't exhaust the heap.
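Roughly, such a hybrid might look like the sketch below, assuming a Zoie-style UID delete set alongside the per-segment bit vector; UidToDocIdMap is a hypothetical reverse mapping, and the throttling itself is left out:

import java.util.Set;
import org.apache.lucene.util.BitVector;

interface UidToDocIdMap {                     // hypothetical: Zoie-style reverse mapping
  int docId(long uid);                        // -1 if the UID is not in this segment
}

// Illustrative sketch: queries consult the fast UID set until a background pass
// has folded those UIDs into the per-segment docID bit vector.
class HybridDeletes {
  private final Set<Long> deletedUids;        // visible instantly, checked per hit
  private volatile BitVector deletedDocs;     // per-segment bit vector
  private volatile boolean resolved = false;

  HybridDeletes(Set<Long> deletedUids, BitVector deletedDocs) {
    this.deletedUids = deletedUids;
    this.deletedDocs = deletedDocs;
  }

  boolean isDeleted(int docID, long uid) {
    if (resolved) {
      return deletedDocs.get(docID);          // cheap path once BG work is done
    }
    return deletedDocs.get(docID) || deletedUids.contains(uid);
  }

  // Run by a (throttled) background thread, so rapid reopens with many queries
  // in flight don't each pin a full-size clone of the bit vector.
  void resolve(UidToDocIdMap map) {
    BitVector copy = (BitVector) deletedDocs.clone();
    for (long uid : deletedUids) {
      int docID = map.docId(uid);
      if (docID >= 0) {
        copy.set(docID);
      }
    }
    deletedDocs = copy;
    resolved = true;
  }
}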
Michael McCandless (@mikemccand) (migrated from JIRA)
The fact that Zoie on the pure indexing case (ie no deletions) was 10X faster than Lucene is very weird - that means something else is up, besides how deletions are carried in RAM. It's entirely possible it's the fact that Lucene doesn't flush the tiny segments to a RAMDir (which #2390 addresses).
Yeah, if you call getReader() a bunch of times per second, each one does a flush(true,true,true), right? Without having #2390, this kills the indexing performance if querying is going on. If no getReader() is being called at all, Zoie is about 10% slower than pure Lucene IndexWriter.add() (that's the cost of doing it in two steps - index into two RAMDirs [so they are hot-swappable] and then writing segments to disk with addIndexesNoOptimize() periodically).
It'll be great if #2390 nets us a 10X improvement in indexing rate. With the improvements to benchmark (#3126), I'm hoping this'll be easy to confirm...
Ahh I see, so with very rare reopens, Zoie's indexing rate is also slower than Lucene's (because of the double buffering).
So the big picture tradeoff here is Zoie has wicked fast reopen times, compared to Lucene, but has slightly slower (10%) indexing rate, and slower searches (22-28% in the worst case), as the tradeoff.
It seems like we need to find the "break even" point. Ie, if you never reopen, then on fixed hardware, Lucene is faster at indexing and searching than Zoie. If you reopen at an insane rate (100s per sec), Zoie is much faster than Lucene on both indexing and searching. But what if you reopen 2x, 1x per second? Once every 2 seconds, etc. At some point the crossover will happen.
Michael McCandless (@mikemccand) (migrated from JIRA)
Due to the bloom filter living on top of the hash set, at least at the scales we're dealing with, we didn't see any change in cost due to the number of deletions (Zoie by default keeps no more than 10k modifications in memory before flushing to disk, so that's the biggest the delSet will get, and we don't yet see more-than-constant scaling at that size).
Bloom filters are nice :)
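(As an aside, the shape of that check is roughly the following; this is a sketch of the idea, not Zoie's actual code, and BloomFilter here stands for any filter with a might-contain style probe.)

import java.util.Set;

// Sketch: the Bloom filter answers "definitely not deleted" for the vast
// majority of docs; only possible hits fall through to the exact hash set.
boolean isDeleted(long uid, BloomFilter bloom, Set<Long> deletedUids) {
  if (!bloom.mightContain(uid)) {
    return false;                        // common case: no hash-set lookup needed
  }
  return deletedUids.contains(uid);      // rare case: confirm against the exact set
}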
But your test is missing a dimension: frequency of reopen. If you reopen once per second, how do Zoie/Lucene compare? Twice per second? Once every 5 seconds? Etc.
Yep, this is true. It's a little more invasive to put this into Zoie, because the reopen time is so fast that there's no pooling, so it would need to be kinda hacked in, or tacked on to the outside. Not rocket science, but not just the change of a parameter.
OK. It's clear Zoie's design is optimized for insanely fast reopen.
LinkedIn doesn't have any hard requirements of having to reopen hundreds of times per second, we're just stressing the system, to see what's going on.
Redline tests are very important, to understand how the system will behave at extremes.
But I think it'd be useful to control which dimension to redline.
EG what I'd love to see is, as a function of reopen rate, the "curve" of QPS vs docs per sec. Ie, if you reopen 1X per second, that consumes some of your machine's resources. What's left can be spent indexing or searching or both, so, it's a curve/line. So we should set up fixed rate indexing, and then redline the QPS to see what's possible, and do this for multiple indexing rates, and for multiple reopen rates.
Then this all becomes a capacity question for apps.
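Concretely, the measurement matrix could look something like this (the specific rates below are illustrative placeholders, not proposed targets):

// Sketch of the benchmark matrix: for each (reopen rate, indexing rate) pair,
// redline the query threads and record the sustained QPS and latency.
double[] reopensPerSec = { 0.2, 0.5, 1.0, 2.0 };
int[] docsPerSec = { 100, 300, 1000 };
for (double reopenRate : reopensPerSec) {
  for (int indexRate : docsPerSec) {
    // 1. one thread adds/updates documents at the fixed indexRate
    // 2. one thread reopens the NRT reader at reopenRate
    // 3. N query threads issue searches as fast as they can (redline)
    // record: sustained QPS, query latency, whether indexRate was actually met
  }
}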
As you can see, nobody's filing a bug here that Lucene NRT is "broken" because it can't handle zero-latency updates.
Right, Zoie is making determined tradeoffs. I would expect that most apps are fine with controlled reopen frequency, ie, they would choose to not lose indexing and searching performance if it means they can "only" reopen, eg, 2X per second.
(Of course we will need to test, with #3126, at what reopen frequency you really eat into your indexing/searching performance, given fixed hardware).
What we did try to make sure was in the system was determinism: avoiding things like not knowing whether an update will be seen because some background addIndexes from another thread hasn't completed, or not knowing how fresh the pooled reader is, that kind of thing.
This kind of determinism can certainly be achieved with NRT, by locking down the IndexWriter, wrapping it up in another class to keep it from being monkeyed with by other threads, tuning exactly how often the reader is reopened, and then dictating to clients that freshness is exactly at or better than that timeout, sure. This kind of user-friendliness is one of Zoie's main points - it provides an indexing system which manages all of this. And certainly for some clients we should add the ability to pool the readers for less real-timeness; that's a good idea.
I agree – having such well defined API semantics ("once updateDoc returns, searches can see it") is wonderful. But I think they can be cleanly built on top of Lucene NRT as it is today, with a pre-determined (reopen) latency.
Of course, if your reopen() time is pretty heavy (lots of FieldCache data / other custom faceting data needs to be loaded for a bunch of fields), then at least for us, even though we don't strictly need zero-latency updates, the more realistic 5-10% degradation in query performance for normal queries is negligible, and we get deterministic zero-latency updates as a consequence.
I think the "large merge just finished" case is the most costly for such apps (which the "merged segment warmer" on IW should take care of)? (Because otherwise the segments are tiny, assuming everything is cutover to "per segment").
This whole discussion reminded me that there's another realtime update case which neither Zoie nor NRT is properly optimized for: the absolutely-zero-deletes case, with a very fast indexing load and a desire for minimal update latency (imagine that you're indexing twitter - no changes, just adds), where you want to provide a totally stream-oriented view of things as they're being added (matching some query, naturally) with sub-second turnaround. A subclass of SegmentReader could be instantiated which doesn't even have a deletedSet, and the deleted check could be removed entirely, speeding things up even further.
I think Lucene could handle this well, if we made an IndexReader impl that directly searches DocumentWriter's RAM buffer. But that's somewhat challenging ;)
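For the zero-deletes case above, the inner postings loop could drop the check entirely when a segment has no deletions. A sketch of that shape (illustrative only: SegmentTermDocs does this filtering internally, so this is not how TermDocs is used from the outside):

import java.io.IOException;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.BitVector;

// Sketch: with an adds-only segment (deletedDocs == null) the hot loop has
// no per-doc deleted check at all; otherwise each hit is filtered.
int nextMatchingDoc(TermDocs td, BitVector deletedDocs) throws IOException {
  if (deletedDocs == null) {
    return td.next() ? td.doc() : -1;   // zero-deletes fast path
  }
  while (td.next()) {
    if (!deletedDocs.get(td.doc())) {
      return td.doc();
    }
  }
  return -1;
}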
Jake Mannix (migrated from JIRA)
OK. It's clear Zoie's design is optimized for insanely fast reopen.
That, and maxing out QPS and indexing rate while keeping query latency degradation to a minimum. From trying to turn off the extra deleted check, the latency overhead on a 5M doc index is the difference between queries taking 12-13ms with the extra check turned on and 10ms without it, and you only really start to notice at the extreme edges (queries hitting all 5 million docs by way of an actual query, not MatchAllDocs), when your performance goes from maybe 100ms to 140-150ms.
EG what I'd love to see is, as a function of reopen rate, the "curve" of QPS vs docs per sec. Ie, if you reopen 1X per second, that consumes some of your machine's resources. What's left can be spent indexing or searching or both, so, it's a curve/line. So we should set up fixed rate indexing, and then redline the QPS to see what's possible, and do this for multiple indexing rates, and for multiple reopen rates.
Yes, that curve would be a very useful benchmark. Now that I think of it, it wouldn't be too hard to just sneak some reader caching into the ZoieSystem with a tunable parameter for how long you hang onto it, so that we could see how much that helps. One of the nice things we can do in Zoie with this kind of index-latency backoff is that, because we have an in-memory two-way mapping of Zoie-specific UID to docId, if we actually have time (in the background, since we're caching these readers now) we can zip through and update the real delete BitVectors on the segments and lose the extra check at query time, only using that extra check if the index-latency time is set below some threshold (determined by how long it takes the system to do this resolution - mapping docId to UID is an array lookup; the reverse is a little slower).
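For readers not familiar with Zoie, the two-way mapping referred to here is roughly the following (a sketch; Zoie's actual classes and memory layout differ):

import java.util.HashMap;
import java.util.Map;

// Sketch of the two-way UID <-> docID mapping: docID -> UID is a plain array
// lookup; UID -> docID goes through a map and is a little slower.
class UidDocIdMapper {
  private final long[] uidByDocId;               // index = docID (dense, per segment)
  private final Map<Long, Integer> docIdByUid;   // reverse direction

  UidDocIdMapper(long[] uidByDocId) {
    this.uidByDocId = uidByDocId;
    this.docIdByUid = new HashMap<Long, Integer>(uidByDocId.length);
    for (int docId = 0; docId < uidByDocId.length; docId++) {
      docIdByUid.put(uidByDocId[docId], docId);
    }
  }

  long uid(int docId) {
    return uidByDocId[docId];                    // array lookup
  }

  int docId(long uid) {                          // slower path
    Integer d = docIdByUid.get(uid);
    return d == null ? -1 : d;
  }
}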
Right, Zoie is making determined tradeoffs. I would expect that most apps are fine with controlled reopen frequency, ie, they would choose to not lose indexing and searching performance if it means they can "only" reopen, eg, 2X per second.
In theory Zoie is making tradeoffs - in practice, at least against what is on trunk, Zoie's just going way faster in both indexing and querying in the redline perf test. I agree that in principle, once #2390 and other improvements and bugs have been worked out of NRT, query performance should be faster, and if Zoie's default BalancedMergePolicy (nee ZoieMergePolicy) is in use for NRT, the indexing performance should be faster too - it's just not quite there yet at this point.
I agree - having such well defined API semantics ("once updateDoc returns, searches can see it") is wonderful. But I think they can be cleanly built on top of Lucene NRT as it is today, with a pre-determined (reopen) latency.
Of course! These api semantics are already built up on top of plain-old Lucene - even without NRT, so I can't imagine how NRT would remove this ability! :)
I think the "large merge just finished" case is the most costly for such apps (which the "merged segment warmer" on IW should take care of)? (Because otherwise the segments are tiny, assuming everything is cutover to "per segment").
Definitely. One thing that Zoie benefited from, from an API standpoint, which might be nice in Lucene now that Java 1.5 is in place, is that the IndexReaderWarmer could replace the raw SegmentReader with a warmed, user-specified subclass of SegmentReader:
public abstract class IndexReaderWarmer<R extends IndexReader> {
  public abstract R warm(IndexReader rawReader);
}
This could replace the reader in the readerPool with the possibly user-overridden subclass of SegmentReader (now that SegmentReader is as public as IndexReader itself), which has now been warmed. For users who like to decorate their readers to keep additional state, instead of using them as keys into separately kept WeakHashMaps, this could be extremely useful (I know the people I talked to at Apple's iTunes store do this, as do bobo and zoie, to name a few examples off the top of my head).
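For example, a warmer could then hand back a decorated reader that already carries the app's per-reader state; FacetedSegmentReader below is purely illustrative, not a real class:

// Illustrative use of the generified warmer: return a decorated, pre-warmed reader.
class FacetWarmer extends IndexReaderWarmer<FacetedSegmentReader> {
  public FacetedSegmentReader warm(IndexReader rawReader) {
    FacetedSegmentReader reader = new FacetedSegmentReader(rawReader);
    reader.loadFacetData();   // build FieldCache / facet structures up front
    return reader;
  }
}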
I think Lucene could handle this well, if we made an IndexReader impl that directly searches DocumentWriter's RAM buffer. But that's somewhat challenging
Jason mentioned this approach in his talk at ApacheCon, but I'm not at all convinced it's necessary - if a single box can handle indexing a couple hundred smallish documents a second (into a RAMDirectory), and could be sped up by using multiple IndexWriters (writing into multiple RAMDirectories in parallel, if you were willing to give up some CPU cores to indexing), and you can search against them without having to do any deduplication / bloom filter check against the disk, then I'd be surprised if searching the pre-indexed RAM buffer would really be much of a speedup compared to just doing it the simple way. But I could be wrong, as I'm not sure how much faster such a search could be.
Michael McCandless (@mikemccand) (migrated from JIRA)
One of the nice things we can do in Zoie with this kind of index-latency backoff is that, because we have an in-memory two-way mapping of Zoie-specific UID to docId, if we actually have time (in the background, since we're caching these readers now) we can zip through and update the real delete BitVectors on the segments and lose the extra check at query time, only using that extra check if the index-latency time is set below some threshold (determined by how long it takes the system to do this resolution - mapping docId to UID is an array lookup; the reverse is a little slower).
Right – I think such a hybrid approach would have the best tradeoffs of all. You'd get insanely fast reopen, and then searching would only take the performance hit until the BG resolution of deleted UID -> Lucene docID completed. Similar to the JRE's BG hotspot compiler.
Right, Zoie is making determined tradeoffs. I would expect that most apps are fine with controlled reopen frequency, ie, they would choose to not lose indexing and searching performance if it means they can "only" reopen, eg, 2X per second.
In theory Zoie is making tradeoffs - in practice, at least against what is on trunk, Zoie's just going way faster in both indexing and querying in the redline perf test. I agree that in principle, once #2390 and other improvements and bugs have been worked out of NRT, query performance should be faster, and if Zoie's default BalancedMergePolicy (nee ZoieMergePolicy) is in use for NRT, the indexing performance should be faster too - it's just not quite there yet at this point.
Well.. unfortunately, we can't conclude much from the current test, besides that Zoie's reopen time is much faster than Lucene's (until/if we add the "reopen frequency" as a dimension, and see those results).
Also the test is rather synthetic, in that most apps don't really need to reopen 100s of times per second. We really should try to test more realistic cases.
One question: where is CPU utilization when you run the Lucene test? Presumably, if you block an incoming query until the reopen completes, and because only one reopen can happen at once, it seems like CPU must not be saturated?
But, I agree, there are a lot of moving parts here still – Zoie has far faster add-only throughput than Lucene (could simply be due to lack of LUCENE-1313), Lucene may have a correctness issue (still can't repro), Lucene has some pending optimizations (#3122), etc.
In #3137 I'm working on a standard benchmark we can use to test improvements to Lucene's NRT; it'll let us assess potential improvements and spot weird problems.
One thing that Zoie benefited from, from an API standpoint, which might be nice in Lucene now that Java 1.5 is in place, is that the IndexReaderWarmer could replace the raw SegmentReader with a warmed, user-specified subclass of SegmentReader:
public abstract class IndexReaderWarmer<R extends IndexReader> { public abstract R warm(IndexReader rawReader); }
This could replace the reader in the readerPool with the possibly user-overridden subclass of SegmentReader (now that SegmentReader is as public as IndexReader itself), which has now been warmed. For users who like to decorate their readers to keep additional state, instead of using them as keys into separately kept WeakHashMaps, this could be extremely useful (I know the people I talked to at Apple's iTunes store do this, as do bobo and zoie, to name a few examples off the top of my head).
This is a good idea, and it's been suggested several times now, including eg notification when segment merging starts/commits, but I think we should take it up in the larger context of how to centralize reader pooling. This pool is just the pool used by IndexWriter when it's in NRT mode; I think IndexReader.open should somehow share the same infrastructure. And maybe #3101 (refactoring IW) is the vehicle for "centralizing" this? Can you carry this suggestion over there?
I think Lucene could handle this well, if we made an IndexReader impl that directly searches DocumentWriter's RAM buffer. But that's somewhat challenging
Jason mentioned this approach in his talk at ApacheCon, but I'm not at all convinced it's necessary - if a single box can handle indexing a couple hundred smallish documents a second (into a RAMDirectory), and could be sped up by using multiple IndexWriters (writing into multiple RAMDirectories in parallel, if you were willing to give up some CPU cores to indexing), and you can search against them without having to do any deduplication / bloom filter check against the disk, then I'd be surprised if searching the pre-indexed RAM buffer would really be much of a speedup compared to just doing it the simple way. But I could be wrong, as I'm not sure how much faster such a search could be.
Right, we should clearly only take such a big step if performance shows it's justified. From the initial results I just posted, it looks like Lucene handles this case very well (ie degradation to QPS is fairly contained), even into an FSDir. I need to retest with #2390.
Michael McCandless (@mikemccand) (migrated from JIRA)
We should test the performance tradeoffs incurred by switching to a transactional data structure (like the proposed paged bit vector), but... my inclination at this point is that it's not a good tradeoff for Lucene NRT to make.
Ie, it'd be making the same tradeoff Zoie now makes – faster reopen time for slower searching, which I don't think makes sense for most apps.
Jason Rutherglen (migrated from JIRA)
Here's a working version of this. The page size is statically configurable by adjusting CONST in PagedBitVector. I set it on the high side because the next thing is to inline the page and doc checking into SegmentTermDocs for benchmarking.
The test is fairly randomized, though I think there's more that can be added.
The pages are saved one by one, either as dgaps or bytes, which means the .del file format has changed. We can probably read the old format and write the new format if this is deployed.
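The per-page encoding decision presumably mirrors the existing dgap-vs-dense logic in BitVector; a rough sketch of the write side (the threshold and format markers here are made up for illustration, not taken from the patch):

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

// Sketch of the per-page write decision: sparse pages as d-gaps (deltas between
// set bits), dense pages as the raw bytes of the page.
void writePage(IndexOutput out, byte[] pageBits, int setBitCount) throws IOException {
  int totalBits = pageBits.length * 8;
  if (setBitCount * 10 < totalBits) {           // illustrative sparseness threshold
    out.writeByte((byte) 1);                    // format marker: d-gap encoded page
    out.writeVInt(setBitCount);
    int last = -1;
    for (int bit = 0; bit < totalBits; bit++) {
      if ((pageBits[bit >> 3] & (1 << (bit & 7))) != 0) {
        out.writeVInt(bit - last);              // delta to the previous set bit
        last = bit;
      }
    }
  } else {
    out.writeByte((byte) 0);                    // format marker: dense page
    out.writeBytes(pageBits, pageBits.length);
  }
}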
Jason Rutherglen (migrated from JIRA)
Inlined into SegmentTermDocs. If there's an issue with the null check on deleted docs, we could go extreme and instantiate specialized instances of SegTD that don't perform the check. I'm not sure what would slow this down, but profiling will let us know what's up.
TestIndexReaderReopen and TestIndexWriterReader pass, so I figure we're ready for benchmarking.
Michael McCandless (@mikemccand) (migrated from JIRA)
Jake, have you guys had a chance to re-run your tests across varying reopen rates? Are you still hitting OOM / file handle leaks with straight Lucene NRT? I've been unable to reproduce these issues in my stress testing.... so I'd like to hone in on what's different in our testing.
John Wang (migrated from JIRA)
Yes, we still see the issue. After 20+ minutes of the performance/stress test, latency spiked from 5ms to 550ms, and the file handle leakage was severe enough that the test crashed. This is the code:
Our logging indicates there are at most 3 index reader instances in the open state. Yet the file handle count is very high.
Michael McCandless (@mikemccand) (migrated from JIRA)
Yes, we still see the issue
OK I'll open a separate issue to try to get to the bottom of this...
Michael McCandless (@mikemccand) (migrated from JIRA)
OK, I spun off #3196.
Michael McCandless (@mikemccand) (migrated from JIRA)
John, what about memory exhaustion? Are you still hitting that as well?
Jason Rutherglen (migrated from JIRA)
Won't be working on these and they're old
SegmentReader currently uses a BitVector to represent deleted docs. When performing rapid clone (see LUCENE-1314) and delete operations, the copy-on-write of the BitVector can become costly because the entire underlying byte array must be created and copied. A way to make this clone/delete process faster is to implement tombstones, a term coined by Marvin Humphrey. Tombstones represent new deletions, plus the incremental deletions from previously reopened readers, in the current reader.
The proposed implementation of tombstones is to accumulate deletions into an int array represented as a DocIdSet. With #2550, SegmentTermDocs iterates over deleted docs using a DocIdSet rather than accessing the BitVector by calling get. This allows a BitVector and a set of tombstones to be combined into the current reader's deleted docs.
A tombstone merge policy needs to be defined to determine when to merge tombstone DocIdSets into a new deleted-docs BitVector, as too many tombstones would eventually be detrimental to performance. A probable implementation will merge tombstones based on the number of tombstones and the total number of documents in them. The merge policy may be set in the clone/reopen methods or on the IndexReader.
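A sketch of how the combined deleted-docs view could behave at search time (illustrative only; the actual patch drives this through DocIdSet iterators inside SegmentTermDocs rather than per-doc calls):

import java.util.Arrays;
import java.util.List;
import org.apache.lucene.util.BitVector;

// Illustrative sketch: a doc is deleted if it is set in the segment's base
// BitVector or present in any not-yet-merged tombstone set.
class TombstonedDeletes {
  private final BitVector baseDeletes;     // deletes as of the last full flush
  private final List<int[]> tombstones;    // sorted docIDs from newer deletions
  // (the merge policy folds tombstones back into a new BitVector once they grow)

  TombstonedDeletes(BitVector baseDeletes, List<int[]> tombstones) {
    this.baseDeletes = baseDeletes;
    this.tombstones = tombstones;
  }

  boolean isDeleted(int docID) {
    if (baseDeletes != null && baseDeletes.get(docID)) {
      return true;
    }
    for (int[] ts : tombstones) {
      if (Arrays.binarySearch(ts, docID) >= 0) {
        return true;
      }
    }
    return false;
  }
}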
Migrated from LUCENE-1526 by Jason Rutherglen, resolved Jan 24 2011 Attachments: LUCENE-1526.patch (versions: 2)