
For near real-time search, use paged copy-on-write BitVector impl [LUCENE-1526] #2600

asfimport closed this 13 years ago

asfimport commented 15 years ago

SegmentReader currently uses a BitVector to represent deleted docs. When performing rapid clone (see LUCENE-1314) and delete operations, the copy-on-write of the BitVector can become costly because the entire underlying byte array must be created and copied. One way to make this clone-and-delete process faster is to implement tombstones, a term coined by Marvin Humphrey. Tombstones represent new deletions plus the incremental deletions from previously reopened readers in the current reader.

The proposed implementation of tombstones is to accumulate deletions into an int array represented as a DocIdSet. With #2550, SegmentTermDocs iterates over deleted docs using a DocIdSet rather than accessing the BitVector by calling get. This allows a BitVector and a set of tombstones to be ANDed together as the current reader's delete docs.
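Concretely, the combined check might look like the following minimal sketch, with java.util.BitSet standing in for Lucene's BitVector and invented names throughout (illustrative only, not the attached patch):

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: java.util.BitSet stands in for Lucene's BitVector,
// and the tombstones are the new deletions accumulated since the last merge.
class TombstonedDeletes {
  private final BitSet merged;      // deletions already merged into the segment
  private final int[] tombstones;   // newly accumulated deletions, kept sorted

  TombstonedDeletes(BitSet merged, int[] tombstones) {
    this.merged = merged;
    this.tombstones = tombstones;
  }

  // A doc is deleted if either structure says so.
  boolean isDeleted(int docId) {
    return merged.get(docId) || Arrays.binarySearch(tombstones, docId) >= 0;
  }
}
```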

A tombstone merge policy needs to be defined to determine when to merge tombstone DocIdSets into a new deleted docs BitVector as too many tombstones would eventually be detrimental to performance. A probable implementation will merge tombstones based on the number of tombstones and the total number of documents in the tombstones. The merge policy may be set in the clone/reopen methods or on the IndexReader.


Migrated from LUCENE-1526 by Jason Rutherglen, resolved Jan 24 2011.
Attachments: LUCENE-1526.patch (versions: 2)

asfimport commented 15 years ago

Marvin Humphrey (migrated from JIRA)

> A tombstone merge policy needs to be defined to determine when to merge tombstone DocIdSets into a new deleted docs BitVector as too many tombstones would eventually be detrimental to performance.

For Lucene, I think the SegmentReader should lazily create an internal structure to hold the deleted doc IDs on the first search. It would either be an integer array (if there are few deletions) or a BitVector (if there are many), and it would be created by recording the output of a priority queue merging multiple tombstone streams.

Subsequent calls would not require the priority queue, but would use an iterator wrapper around the shared int array / BitVector.
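A toy sketch of the priority-queue merge Marvin describes, assuming each tombstone stream is a sorted int[] (illustrative code, not Marvin's or the patch's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge: each int[] is one sorted tombstone stream.
class TombstoneMerger {
  static int[] merge(List<int[]> streams) {
    // Queue entries are {value, streamIndex, offsetWithinStream}.
    PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    for (int i = 0; i < streams.size(); i++) {
      if (streams.get(i).length > 0) {
        pq.add(new int[] {streams.get(i)[0], i, 0});
      }
    }
    List<Integer> out = new ArrayList<>();
    int last = -1;
    while (!pq.isEmpty()) {
      int[] head = pq.poll();
      if (head[0] != last) {        // drop duplicate doc IDs across streams
        out.add(head[0]);
        last = head[0];
      }
      int[] stream = streams.get(head[1]);
      int next = head[2] + 1;
      if (next < stream.length) {
        pq.add(new int[] {stream[next], head[1], next});
      }
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
  }
}
```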

For Lucy/KS, if we are to stick with the "cheap IndexReader" model, we'll want to keep using the priority queue. Thus, it will be important to keep the deletion rate for the whole index down, in order to minimize priority queue iteration costs. One possibility is to have the default merge policy automatically consolidate any segment as soon as its deletion rate climbs over 10%. That's pretty aggressive, but it keeps the search-time deletions iteration costs reasonably low, and it's in the spirit of "few writes, many reads".

> A probable implementation will merge tombstones based on the number of tombstones and the total number of documents in the tombstones. The merge policy may be set in the clone/reopen methods or on the IndexReader.

Would it make sense to realize a new integer array / BitVector on the first search after any new tombstone rows are added which affect the segment in question?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> For Lucene, I think the SegmentReader should lazily create an internal structure to hold the deleted doc IDs on the first search.

This is basically doing the copy-on-write, which for realtime search we're wanting to avoid. But as long as this is a sparse structure (sorted list of deleted docIDs, assuming not many deletes accumulate in RAM) it should be OK.

I also think for Lucene we could leave the index format unchanged (which means commit() is still more costly than it need be, but I'm not sure that's too serious), and use tombstones/list-of-sorted-docIDs representation only in RAM.

For realtime search, I think we can accept some slowdown of search performance in exchange for very low latency turnaround when adding/deleting docs.

But I think these decisions (the approach we take here) are very much dependent on what we learn from the performance tests from #2550.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also, an in-RAM binary tree representation can be nicely transactional (there was a recent thread on java-dev about this), so that on clone we would make a very low cost clone of the deleted docs which we could then change w/o affecting the original. Ie the 'copy on write' is still done, but it's not all done up front – each insert would copy only the nodes it needs, so the cost is amortized.
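The path-copying idea might look like this minimal sketch (an unbalanced persistent tree for brevity; a real implementation would want a balanced tree, and none of this is from the patch):

```java
// Illustrative path-copying tree: insert() returns a new root that shares
// all untouched subtrees, so holders of the old root keep their snapshot.
final class PersistentIntTree {
  final int key;
  final PersistentIntTree left, right;

  PersistentIntTree(int key, PersistentIntTree left, PersistentIntTree right) {
    this.key = key;
    this.left = left;
    this.right = right;
  }

  static PersistentIntTree insert(PersistentIntTree t, int key) {
    if (t == null) return new PersistentIntTree(key, null, null);
    if (key < t.key) return new PersistentIntTree(t.key, insert(t.left, key), t.right);
    if (key > t.key) return new PersistentIntTree(t.key, t.left, insert(t.right, key));
    return t; // already present: share the whole tree
  }

  static boolean contains(PersistentIntTree t, int key) {
    while (t != null) {
      if (key == t.key) return true;
      t = key < t.key ? t.left : t.right;
    }
    return false;
  }
}
```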

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In addition to deletions, we also eventually need an incremental-copy-on-write data structure for norms. Full copy-on-write for norms (which #2391 will enable) is even more costly than full copy-on-write for deletions since each norm is 1 byte vs 1 bit for deletions.

Though it seems less common that people set norms, so it's probably lower priority.

But I can still see "real-time norm changing" as being quite useful, eg if you want to bias normal relevance sorting to mix in some measure of popularity (eg recording how often each document is clicked), it'd be good to have real-time norm updating for that.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

> we also eventually need an incremental-copy-on-write data structure for norms.

I'd rather do realtime incremental column stride fields in general with norms being a specific case.

> if you want to bias normal relevance sorting to mixin some measure

This seems to fit in with allowing alternative search algorithms to TF/IDF.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

> For realtime search, I think we can accept some slowdown of search performance in exchange for very low latency turnaround when adding/deleting docs.

I'm still creating performance tests; however, the system simply uses skipTo to obtain the next deleted doc, and the next deleted doc is cached in SegmentTermDocs. So there shouldn't be a slowdown?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I'd rather do realtime incremental column stride fields in general with norms being a specific case.

That would be even better!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I'm still creating performance tests; however, the system simply uses skipTo to obtain the next deleted doc, and the next deleted doc is cached in SegmentTermDocs. So there shouldn't be a slowdown?

I was actually thinking more generally that this is the natural tradeoff one makes with realtime search. EG flushing a new segment every document or two will necessarily give worse indexing throughput than bulk indexing w/ large RAM buffer, but gives much faster turnaround on searching.

But it sounds like you're talking specifically about performance of switching to iterator access to deleted docs. I think the larger number of [harder-for-cpu-to-predict] if statements may be the cause of the slowdown once %tg deletes gets high enough? Also, if the underlying skipTo is a linear scan over a non-sparse representation then that's further cost.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I don't think we should block 2.9 for this.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

This issue has somewhat changed course from relying on DocIdSet to having a BitVector implementation that is backed by a byte[][] instead of a byte[]. This will allow modification of parts of the array without allocating the entire array for clone/copy-on-write. This saves on byte[] allocation for large segments.

We'll also need to benchmark vs. BitVector byte[].
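Schematically, the byte[][]-backed idea might look like this sketch (invented names and a hard-coded page size, not the attached patch): a clone shares every page, and a write copies only the single 1 KB page it touches.

```java
import java.util.Arrays;

// Illustrative paged copy-on-write bit vector backed by byte[][].
class PagedBitVector {
  private static final int PAGE_BYTES = 1024;   // 8 kbits per page
  private final byte[][] pages;
  private final boolean[] owned;                // may we mutate this page in place?

  PagedBitVector(int numBits) {
    int numPages = (((numBits + 7) / 8) + PAGE_BYTES - 1) / PAGE_BYTES;
    pages = new byte[numPages][PAGE_BYTES];
    owned = new boolean[numPages];
    Arrays.fill(owned, true);
  }

  private PagedBitVector(byte[][] pages, boolean[] owned) {
    this.pages = pages;
    this.owned = owned;
  }

  boolean get(int bit) {
    int byteIdx = bit >>> 3;
    return (pages[byteIdx / PAGE_BYTES][byteIdx % PAGE_BYTES] & (1 << (bit & 7))) != 0;
  }

  void set(int bit) {
    int byteIdx = bit >>> 3;
    int page = byteIdx / PAGE_BYTES;
    if (!owned[page]) {            // copy-on-write: duplicate only this page
      pages[page] = pages[page].clone();
      owned[page] = true;
    }
    pages[page][byteIdx % PAGE_BYTES] |= (byte) (1 << (bit & 7));
  }

  // Both sides give up page ownership, so later writes on either side copy
  // the page first and cannot leak into the other instance.
  PagedBitVector cowClone() {
    Arrays.fill(owned, false);
    return new PagedBitVector(pages.clone(), new boolean[pages.length]);
  }
}
```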

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

This moves us toward more efficient memory usage in heavy near real-time search apps, because we don't have to reallocate an entire byte[] equal to the maxDoc of the segment(s) that have new deletes.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This looks like good progress! So it's a copy-on-write BitVector, paged by a fixed page size (8 kbits, by default).

One danger of MultiBitVector is if one does a deletion against an already cloned instance, right? Because each instance only tracks refs back to the instance it was cloned from, not forward refs of other instances that have cloned it? So a clone of myself would incorrectly see changes that I make.

That said, Lucene's internal use shouldn't ever do that – when we clone the deleted docs, the previous instance should never be touched again (the "write lock" moves to the new clone, just like when reopening a reader that's holding the write lock). Can you add assertions into MultiBitVector to verify this? And explain in javadocs that once you clone it, it's frozen.

Also, I don't think we should force SegmentReader to always use MBV, unless we're sure the perf hit is negligible? Can we somehow conditionalize that?

What remains here? Ie what tests fail & why? (Or, why isn't it committable?). If you can get it to a committable state, I can run some perf tests...

In #2532, the new flex API uses a simple interface (called "Bits") to represent docs that should be skipped, and when you ask for the DocsEnum, you pass in your "Bits skipDocs". This will be important for #2610, but also important for this issue because it'll make swapping in different Bits impls easy.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Changing summary to match evolution of this issue...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another approach we might take here is to track new deletions using a sorted tree. New deletions insert in O(log(N)) time, and then we can efficiently expose a DocIdSetIterator from that.

Then, every search would have this (if present) folded in as a "not" clause/filter.

Only near real-time readers would have this.

And then, somehow, periodically that tree would have to be merged into the real deleted docs if it ever got too big, or it was time to commit.

This is a biggish change because suddenly IndexSearcher must be aware that a given SegmentReader is carrying near-real-time deletions, and build a BooleanQuery each time. Whereas MultiBitVector nicely keeps all this under-the-hood.
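A tiny sketch of that approach, using java.util.TreeSet as the sorted tree (illustrative only; TreeDeletes and its methods are invented names):

```java
import java.util.Iterator;
import java.util.TreeSet;

// Illustrative: a sorted tree of new deletions with O(log N) insert,
// exposed via in-order iteration the way a DocIdSetIterator would be.
class TreeDeletes {
  private final TreeSet<Integer> deleted = new TreeSet<>();

  void delete(int docId) {
    deleted.add(docId);
  }

  // Deleted doc IDs in increasing order, for folding in as a "not" filter.
  Iterator<Integer> iterator() {
    return deleted.iterator();
  }

  // advance()-style access: smallest deleted docID >= target, or -1 if none.
  int advance(int target) {
    Integer next = deleted.ceiling(target);
    return next == null ? -1 : next;
  }
}
```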

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

> Another approach we might take here is to track new deletions using a sorted tree. New deletions insert in O(log(N)) time, and then we can efficiently expose a DocIdSetIterator from that.

We did this in Zoie for a while, and it turned out to be a bottleneck - not as much of a bottleneck as continually cloning a bitvector (that was even worse), but still not good. We currently use a bloomfilter on top of an openintset, which performs pretty fantastically: constant-time adds and even-faster constant-time contains() checks, with small size (necessary for the new Reader per query scenario since this requires lots of deep-cloning of this structure).

Just a note from our experience over in zoie-land. It also helped to not produce a docIdset iterator using these bits, but instead override TermDocs to be returned on the disk reader, and keep track of it directly there.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> We did this in Zoie for a while, and it turned out to be a bottleneck - not as much of a bottleneck as continually cloning a bitvector (that was even worse), but still not good. We currently use a bloomfilter on top of an openintset, which performs pretty fantastically: constant-time adds and even-faster constant-time contains() checks, with small size (necessary for the new Reader per query scenario since this requires lots of deep-cloning of this structure).

Good, real-world feedback – thanks! This sounds like a compelling approach.

So the SegmentReader still had its full BitVector, but your OpenIntSet (what exactly is that?) + the bloom filter is then also checked when you enum the TermDocs? It's impressive this is fast enough... do you expect this approach to be faster than the paged "copy on write" bit vector approach?

> It also helped to not produce a docIdset iterator using these bits, but instead override TermDocs to be returned on the disk reader, and keep track of it directly there.

The flex API should make this possible, without overriding TermDocs (just expose the Bits interface).

asfimport commented 14 years ago

John Wang (migrated from JIRA)

The issue of not using a BitSet/BitVector is not simply performance, but also memory cost. In the case where deletes are sparse (expectedly so) and the index is large, BitSet/BitVector is not a good representation of a DocSet. For deleted checks, it is usually of this pattern:

Iterate thru my docs, for each doc i isDeletedCheck(doc i)

So the cost is not really iterating the deleted set per se, it is the check.

We use bloomfilter to filter out negatives (mostly the case) and back it up with the underlying docset (normally an instance of openhashset) for positives.

Code is at:

http://code.google.com/p/zoie/source/browse/trunk/java/proj/zoie/api/impl/util/IntSetAccelerator.java

Feel free to play with it and run some perf numbers on your data.
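The shape of that check, as a from-scratch sketch (not the actual IntSetAccelerator; the filter size and hash seeds here are arbitrary choices): a small bloom filter answers most "not deleted" probes without touching the backing set, and only "maybe" falls through to the exact set.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative bloom-filter-fronted int set, not the Zoie code.
class BloomAcceleratedIntSet {
  private static final int[] SEEDS = {0x9E3779B9, 0x85EBCA6B};
  private final long[] bloom = new long[1024];  // 1024 * 64 = 65536 bits
  private final Set<Integer> exact = new HashSet<>();

  private static int bit(int value, int seed) {
    int h = value * seed;
    return (h ^ (h >>> 16)) & 0xFFFF;           // index into the 65536 bits
  }

  void add(int value) {
    for (int seed : SEEDS) {
      int b = bit(value, seed);
      bloom[b >>> 6] |= 1L << (b & 63);
    }
    exact.add(value);
  }

  boolean contains(int value) {
    for (int seed : SEEDS) {
      int b = bit(value, seed);
      if ((bloom[b >>> 6] & (1L << (b & 63))) == 0) {
        return false;                           // definite miss, no set probe
      }
    }
    return exact.contains(value);               // filter said "maybe": confirm
  }
}
```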

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> The issue of not using a BitSet/BitVector is not simply performance, but also memory cost.

But don't you cutover all future searches to the newest NRT reader right away? Ie, those large bit vectors are all transient, so net mem cost at any given time should be well contained?

Or... do you keep many readers (against different commit points) around for a longish time (longer than just allowing for the in-flight searches to complete)?

> In the case where deletes are sparse (expectedly so) and the index is large, BitSet/BitVector is not a good representation of a DocSet.

I agree... it's done so "contains" is fast. But we had looked into "being sparse" (and using an iterator to "and not" the deleted docs, at the TermDocs level) and the performance was quite a bit worse...

> For deleted checks, it is usually of this pattern: Iterate thru my docs, for each doc i isDeletedCheck(doc i)

> So the cost is not really iterating the deleted set per se, it is the check.

Right.

> We use bloomfilter to filter out negatives (mostly the case) and back it up with the underlying docset (normally an instance of openhashset) for positives.

This makes sense, and it's a nice solution.

But how do you make this transactional? Do you just make a full clone of IntSetAccelerator (& the underlying int set) on every reopen?

Also, what policy/approach do you use to periodically fold the deletes back into the SegmentReader? As you accumulate more and more deletes in the IntSet, cloning becomes more costly.

> Code is at:

> http://code.google.com/p/zoie/source/browse/trunk/java/proj/zoie/api/impl/util/IntSetAccelerator.java

> Feel free to play with it and run some perf numbers on your data.

Thanks!

asfimport commented 14 years ago

John Wang (migrated from JIRA)

We do not hold the deleted set for a long period of time. I agree the memory cost is not a "killer" but it is tremendously wasteful, e.g. 10 M doc index, you have say 2 docs deleted, 0 and 9999999, you are representing it with 5M of memory (where you could have represented it with 2 ints, 8 bytes). Sure it is an extreme case, but if you look at the avg number of deleted docs vs index size, it is usually sparse. Hence we avoided this approach.

We looked at trade-offs between this vs. our approach; for us, it was a worthwhile trade-off.

We do not accumulate deletes. Deletes/updates are tracked per batch, and special TermDocs are returned to skip over the deleted/mod set for the given reader.

We simply call delete on the diskreader for each batch and internal readers are refreshed with deletes loaded in background.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> We do not hold the deleted set for a long period of time. I agree the memory cost is not a "killer" but it is tremendously wasteful, e.g. 10 M doc index, you have say 2 docs deleted, 0 and 9999999, you are representing it with 5M of memory (where you could have represented it with 2 ints, 8 bytes). Sure it is an extreme case, but if you look at the avg number of deleted docs vs index size, it is usually sparse. Hence we avoided this approach.

Actually a 10 M doc index would be a 1.25 MB BitVector, right?

But, I agree it's wasteful of space when deletes are so sparse... though it is fast.

So are you using this, only, as your deleted docs? Ie you don't store the deletions with Lucene? I'm getting confused if this is only for the NRT case, or, in general.

Have you compared performance of this vs straight lookup in BitVector?

> We do not accumulate deletes. Deletes/updates are tracked per batch, and special TermDocs are returned to skip over the deleted/mod set for the given reader.

OK, I think I'm catching up here... so you only open a new reader at the batch boundary right? Ie, a batch update (all its adds & deletes) is atomic from the readers standpoint?

> We simply call delete on the diskreader for each batch and internal readers are refreshed with deletes loaded in background.

OK so a batch is quickly reopened, using bloom filter + int set for fast "contains" check for the deletions that occurred during that batch (and, custom TermDocs that does the "and not deleted"). This gets you your fast turnaround and decent search performance.

In the BG the deletes are applied to Lucene "for real".

This is similar to Marvin's original tombstone deletions, in that the deletions against old segments are stored with the new segment, rather than aggressively pushed to the old segment's bit vectors.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Alas, I'm confused again. If your reader searching a given batch is holding deletions against past segments, how can you make a TermDocs that filters them? (The TermDocs only enumerates the current batch's docs.)

Or: do you make the IntSetAccelerator for each past segment that received deletions in the current batch?

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

> But, I agree it's wasteful of space when deletes are so sparse... though it is fast.

It's fast for random access, but it's really slow if you need to make a lot of these (either during heavy indexing if copy-on-write, or during heavy query load if copy-on-reopen).

> So are you using this, only, as your deleted docs? Ie you don't store the deletions with Lucene? I'm getting confused if this is only for the NRT case, or, in general.

These are only to augment the deleted docs of the disk reader - the disk reader isn't reopened at all except infrequently - once a batch (a big enough RAMDirectory is filled, or enough time goes by, depending on configuration) is ready to be flushed to disk, diskReader.addIndexes is called and when the diskReader is reopened, the deletes live in the normal diskReader's delete set. Before this time is ready, when there is a batch in ram that hasn't been flushed, the IntSetAccelerator is applied to the not-reopened diskReader. It's a copy-on-read ThreadLocal.

So I'm not sure if that described it correctly: only the deletes which should have been applied to the diskReader are treated separately - those are basically batched: for T amount of time or D amount of docs (configurable) whichever comes first, they are applied to the diskReader, which knows about Lucene's regular deletions and now these new ones as well. Once the memory is flushed to disk, the in-memory delSet is emptied, and applied to the diskReader using regular apis before reopening.

> OK, I think I'm catching up here... so you only open a new reader at the batch boundary right? Ie, a batch update (all its adds & deletes) is atomic from the readers standpoint?

Yes - disk reader, you mean, right? This is only reopened at batch boundary.

> OK so a batch is quickly reopened, using bloom filter + int set for fast "contains" check for the deletions that occurred during that batch (and, custom TermDocs that does the "and not deleted"). This gets you your fast turnaround and decent search performance.

The reopening isn't that quick, but it's in the background, or are you talking about the RAMDirectory? Yeah, that is reopened per query (if necessary - if there are no changes, of course no reopen), but it is kept very small (10k docs or less, for example). It's actually pretty fantastic performance - check out the zoie perf pages: http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

asfimport commented 14 years ago

John Wang (migrated from JIRA)

Michael:

I think I confused you by not giving you enough background information.

IntSetAccelerator holds UIDs, not docIDs. For each segment, the TermDocs iterates its docIDs; we map each docID to its corresponding UID and then check the set.

Hopefully this clears it up.

-John
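In code, the check John describes might look like this sketch, where uidOf and deletedUids are invented names for the per-segment docID-to-UID map and the accelerated delete set (not the actual Zoie API):

```java
import java.util.Set;

// Illustrative per-segment deleted check against a UID-keyed delete set.
class UidDeleteCheck {
  private final long[] uidOf;          // indexed by segment-local docID
  private final Set<Long> deletedUids; // UIDs deleted/updated in this batch

  UidDeleteCheck(long[] uidOf, Set<Long> deletedUids) {
    this.uidOf = uidOf;
    this.deletedUids = deletedUids;
  }

  // The per-hit cost John mentions: one array read plus one set probe.
  boolean isDeleted(int docId) {
    return deletedUids.contains(uidOf[docId]);
  }
}
```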

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> But, I agree it's wasteful of space when deletes are so sparse... though it is fast.

> It's fast for random access, but it's really slow if you need to make a lot of these (either during heavy indexing if copy-on-write, or during heavy query load if copy-on-reopen).

But how many msec does this clone add in practice?

Note that it's only done if there is a new deletion against that segment.

I do agree it's silly wasteful, but searching should then be faster than using AccelerateIntSet or MultiBitSet. It's a tradeoff of the turnaround time for search perf.

I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

However, I suspect it's not the clone time that's very costly... I bet it's the fact that Lucene has to resolve the deletions to docIDs, in the foreground of reopen, that dominates. And probably also that Lucene doesn't yet use RAMDir (but #2390 is working towards fixing that).

Ie, Zoie is "late binding" (filters out UIDs as it encounters them during searching), while Lucene is "early binding" (immediately resolves UIDs -> docIDs during reopen). And because Zoie does the "resolve deletions to docIDs" in the BG, it's not on any query's execution path.

How does Zoie resolve UID -> docID, now? I remember a thread about this a while back...

Actually one simple fix we could make to Lucene is to resolve deletions in the foreground, when deleteDocuments is called. This'd mean it's the thread that does the updateDocument that pays the price, rather than a future reopen. Net/net it's a zero sum game (just redistributing the cost from the reopen to the indexing), but it'd mean the reopen time is minimized, which is clearly the direction we want to go in. I'll open a new issue.

> So are you using this, only, as your deleted docs? Ie you don't store the deletions with Lucene? I'm getting confused if this is only for the NRT case, or, in general.

> These are only to augment the deleted docs of the disk reader - the disk reader isn't reopened at all except infrequently - once a batch (a big enough RAMDirectory is filled, or enough time goes by, depending on configuration) is ready to be flushed to disk, diskReader.addIndexes is called and when the diskReader is reopened, the deletes live in the normal diskReader's delete set. Before this time is ready, when there is a batch in ram that hasn't been flushed, the IntSetAccelerator is applied to the not-reopened diskReader.

I think I now understand it (plus John's comment that these are Zoie's UIDs not Lucene's docIDs, helped)...

When a doc needs to be updated, you index it immediately into the RAMDir, and reopen the RAMDir's IndexReader. You add its UID to the AcceleratedIntSet, and all searches are "and NOT"'d against that set. You don't tell Lucene to delete the old doc, yet.

Periodically, in the BG, you use addIndexes to push the RAMDir to disk, and, on a perhaps separate schedule, you resolve the deleted UIDs to docIDs and flush them to disk.

One question: does Zoie preserve Lucene's "point in time" searching? Is a new deletion immediately visible to all past reopened readers? I think for Lucene we need to preserve this, so we need a data structure that can be "efficiently" transactional. I guess we could consider allowing an NRT to optionally violate this, in which case we wouldn't need to do any cloning of the deleted docs.

> It's a copy-on-read ThreadLocal.

Hmm – why does each thread make a full clone of the AcceleratedBitSet? Just for thread safety against additions to the set? Or is this somehow preserving "point in time"? And it fully re-clones whenever new updates have been committed to the RAMDir?

> It's actually pretty fantastic performance - check out the zoie perf pages: http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

These are great results! If I'm reading them right, it looks like generally you get faster query throughput, and roughly equal indexing throughput, on upgrading from 2.4 to 2.9?

Zoie also gets much better performance than raw Lucene NRT, but this test focuses on reopen performance, I think? Ie, a query reopens the reader if any new docs were indexed? If you change that to, say, reopen once per N seconds, I wonder how the results would compare.

One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

> check out the zoie perf pages: http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

What's missing as it pertains to this jira issue is the raw query speed (meaning the time a query takes to execute, not QPS, over various percentages of deleted docs), without concurrent indexing.

If query speed goes down, users need to know by how much.

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

I'll try to get those numbers for you; they should be in our logs, and if not, it's easy enough to put them in. My guess is that if zoie is doing 7x the QPS, the latency is significantly less than the NRT latency, not more, but I could be wrong.

Note that without concurrent indexing, the query speed will be the same, as all docs will be flushed to disk and the isDeleted check reduces to exactly the raw lucene case.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> What's missing as it pertains to this jira issue is the raw query speed (meaning the time a query takes to execute, not QPS, over various percentages of deleted docs), without concurrent indexing.

I'd love to see this too. Ie, this would measure the query performance when deletes are checked against AcceleratedIntSet, and wouldn't measure the reopen cost (which Lucene NRT will be slower at) at all. Specifically I'd love to see the worst case queries. It's those queries that drive the cutover to a shard'd architecture.

Ie, we need to separately measure query performance vs reopen performance.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I opened #3122, to resolve deletedDoc(s) to their docID(s) in the foreground.

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

> But how many msec does this clone add in practice? Note that it's only done if there is a new deletion against that segment. I do agree it's silly wasteful, but searching should then be faster than using AccelerateIntSet or MultiBitSet. It's a tradeoff of the turnaround time for search perf.

I actually don't know for sure if this is the majority of the time, as I haven't actually run both the AcceleratedIntSet or 2.9 NRT through a profiler, but if you're indexing at high speed (which is what is done in our load/perf tests), you're going to be cloning these things hundreds of times per second (look at the indexing throughput we're forcing the system to go through), and even if it's fast, that's costly.

> I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

It's pretty easy to change the index and query files in our test to do that, that's a good idea. You can feel free to check out our load testing framework too - it will let you monkey with various parameters, monitor the whole thing via JMX, and so forth, both for the full zoie-based stuff, and where the zoie api is wrapped purely around Lucene 2.9 NRT. The instructions for how to set it up are on the zoie wiki.

> When a doc needs to be updated, you index it immediately into the RAMDir, and reopen the RAMDir's IndexReader. You add its UID to the AcceleratedIntSet, and all searches are "and NOT"'d against that set. You don't tell Lucene to delete the old doc, yet.

Yep, basically. The IntSetAccelerator (of UIDs) is set on the (long lived) IndexReader for the disk index - this is why it's done as a ThreadLocal - everybody is sharing that IndexReader, but different threads have different point-in-time views of how much of it has been deleted.
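A guess at the mechanics of such a copy-on-read ThreadLocal, as a standalone sketch (not Zoie's actual implementation): writers publish whole new sets, and each thread pins the set it first reads until it refreshes, giving a lock-free point-in-time view on the query path.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative copy-on-read ThreadLocal over a shared delete set.
class CopyOnReadDeletes {
  private volatile Set<Long> master = new HashSet<>();
  private final ThreadLocal<Set<Long>> pinned = new ThreadLocal<>();

  synchronized void addDeletes(Set<Long> uids) {
    Set<Long> next = new HashSet<>(master);  // copy, never mutate in place,
    next.addAll(uids);                       // so pinned snapshots stay frozen
    master = next;
  }

  Set<Long> view() {
    Set<Long> v = pinned.get();
    if (v == null) {
      v = master;           // pin the currently published set for this thread
      pinned.set(v);
    }
    return v;
  }

  void refresh() {
    pinned.remove();        // next view() picks up the latest deletes
  }
}
```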

> These are great results! If I'm reading them right, it looks like generally you get faster query throughput, and roughly equal indexing throughput, on upgrading from 2.4 to 2.9?

That's about right. Of course, the comparison between zoie with either 2.4 or 2.9 against lucene 2.9 NRT is an important one to look at: zoie is pushing about 7-9x better throughput for both queries and indexing than NRT.

I'm sure the performance numbers would change if we allowed some non-realtimeness, yes; that's one of the many dimensions to consider here (along with the percentage of indexing events which are deletes, how many of those are from really old segments vs. newer ones, how big the queries are, etc...).

> One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

That sounds interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal, so where would this mark live?

asfimport commented 14 years ago

John Wang (migrated from JIRA)

> I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

I wrote a small program on my laptop: 100 docs in the set, iterating thru 5M numbers and calling contains(). I see 44 ms with BitVector and 64 ms with IntAccelerator backed by IntOpenHashSet (from fastutil).

This is, however, an extreme case, so for test 2 I chose 5000 docs from the set, e.g. mod 1000 to be a candidate for the check. Both sets performed equally, around 45 ms.

So with the memory cost, and the allocations and clones of the BitVector, I think for us at least, using the IntSetAccelerator works well.

> why does each thread make a full clone of the AcceleratedBitSet?

These are for updates, e.g. you updated doc x, it is updated to the ramdir, but it is already on the disk dir. So at query time, you need this set for dup removal.

> I'd love to see this too.

Some more details on the test we ran:

NRT - indexing only
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 622201
TOTAL EXECUTIONS: 622201
TOTAL SUCCESSFUL EXECUTIONS: 622201
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

zoie - indexing only
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 6265384
TOTAL EXECUTIONS: 6265384
TOTAL SUCCESSFUL EXECUTIONS: 6265384
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

zoie - update
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 1923592
TOTAL EXECUTIONS: 1923592
TOTAL SUCCESSFUL EXECUTIONS: 1923592
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

nrt - update
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 399893
TOTAL EXECUTIONS: 399893
TOTAL SUCCESSFUL EXECUTIONS: 399893
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

Latencies:

Zoie - insert test: linear growth from 1 ms to 5 ms as the index grows from 0 docs to 660k docs over the duration of the test.
Zoie - update test: averaged 9 ms, with continuous updates holding the index at 1M docs.
NRT - insert test: fluctuated between 17 ms and 50 ms as the index grows from 0 docs to 220 docs over the duration of the test.
NRT - update test: big peak when querying started (latency spiked up to 550 ms), then dropped and stayed steady at 50 ms, with continuous updates holding the index at 1M docs.

One observation from the NRT update test: I am seeing some delete issues, e.g. realtime deletes do not seem to be reflected, and indexing speed sharply dropped.

It's quite possible that I am not using NRT the most optimal way in my setup. Feel free to run the tests yourself; I'd be happy to help with the setup. One thing with Zoie is that it is a full stream indexing system with a pluggable realtime engine, so you can actually use zoie for perf testing for NRT.

One thing to stress about the test: we are testing realtime updates, so buffering up indexing events and flushing once in a while is not realtime, and katta has already achieved good results with batch indexing with just minutes of delay, without making any internal changes to lucene.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks for running these tests John.

The micro-benchmark of BitVector vs IntAccelerator is nice, but, we need to see it in the real-world context of running actual worst case queries.

Zoie aims for super fast reopen time, at the expense of slower query time since it must double-check the deletions.

Lucene NRT makes the opposite tradeoff.

The tests so far make it clear that Zoie's reopen time is much faster than Lucene's, but they don't yet measure (as far as I can see) what cost the double-check for deletions is adding to Zoie for the worst-case queries.

So if you really need to reopen 100s of times per second, and can accept that your worst case queries will run slower (we're still not sure just how much slower), the Zoie approach is best.

If you want full speed query performance, and can instead reopen once per second or once every 2 seconds, Lucene's approach will be better (though we still have important fixes to make – #3122, LUCENE-1313).

Can you describe the setup of the "indexing only" test? Are you doing any reopening at all?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

> That sounds interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal, so where would this mark live?

Whenever a query happens to "discover" a pending deletion, you could record the UID -> docID mapping somewhere, and then when it's time to flush the deletes you don't need to re-resolve the UIDs that the query had resolved for you. Likely in practice this would be a tiny speedup, though, so the added complexity is probably not worth it. Especially since this resolution is done in the BG for Zoie...

asfimport commented 14 years ago

John Wang (migrated from JIRA)

> we need to see it in the real-world context of running actual worst case queries.

Isn't checking every document in the corpus for deletes the worst case? e.g. the first test?

> at the expense of slower query time

According to the test, Zoie's query time is faster.

> it must double-check the deletions.

True, but this double-check is only done for a candidate hit from the underlying query. Normally the result set is much smaller than the corpus, so the overhead is not large. The overhead is 1 array lookup + a delset lookup vs. 1 bitvector lookup.

> Can you describe the setup of the "indexing only" test?

We start off with an empty index and keep on adding documents; at the same time, for each search request, we return a reader for the current state of the index. Our test assumes 10 concurrent threads making search calls.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> we need to see it in the real-world context of running actual worst case queries.

> Isn't checking every document in the corpus for deletes the worst case? e.g. the first test?

Zoie must do the IntSet check plus the BitVector check (done by Lucene), right?

Ie comparing IntSet lookup vs BitVector lookup isn't the comparison you want to do. You should compare the IntSet lookup (Zoie's added cost) to 0.

So, for a query that hits 5M docs, Zoie will take 64 msec longer than Lucene, due to the extra check. What I'd like to know is what pctg. slowdown that works out to be, eg for a simple TermQuery that hits those 5M results – that's Zoie's worst case search slowdown.

> at the expense of slower query time

> According to the test, Zoie's query time is faster.

The tests so far are really testing Zoie's reopen time vs Lucene's.

To test the query time, you should set up Zoie w/ some pending deletions, then turn off all indexing, then run the query test.

That would give us both extreme datapoints – how much slower Lucene is at reopening (which the current tests show), and how much slower Zoie is during searching.

> it must double-check the deletions.

> True, but this double-check is only done for a candidate hit from the underlying query.

Right, Zoie's search slowdown is in proportion to the size of the result set. So eg for hard queries that produce very few results, the impact will be negligible. For simple queries that produce lots of results, the relative cost is highest (but we don't yet know what it actually is in practice).

It could still be negligible, eg since the "if" rarely triggers, the CPU should be able to predict it just fine.

Net/net, Zoie has faster reopen time than Lucene, but then pays a higher price (double check for deletion) for every result of every search. Users need to know what that price really is, in order to make informed decision about which approach is best for their situation.

> Normally the result set is much smaller than the corpus, so the overhead is not large.

Well, this is generally app dependent, and it's the net/net worst case queries that apps need to worry about. Lucene can't [yet] take advantage of concurrency within a single query, so you're forced to shard (= big step up in deployment complexity) once your worst case queries get too slow.

> Can you describe the setup of the "indexing only" test?

> We start off with an empty index and keep on adding documents; at the same time, for each search request, we return a reader for the current state of the index. Our test assumes 10 concurrent threads making search calls.

Oh I see: that test is just adding documents, vs the 2nd test which is doing updateDocument. Got it.

So, that's interesting... because, with no deletions, thus no resolving of Term -> docID, and no cloning of the BitVector, Lucene's reopen is still quite a bit slower.

What differences remain at this point? Just the fact that the RAM dir is being used to flush the new [tiny] segments? Hmm what about the merge policy?

asfimport commented 14 years ago

John Wang (migrated from JIRA)

> Zoie will take 64 msec longer than Lucene, due to the extra check.

That is not true. If you look at the report closely, it is a 20ms difference; 64ms is the total time. (After I turned on -server, the diff is about 10ms.) This is running on my laptop, hardly a production server.

This is also assuming the entire corpus is returned, where we should really take an average of the result set from the query log.

However, to save this "overhead", using BitVector wastes a lot of memory, which is expensive to clone, allocate and gc. In a running system, much of that cost is hard to measure. This is simply a question of trade-offs.

Again, I would suggest running the tests yourself; after all, it is open source :) and make decisions for yourself. This way, we can get a better understanding from concrete numbers and scenarios.

BTW, is there a performance benchmark/setup for lucene NRT?

> The tests so far are really testing Zoie's reopen time vs Lucene's

That is not true either. This test is simply testing searching with indexing turned on, not specific to re-open. I don't think the statement that the performance difference is solely due to reopen is substantiated. I am seeing the following with NRT:

1) File handle leak - our prod-quality machine fell over after 1 hr of running using NRT due to file handle leaking.
2) CPU and memory starvation - monitoring cpu and memory usage, the machine seems very starved, and I think that leads to performance differences more than the extra array lookup.
3) Correctness issues, e.g. deletes don't get applied correctly. I am not sure about the unit test coverage for NRT to comment specifically.

Again, this can all be specific to my usage of NRT or the test setup. That is why I urge you guys to run our tests yourself and correct us if you see areas we are missing to make a fair comparison.

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

The BitVector memory consumption is 125,000 bytes/122K for 1 million documents. I'd be surprised if copying a byte array has a noticeable performance impact.

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

But Jason, if you're indexing 300 documents a second (possibly all of which are delete+re-add), and querying at a thousand queries a second, how many of these BitVectors are you going to end up making?

asfimport commented 14 years ago

John Wang (migrated from JIRA)

I wrote a little program on my Mac Pro (8 core, 16GB mem, beefy machine). 100 threads mimic real load; each thread loops 100 times doing just a new on a BitVector for an index of numDocs (configurable), and we take the average number.

5M docs: 16 - 18 ms overhead. 10M docs: 35 - 40 ms overhead.

That is not insignificant.

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

> Zoie must do the IntSet check plus the BitVector check (done by Lucene), right?

Yes, so how does Lucene NRT deal with new deletes? The disk-backed IndexReader still does its internal check for deletions, right? I haven't played with the latest patches on #2390, so I'm not sure what has changed, but if you're leaving the disk index alone (to preserve point-in-time status of the index without writing to disk all the time), you've got your in-memory BitVector of newly uncommitted deletes, and then the SegmentReaders from the disk have their own internal deletedDocs BitVector. Are these two OR'ed with each other somewhere? What is done in NRT to minimize the time of checking both of these without modifying the read-only SegmentReader? In the current 2.9.0 code, the segment is reloaded completely on getReader() if there are new add/deletes, right?

> Ie comparing IntSet lookup vs BitVector lookup isn't the comparison you want to do. You should compare the IntSet lookup (Zoie's added cost) to 0.

If you've got that technique for resolving new deletes against the disk-based ones while maintaining point-in-time nature and can completely amortize the reopen cost so that it doesn't affect performance, then yeah, that would be the comparison. I'm not sure I understand how the NRT implementation is doing this currently - I tried to step through the debugger while running the TestIndexWriterReader test, but I'm still not quite sure what is going on during the reopen.

> So, for a query that hits 5M docs, Zoie will take 64 msec longer than Lucene, due to the extra check. What I'd like to know is what pctg. slowdown that works out to be, eg for a simple TermQuery that hits those 5M results - that's Zoie's worst case search slowdown.

Yes, this is a good check to see, though it is still really a micro-benchmark, since it would be done in isolation, with no other production tasks going on (like rapid indexing and the consequent flushes to disk and reader reopening), but it would be useful to see.

What would be even better, however, would be to have a running system whereby there is continual updating of the index, and many concurrent requests are coming in which hit all 5M documents, and measure the mean latency for zoie in this case, both in comparison to NRT and in comparison to lucene when you don't reopen the index (ie. you do things the pre-lucene2.9 way, where the CPU is still being consumed by indexing, but the reader is out of date until the next time it's scheduled by the application to reopen). This would measure the effective latency and throughput costs of zoie and NRT vs non-NRT lucene. I'm not really sure it's terribly helpful to see "what is zoie's latency when you're not indexing at all" - why on earth would you use either NRT or zoie if you're not doing lots of indexing?

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

> 300 documents a second

Whoa, pretty insane volume.

> how many of these BitVectors are you going to end up making?

A handful by pooling the BitVector fixed size bytes arrays (see LUCENE-1574). I'm not sure if the synchronization on the pool will matter. If it does, we can use ConcurrentHashMap like Solr's LRUCache. Granted, JVMs are supposed to be able to handle rapid allocation efficiently, however I can't see the overhead of pooling being too significant. If it is, there's always the default of allocating new BVs.

I really need a solution that will absolutely not affect query performance from what it is today. Personally, pooling is the safest route for me to use in production, as then there are no worries about slowing down queries with alternative deleted docs mechanisms. And the memory allocation is kept within scope. The overhead is System.arraycopy, which no doubt will be insignificant for my use case.

http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.5 http://www.javapractices.com/topic/TopicAction.do?Id=3
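A minimal sketch of such a pool (invented API, not the LUCENE-1574 patch): the clone step of copy-on-write reuses a recycled fixed-size byte[] when one is available, falling back to allocation otherwise.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative pool for the fixed-size byte[] behind cloned bit vectors.
class ByteArrayPool {
  private final ConcurrentLinkedQueue<byte[]> free = new ConcurrentLinkedQueue<>();
  private final int size;

  ByteArrayPool(int size) {
    this.size = size;
  }

  // Clone step of copy-on-write: reuse a pooled buffer when one is free.
  byte[] acquireCopy(byte[] source) {
    byte[] buf = free.poll();
    if (buf == null) {
      buf = new byte[size];   // pool exhausted: fall back to allocation
    }
    System.arraycopy(source, 0, buf, 0, size);
    return buf;
  }

  // Only safe once no reader can still be referencing the buffer.
  void release(byte[] buf) {
    if (buf.length == size) {
      free.offer(buf);
    }
  }
}
```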

I suppose if one has fairly simple queries and is willing to sacrifice query performance for update rate, then other deleted docs mechanisms may be a desired solution. I need to implement a more conservative approach.

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

> Whoa, pretty insane volume.

Aiming for maxing out indexing speed and query throughput at the same time is what we're testing here, and this is a reasonable extreme limit to aim for when stress-testing real-time search.

> A handful by pooling the BitVector fixed size bytes arrays (see LUCENE-1574).

Pooling, you say? But what if updates come in too fast to reuse your pool? If you're indexing at the speeds I'm describing, won't you run out of BitVectors in the pool?

> I really need a solution that will absolutely not affect query performance from what it is today

"You" really need this? Why is the core case for real-time search a scenario where taking a hit of a huge reduction in throughput worth a possible gain in query latency? If the cost was 20% query latency drop in exchange for 7x throughput cost when doing heavy indexing, is that worth it? What about 10% latency cost vs 2x throughput loss? These questions aren't easily answered by saying real-time search with Lucene needs to absolutely not affect query performance from what it is today. These kinds of absolute statements should be backed up by comparisons with real performance and load testing.

There are many axes of performance to optimize for (query latency, query throughput, indexing throughput, index freshness), and saying that one of these is absolutely of more importance than the others without real metrics showing which ones are affected in which ways by different implementation choices is doing a disservice to the community, and is not by any means "conservative".

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> 1) file handle leak - Our prod-quality machine fell over after 1 hr of running using NRT due to file handle leaking. 2) cpu and memory starvation - monitoring cpu and memory usage, the machine seems very starved, and I think that leads to performance differences more than the extra array lookup. 3) I am also seeing correctness issues, e.g. deletes don't get applied correctly. I am not sure about the unit test coverage for NRT to comment specifically.

These sound serious – if you can provide any details, that'd help. I'll do some stress testing too. Thanks for testing and reporting ;)

> Yes, so how does Lucene NRT deal with new deletes?

Lucene NRT makes a clone of the BitVector for every reader that has new deletions. Once this is done, searching is "normal" – it's as if the reader were a disk reader. There's no extra checking of deleted docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.

Yes, this makes Lucene's reopen more costly. But, then there's no double checking for deletions. That's the tradeoff, and this is why the 64 msec is added to Zoie's search time. Zoie's searches are slower.

The fact that Zoie on the pure indexing case (ie no deletions) was 10X faster than Lucene is very weird – that means something else is up, besides how deletions are carried in RAM. It's entirely possible it's the fact that Lucene doesn't flush the tiny segments to a RAMDir (which #2390 addresses). Or, maybe there's another difference in that test (eg, MergePolicy?). Jake or John, if you could shed some light on any other specific differences in that test, that would help.

> This is simply a question of trade-offs.

Precisely: Zoie has faster reopen time, but slower search time. But we haven't yet measured how much slower Zoie's searches are.

Actually I thought of a simple way to run the "search only" (not reopen) test – I'll just augment TopScoreDocCollector to optionally check the IntSetAccelerator, and measure the cost in practice, for different numbers of docs added to the IntSet.

> BTW, is there a performance benchmark/setup for lucene NRT?

In progress – see #3126.

> Aiming for maxing out indexing speed and query throughput at the same time is what we're testing here, and this is a reasonable extreme limit to aim for when stress-testing real-time search.

But your test is missing a dimension: frequency of reopen. If you reopen once per second, how do Zoie/Lucene compare? Twice per second? Once every 5 seconds? Etc.

It sounds like LinkedIn has a hard requirement that the reopen must happen hundreds of times per second, which is perfectly fine. That's what LinkedIn needs. But other apps have different requirements, and so to make an informed decision they need to see the full picture.

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> if you're indexing 300 documents a second (possibly all of which are delete+re-add), and querying at a thousand queries a second, how many of these BitVectors are you going to end up making?

Hopefully not much more than a few per second?

We should be careful what we measure to ensure that we're targeting the right use cases. Seems like almost all apps should be well served by second reopen resolution on average (with the ability to immediately reopen on demand). The only thing that would seem to need lower latency is when an automated client does an add, and then immediately does a query and needs to see it. In that case, that client could specify that they need an immediate reopen (either during the add or the query).

Requirements calling for zero latency updates (all index changes are always visible) are often in error (i.e. it's usually not a true requirement).

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

> These sound serious - if you can provide any details, that'd help. I'll do some stress testing too. Thanks for testing and reporting

Out of these, the specific issue of incorrectness of applied deletes is easiest to see - we saw it by indexing up to a million docs, then continuing to add docs, but only after doing a delete on the UID, where the UID, instead of increasing, is looped around mod 1 million. Calling numDocs (not maxDoc) on the reader with Zoie always returns 1M after looping around, but with NRT, it starts slowly growing above 1M.

The CPU and memory starvation is undoubtedly due to the constant reopening of these readers, and yes, it could be alleviated by not doing this - we're just comparing to the zoie case, where we do reopen (the RAMDir) on every request (and copy the delSet) if there have been modifications since the last update.

> Lucene NRT makes a clone of the BitVector for every reader that has new deletions. Once this is done, searching is "normal" - it's as if the reader were a disk reader. There's no extra checking of deleted docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.

Ok, so if this is copy-on-write, it's done every time there is a new delete for that segment? If the disk index is optimized that means it would happen on every update, a clone of the full numDocs sized BitVector? I'm still a little unsure of how this happens.

> Yes, this makes Lucene's reopen more costly. But, then there's no double checking for deletions. That's the tradeoff, and this is why the 64 msec is added to Zoie's search time. Zoie's searches are slower.

So we re-ran some of our tests last night, commenting out our deleted check to measure its cost in the most extreme case possible: a dead easy query (in that it's only one term), but one which yes, hits the entire index (doing a MatchAllDocs query is actually special-cased in our code, and is perfectly fast, so not a good worst case to check). As the index grows above a million documents, zoie could shave somewhere from 22-28% of its time off by not doing the extra check.

We haven't re-run the test to see what happens as the index grows to 5M or 10M yet, but I can probably run that later today.

> The fact that Zoie on the pure indexing case (ie no deletions) was 10X faster than Lucene is very weird - that means something else is up, besides how deletions are carried in RAM. It's entirely possible it's the fact that Lucene doesn't flush the tiny segments to a RAMDir (which #2390 addresses).

Yeah, if you call getReader() a bunch of times per second, each one does a flush(true,true,true), right? Without having #2390, this kills the indexing performance if querying is going on. If no getReader() is being called at all, Zoie is about 10% slower than pure Lucene IndexWriter.add() (that's the cost of doing it in two steps - index into two RAMDirs [so they are hot-swappable] and then writing segments to disk with addIndexesNoOptimize() periodically).

I don't think there's any difference in the MergePolicy - I think they're both using the BalancedMergePolicy (since that's the one which is optimized for the realtime case).

> Actually I thought of a simple way to run the "search only" (not reopen) test - I'll just augment TopScoreDocCollector to optionally check the IntSetAccelerator, and measure the cost in practice, for different numbers of docs added to the IntSet.

Due to the bloomfilter living on top of the hashSet, at least at the scales we're dealing with, we didn't see any change in cost due to the number of deletions (zoie by default keeps no more than 10k modifications in memory before flushing to disk, so the biggest the delSet is going to be is that, and we don't see the more-than-constant scaling yet at that size).

> But your test is missing a dimension: frequency of reopen. If you reopen once per second, how do Zoie/Lucene compare? Twice per second? Once every 5 seconds? Etc.

Yep, this is true. It's a little more invasive to put this into Zoie, because the reopen time is so fast that there's no pooling, so it would need to be kinda hacked in, or tacked on to the outside. Not rocket science, but not just the change of a parameter.

LinkedIn doesn't have any hard requirements of having to reopen hundreds of times per second, we're just stressing the system, to see what's going on. As you can see, nobody's filing a bug here that Lucene NRT is "broken" because it can't handle zero-latency updates. What we did try to make sure was in the system was determinism: not knowing whether an update will be seen because there is some background process doing addIndexes from another thread which hasn't completed, or not knowing how fresh the pooled reader is, that kind of thing.

This kind of determinism can certainly be gotten with NRT, by locking down the IndexWriter wrapped up in another class to keep it from being monkeyed with by other threads, and then tuning exactly how often the reader is reopened, and then dictate to clients that the freshness is exactly at or better than this freshness timeout, sure. This kind of user-friendliness is one of Zoie's main points - it provides an indexing system which manages all this, and certainly for some clients, we should add in the ability to pool the readers for less real-timeness, that's a good idea.

Of course, if your reopen() time is pretty heavy (lots of FieldCache data / other custom faceting data needs to be loaded for a bunch of fields), then at least for us, even not needing zero-latency updates means that the more realistic 5-10% degradation in query performance for normal queries is negligible, and we get deterministic zero-latency updates as a consequence.

This whole discussion reminded me that there's another realtime update case, which neither Zoie nor NRT is properly optimized for: the absolutely-zero-deletes case with very fast indexing load and the desire for minimal latency of updates (imagine that you're indexing twitter - no changes, just adds), and you want to be able to provide a totally stream-oriented view on things as they're being added (matching some query, naturally) with sub-second turnaround. A subclass of SegmentReader could be instantiated which doesn't even have a deleted set, and the deleted check could be removed entirely, speeding things up even further.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> These sound serious - if you can provide any details, that'd help. I'll do some stress testing too. Thanks for testing and reporting

> Out of these, the specific issue of incorrectness of applied deletes is easiest to see - we saw it by indexing up to a million docs, then continuing to add docs, but only after doing a delete on the UID, where the UID, instead of increasing, is looped around mod 1 million. Calling numDocs (not maxDoc) on the reader with Zoie always returns 1M after looping around, but with NRT, it starts slowly growing above 1M.

So far I've had no luck repro'ing this. I have a 5M doc wikipedia index. Then I created an alg with 2 indexing threads (each replacing docs at 100 docs/sec), and reopening ~60 times per second. Another thread then verifies that the docCount is always 5M. It's run fine for quite a while now...

Hmm maybe I need to try the balanced merge policy? That would be spooky if it caused the issue...

asfimport commented 14 years ago

John Wang (migrated from JIRA)

Correction: We are NOT using BalancedMergePolicy for NRT test, only default is used. BalancedMergePolicy is only used inside zoie.

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

As far as the implementation of this patch goes... I'm thinking we can increase the page size, and do a simulated setNextPage type of deal in SegmentTermDocs. We'll start at the first page; if we hit an ArrayIndexOutOfBoundsException, we figure out what page they're trying to access and return true|false for the bit. We can continue this process of accessing iteratively on a page until we hit the AIOOBE, then figure it out again. I think this is a good approach because Java arrays already perform the access bounds check, hitting an exception hopefully shouldn't be too costly if it only happens a handful of times per posting iteration, and then we're avoiding the small but extra array access lookup for each bit. We'll be executing the same access pattern as today (i.e. random access on the byte array with about the same overhead, with the AIOOBE occurring when a page is completed). We can benchmark and see the performance difference.
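Schematically, that access pattern might look like this sketch (illustrative only, not the patch): probe the current page with no per-access page math, and let the JVM's own bounds check signal that the iteration crossed into another page.

```java
// Illustrative AIOOBE-driven paged reader over a byte[][] bit vector.
class PagedBitReader {
  private final byte[][] pages;
  private final int pageBytes;
  private byte[] page;          // page currently being read
  private int pageStartBit;     // absolute bit index of page[0]

  PagedBitReader(byte[][] pages, int pageBytes) {
    this.pages = pages;
    this.pageBytes = pageBytes;
    this.page = pages[0];
    this.pageStartBit = 0;
  }

  boolean get(int bit) {
    try {
      int rel = (bit - pageStartBit) >> 3;    // byte offset within current page
      return (page[rel] & (1 << (bit & 7))) != 0;
    } catch (ArrayIndexOutOfBoundsException e) {
      // Rare per posting iteration: reposition to the right page and retry.
      int pageIdx = (bit >>> 3) / pageBytes;
      page = pages[pageIdx];
      pageStartBit = pageIdx * pageBytes * 8;
      return get(bit);
    }
  }
}
```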

I think in general, we'll simply benchmark the options we've discussed: 1) NRT, 2) 1313, 3) 1313 + pooling, 4) 1313 + 1526, 5) 1313 + 1526 w/ iterative sequential page access.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Correction: We are NOT using BalancedMergePolicy for NRT test, only default is used. BalancedMergePolicy is only used inside zoie.

OK.

I've also tried doing separate delete then add in the indexer threads, and still I don't see any deletions getting lost... I can't repro this correctness issue.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> 2) cpu and memory starvation - monitoring cpu and memory usage, the machine seems very starved, and I think that leads to performance differences more than the extra array lookup.

CPU starvation is fully expected (this is a redline test).

Memory starvation is interesting, because the bit vectors should all be transient, and should "die young" from the GC's standpoint. Plus these are all 1/8th the number of docs in RAM usage, and it's only those segments that have deletions whose bit vector is cloned. Are you starting from an optimized index?

Oh, here's one idea: how many searches does your test allow to be in-flight at once? (Or: how large a thread pool are you using on the server?) Since you effectively reopen per search, each search will have dup'd the deleted docs. If you allow many searches in flight, that could explain it.