apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Tune DocIdSetBuilder allocation rate [LUCENE-7258] #8313

Closed: asfimport closed this issue 8 years ago

asfimport commented 8 years ago

#8266 converted IntersectsPrefixTreeQuery to use DocIdSetBuilder, but didn't actually reduce garbage generation for my Solr index.

Since something like 40% of my garbage (by space) is now attributed to DocIdSetBuilder.growBuffer, I charted a few different allocation strategies to see if I could tune things more.

See here: http://i.imgur.com/7sXLAYv.jpg

The jump-then-flatline at the right would be where DocIdSetBuilder gives up and allocates a FixedBitSet for a 100M-doc index. (The 1M-doc index curve/cutoff looked similar.)

Perhaps unsurprisingly, the 1/8th growth factor in ArrayUtil.oversize is terrible from an allocation standpoint if you're doing a lot of expansions, and is especially terrible when used to build a short-lived data structure like this one. By the time it goes with the FBS, it's allocated around twice as much memory for the buffer as it would have needed for just the FBS.
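
As a back-of-the-envelope illustration (my own sketch, not from the issue), here is roughly how that plays out for a 100M-doc index, assuming the 1/128th upgrade threshold discussed later in this thread: the cumulative bytes allocated while repeatedly growing the int buffer via ArrayUtil.oversize end up well above the size of the FixedBitSet the builder eventually switches to.

    import org.apache.lucene.util.ArrayUtil;

    public class GrowthCost {
      public static void main(String[] args) {
        int maxDoc = 100_000_000;
        int threshold = maxDoc >>> 7;      // ~1/128th of maxDoc: buffer size at which we upgrade
        long fbsBytes = maxDoc / 8L;       // FixedBitSet needs one bit per doc, ~12.5 MB
        long allocated = 0;
        int size = 1;
        while (size < threshold) {
          size = ArrayUtil.oversize(size + 1, Integer.BYTES); // ~9/8 growth per step
          allocated += (long) size * Integer.BYTES;           // each step allocates a new, larger array
        }
        System.out.printf("buffer growth allocated ~%d MB vs ~%d MB for the FixedBitSet%n",
            allocated >> 20, fbsBytes >> 20);
      }
    }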

allocation_plot.jpg


Migrated from LUCENE-7258 by Jeff Wartes, resolved May 23 2016. Attachments: allocation_plot.jpg, LUCENE-7258.patch, LUCENE-7258-expanding.patch, LUCENE-7258-Tune-memory-allocation-rate-for-Intersec.patch (versions: 2)

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Wonderful graphs Jeff Wartes!

Any thoughts on this one @jpountz? A 2x growth rate makes more sense to me, and FWIW it aligns with what java.util.ArrayList does. org.apache.lucene.util.ArrayUtil.oversize says:

    // asymptotic exponential growth by 1/8th, favors
    // spending a bit more CPU to not tie up too much wasted
    // RAM:

I have had blind faith in simply using ArrayUtil.grow/oversize whenever I need to grow an array, but it's shaken now. I guess it depends: if the array to be built is temporary, then don't use it and grow by 2x instead; but if it may be long-lived, then use it.

asfimport commented 8 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Seems like we should do the same thing that was done for SOLR-8922 - Don't reallocate a single array, keep a list of them.

For extra credit, one could even try to eliminate the cost of coalescing all of the arrays into a single one by incorporating that step into the first step of the radix sort.
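
A minimal sketch of that idea (assumed names, not SOLR-8922's actual ExpandingIntArray code): collect into a list of progressively larger int[] buffers so nothing is ever copied while collecting, and only concatenate once at the end (or, for the extra credit above, feed the buffers straight into the radix sort).

    import java.util.ArrayList;
    import java.util.List;

    final class BufferedDocIds {
      private final List<int[]> full = new ArrayList<>(); // completely filled buffers
      private int[] current = new int[64];
      private int currentLen = 0;
      private int total = 0;

      void add(int doc) {
        if (currentLen == current.length) {
          full.add(current);
          current = new int[total]; // next buffer as large as everything so far: ~2x growth overall
          currentLen = 0;
        }
        current[currentLen++] = doc;
        total++;
      }

      /** Single copy at the end, e.g. before handing the docs to a sorter. */
      int[] toArray() {
        int[] all = new int[total];
        int offset = 0;
        for (int[] buffer : full) {
          System.arraycopy(buffer, 0, all, offset, buffer.length);
          offset += buffer.length;
        }
        System.arraycopy(current, 0, all, offset, currentLen);
        return all;
      }
    }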

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

The "eia" label represents using the ExpandingIntArray approach from SOLR-8922. It suffered somewhat in my plot because I accounted for the fact that when you're done collecting, you need to convert it to a single array for sorting purposes. (if you haven't overflowed into a FBS, anyway, and want to use the usual Sorters.)

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

This patch does the following:

  1. Moves the FBS threshold from 1/128th to 1/256th for IntersectsPrefixTreeQuery.
  2. Changes the expansion policy to 2x when used by IntersectsPrefixTreeQuery.
  3. Changes the sort algorithm in DocIdSetBuilder (for ALL usages) to InPlaceMergeSorter, since LSBRadixSorter requires allocating a new array of size N.
  4. In order to do items 1 & 2, I had to add parameter support for the threshold and expansion policies (sketched below, after the justifications).

Justifications:

  1. Since Geospatial data is typically non-uniform, a smaller threshold seemed reasonable.
  2. A more aggressive expansion policy results in less wasted allocations, particularly for short-lived data structures.
  3. This one might be controversial since it affects more than just geospatial search, but I thought I'd see what happened if I saved the memory. I also considered TimSort, which has a configurable memory cost, but #6204 gave me some pause.
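
For illustration only, a hypothetical sketch of what that parameter support might look like; the constructor shape and field names are mine, not necessarily what the attached patch does.

    public final class DocIdSetBuilder {
      private final int maxDoc;
      private final int threshold;      // buffer length at which we upgrade to a FixedBitSet
      private final float growthFactor; // how aggressively the int buffer expands

      // existing default behavior: ~1/128th threshold, ArrayUtil-style (~9/8) growth
      public DocIdSetBuilder(int maxDoc) {
        this(maxDoc, maxDoc >>> 7, 1.125f);
      }

      // IntersectsPrefixTreeQuery would pass (maxDoc >>> 8) and 2.0f per the patch
      public DocIdSetBuilder(int maxDoc, int threshold, float growthFactor) {
        this.maxDoc = maxDoc;
        this.threshold = threshold;
        this.growthFactor = growthFactor;
      }
    }
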
asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

I put this patch on a production node this morning; it looks like the allocation rate went down about 10%, which I think is pretty good considering only about 15% of my queries even have a geospatial component.

CPU usage has not changed enough for me to notice.

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

Attaching the graph directly.

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

Random aside: I did do one test run where I changed all usages of ArrayUtil.oversize to use an expansion of 2x. I recall this increased overall allocations on my test query corpus by about 4%, compared to the 1/256th-threshold + 2x-growth change applied only to IntersectsPrefixTreeQuery.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

> Any thoughts on this one Adrien Grand?

Good question, I don't really know. :) I think Python uses 9/8 like ArrayUtil and Java uses 1.5. I have heard arguments against growth factors of 2, since they prevent freed memory from being reused (you always need to allocate more than what has been freed so far), but it's not clear to me how that applies to a managed runtime like Java. I think the approach from SOLR-8922 is worth trying too.
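
To spell out the usual form of that argument: with a growth factor of 2, the blocks freed by all previous expansions sum to

    1 + 2 + 4 + ... + 2^(k-1) = 2^k - 1

which is always smaller than the next request of 2^k, so a contiguous, non-moving allocator can never satisfy the next grow from the space it has already freed; with a smaller factor such as 1.5, the freed space eventually exceeds the next request and can be reused. How much of this matters under a compacting JVM collector is, as noted above, unclear.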

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

After 24 hours, I found I could discern a CPU penalty on my patched node. I removed the change in sort algorithm, and that seems to have resolved it without significantly changing the allocation savings.

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

I'd be interested in trying TimSort, or something like what @yonik suggested where an ExpandingIntArray-style array of arrays is fed directly into the radix sort, but I'm not sure I'm going to be able to commit much more time to this for a bit.

That said, in the process of thinking about this, I do have a few git stashes saved off with sketches for things like using TimSort and using ExpandingIntArray that I could try to clean and post if anyone is interested.

I also have one sketch I started for using a loose pool mechanism to front acquiring a FixedBitSet, but I didn't get deep enough to be able to tell with confidence that a FBS was actually not being used anymore. Things like the public FixedBitSet.getBits() method make it scary, although I'm convinced even a very small pool of large FixedBitSets could be extremely advantageous. There aren't that many in use at any given time, and a large FBS can still be used for a small use-case. If anyone has some pointers around the lifecycle here, I'd love to hear them.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I played with the scaling factor and the "poly 5" geo benchmark by temporarily switching from MatchingPoints to DocIdSetBuilder in LatLonPointInPolygonQuery. I got the following QPS:

    scaling factor = 9/8 (like in master)   48.3
    scaling factor = 5/4                    49.3
    scaling factor = 3/2                    50.2
    scaling factor = 2                      50.9
    MatchingPoints                          51.7

This gets DocIdSetBuilder closer to the throughput of MatchingPoints in spite of the fact that it tries to deal better with the sparse case. Given that wasting space is not a big deal for this class (the data will be trashed once the query finishes running), I would be in favor of moving to a scaling factor of 3/2 or 2.

Regarding reusing fixed bitsets, I think the only way would be to keep state on the index searcher and then have access to the cache in Query.createWeight. But I don't think I would like it: this looks quite dangerous to me as bit sets can take a lot of memory and you need a different cache per thread (if your index has 1B documents, you would need 120MB per thread for a single FixedBitSet, while a single query may need to create several of them).

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

I'm not sure I understand how the dangers of large FBS size would be any different with a pooling mechanism than they are right now. If a query needs several of them, then it needs several of them, whether they're freshly allocated or not. The only real difference I see might be whether that memory exists in the tenured space, rather than thrashing the eden space every time.

I don't think it'd need to be per-thread. I don't mind points of synchronization if they're tight and well understood. Allocation rate by count is generally lower here. One thought: https://gist.github.com/randomstatistic/87caefdea8435d6af4ad13a3f92d2698

To anticipate some objections, there are likely lockless data structures you could use, and yes, you might prefer to control size in terms of memory instead of count. I can think of a dozen improvements per minute I spend looking at this. But you get the idea. Anyone anywhere who knows for sure they're done with a FBS can offer it up for reuse, and anyone can potentially get some reuse by just changing their "new" to "request". If everybody does this, you end up with a fairly steady pool of FBS instances large enough for most uses. If only some places use it, there's no chance of an unbounded leak, you might get some gain, and worst-case you haven't lost much. If nobody uses it, you've lost nothing.
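
A minimal sketch of the kind of pool being described (names and sizing policy are mine and purely illustrative, not the linked gist's actual code):

    import java.util.concurrent.ConcurrentLinkedDeque;
    import org.apache.lucene.util.FixedBitSet;

    final class FixedBitSetPool {
      private static final int MAX_POOLED = 8; // bound the pool by instance count
      private final ConcurrentLinkedDeque<FixedBitSet> pool = new ConcurrentLinkedDeque<>();

      /** Return a cleared bitset of at least numBits, reusing a pooled one if it is large enough. */
      FixedBitSet request(int numBits) {
        for (FixedBitSet bits = pool.poll(); bits != null; bits = pool.poll()) {
          if (bits.length() >= numBits) {
            bits.clear(0, bits.length()); // a large FBS can still serve a small request
            return bits;
          }
          // too small for this request: drop it and let the GC reclaim it
        }
        return new FixedBitSet(numBits);
      }

      /** Offer a bitset back once the caller knows for certain it is no longer referenced. */
      void release(FixedBitSet bits) {
        if (pool.size() < MAX_POOLED) {
          pool.offer(bits);
        }
      }
    }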

Last I checked, something like a full 50% of (my) allocations by size were FixedBitSets despite a low allocation rate by count, or I wouldn't be harping on the subject. As a matter of principle, I'd gladly pay heap to reduce GC. The fastest search algorithm in the world doesn't help me if I'm stuck waiting for the collector to finish all the time.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

When I said per thread, I did not mean that the pool needs to be per thread; I just wanted to highlight that if you want N concurrent requests to all use this cache, you need a cache of quite a significant size. Sorry for not being encouraging, but I have worked on similar caches before and was not happy with the end result: it is sometimes hard to know when an instance can actually be recycled, it is hard to know when the cache should give memory back to the JVM (especially for a library), the fact that you often end up using oversized instances encourages the use of paging but then things are slower, etc. Relying on the JVM makes things simpler.

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

It appears to me that caching bitsets is a much easier task than most other caches I've seen – there is no key, and most (half on average?) bitsets in the cache will be long enough to be re-used by some subsequent lookup. RE knowing when an instance can be recycled – if it were in conjunction with the QueryCache, then on eviction the bitset could be put into a bitset cache. RE knowing when bitsets should be evicted, especially for a library – make that configurable/disableable? My main concern with such a cache is its overall code impact – how many places would be touched by it. Perhaps a lot, but maybe not too bad? And of course, for what measurable benefit? I imagine some of the GC cost of the current situation can be addressed with GC tuning – say, raising the young gen via -Xmn.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we should pool any bitsets. This is the job of the garbage collector.

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

There are actually three threads going on this ticket right now: the “what threshold and expansion to use for geospatial” question that I’d originally intended and provided a patch for, the “what expansion for DocIdSetBuilder is generically optimal” question, and the “FBS is 50% of my allocation rate, can we pool” conversation.

I think the latter is a worthy conversation, and I don’t have a better place for it, so I’m going to continue to respond to the comments along those lines, (with apologies for the book I’m writing here) but I wanted to point out the divergence.

So, I certainly understand a knee-jerk reaction against using object pools of any kind. Yes, this IS what the JVM is for. It’s easier and simpler and lower maintenance to just use what’s provided. But I could also argue that Arrays.sort has all those same positive attributes, and that hasn’t stopped several hand-written sort algorithms from getting into this codebase. The question is actually whether the easy and simple thing is good enough, or whether the harder thing has a sufficient offsetting benefit. Everyone on this thread is a highly experienced programmer; we all know this.

In this case, that means the question is actually whether the allocation rate is “good enough” or if there's a sufficiently offsetting opportunity for improvement, and arguments should ideally come from that analysis.

I can empirically state that for my large Solr index, that GC pause is the single biggest detriment to my 90+th percentile query latency. Put another way, Lucene is fantastically fast, at least when the JVM isn’t otherwise occupied. Because of shard fan-out, a per-shard p90 latency very quickly becomes a p50 latency for queries overall. (Even with mitigations like SOLR-4449) I don’t think there’s anything particularly unique to my use-case in anything I just said, except possibly the word “large”.

As such, I consider this an opportunity for improvement, so I’ve suggested a mitigation strategy. It clearly has some costs. I’d be delighted to entertain any alternative strategies.

Actually, @dsmiley did bring up one alternative suggestion for improvement, so let’s talk about -Xmn:

First, let’s assume that Lucene’s policy on G1 hasn’t changed, and we’re still talking about ParNew/CMS. Second, with the exception of a few things like cache, most of the allocations in a Solr/Lucene index are very short-lived. So it follows that given a young generation of sufficient size, the tenured generation would actually see very little activity.

The major disadvantage to just using a huge young generation then is that there aren’t any concurrent young-generation collectors. The bigger it is, the less frequently you need to collect, but the longer the stop-the-world GC pause when you do. On the other end of the scale, a very small young space means shorter pauses, but far more frequent. Since almost all garbage is short-lived, maybe now you're doing young-collections so often that you’ve got the tenured collector doing a bunch of the work cleaning up short-lived objects too. (This can actually be a good thing, since the CMS collector is mostly concurrent)

There’s some theoretical size that optimizes frequency vs pause for averaged latency. Perhaps even by deliberately allowing some premature overflow into tenured simply because tenured can be collected concurrently. This kind of thing is extremely delicate to tune for though, especially since query rate (and query type distribution) can fluctuate. It’s easy to get it wrong, such that a sudden large-allocation slams past the rate CMS was expecting and triggers a full-heap stop-the-world pause.
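
For concreteness, the knobs being discussed look something like the following; the values are purely illustrative, not a recommendation.

    -Xms31g -Xmx31g -Xmn8g \
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly

A bigger -Xmn means fewer but longer stop-the-world young collections; the CMS occupancy settings control how early the mostly-concurrent old-gen collector starts, which is the lever for absorbing premature promotion.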

I’m focusing on FBS here because: 1. Fifty Percent. 2. These are generally larger objects, so mitigating those allocations seemed like a good way to mitigate unexpected changes in allocation rate and allow more stable tuning.

There’s probably also at least one Jira issue around looking at object count allocation rate (vs size), since I suspect the single biggest factor in collector pause is the object count. Certainly I can point to objects that get allocated (by count) with orders of magnitude greater frequency than the next highest count. But since I don’t have as good an understanding of the use cases, let alone any suggestions yet, I’ve left that for another time.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If you have problems from large FixedBitSets, maybe the first thing to address is that top-level Solr filter cache :)

asfimport commented 8 years ago

Jeff Wartes (migrated from JIRA)

Ok, yeah, that’s a reasonable thing to assume. We usually think of it in terms of cpu work, but filter caches would be an equally great way to mitigate allocations. But a cache is really only useful when you’ve got non-uniform query distributions, or enough time-locality at your query rate that your rare queries haven’t faced a cache eviction yet.

I’m indexing address-type data. Not uncommon. I think that if my typical geospatial search were based on some hyper-local phone location, we’d be done talking, since a filter cache would be useless.

So maybe we should assume I’m not doing that.

Let’s assume I can get away with something coarse. Let’s assume I can convert all location based queries to the center point of a city. Let’s further assume that I only care about one radius per city. Finally, let’s assume I’m only searching in the US. There are some 40,000 cities in the US, so those assumptions yield 40,000 possible queries. That’s not too bad.

With a 100M-doc core, I think that’s about 12.5MB per filter cache entry. It could be less, I think, particularly with the changes in SOLR-8922, but since we’re only going with coarse queries, it’s reasonable to assume there’s going to be a lot of hits. I don’t need every city in the cache, of course, so maybe… 5%? That’s only some 25G of heap. Doable, especially since it saves allocation size and you could probably trade in more of the eden space (although this would make warmup more of a pain). I’d probably have to cross the CompressedOops boundary at 32G of heap to do that too though, so add another 16G to get back to baseline.
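
Spelling that arithmetic out:

    100,000,000 docs / 8 bits per byte ≈ 12.5 MB per cached FixedBitSet
    40,000 cities x 5%                 = 2,000 cache entries
    2,000 entries x 12.5 MB            = 25 GB of heap for the filter cache alone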

Fortunately, the top 5% of cities probably maps to more than 5% of queries. More populated cities are also more likely targets for searching in most query corpuses. So assuming it’s the biggest 5% that are in the cache, maybe we can assume a 15% hit rate? 20%?

Ok, so now I’ve spent something like 41G of heap, and I’ve reduced allocations by 20%. Is this pretty good?

I suppose it’s worth noting that this also assumes a perfect cache eviction policy, (I’m pretty interested in SOLR-8241) and that there’s no other filter cache pressure. (At the least, I’m using facets - SOLR-8171)

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Out of curiosity, I played with Solr's ExpandingIntArray approach, which consists of accumulating buffers of exponentially increasing sizes. In the attached patch, the size of a new buffer is equal to the sum of the sizes of the buffers that have been allocated so far, so in terms of growth rate it is similar to a scaling factor of 2. Since we need to reserve space up-front, I added a protection that resizes the current array rather than adding a new one if it is less than 7/8 full; otherwise it could leave a non-negligible amount of wasted space (e.g. if you always call grow(remainingNumberOfSlotsInCurrentBuffer + 1) and then only call add() once).

Interestingly it performed significantly better than the current approach, so maybe we should switch to it?

    Benchmark                               Poly 5   Box
    scaling factor = 9/8 (like in master)   48.5     65.6
    scaling factor = 5/4                    49.2     67.6
    scaling factor = 3/2                    50.2     69.1
    scaling factor = 2                      50.8     69.6
    ExpandingIntArray-style                 51.6     71.0
    MatchingPoints                          51.8     71.7

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

nit: typo in comment line 96: "cumulated" -> "accumulated"

In ensureBufferCapacity, when buffers.isEmpty, I think the first buffer should have a minimum size of 64 (or 32?), not 1. This will avoid a possible slow start of small buffers when numDocs is 0 or 1. At least I saw this while setting a breakpoint in some spatial tests, seeing the first two buffers both of size one and the 3rd of size two, etc. Also in this method...

    if (current.length < current.array.length - (current.array.length >>> 2)) {
      // current buffer is less than 7/8 full, resize rather than waste space

That calculation is not 7/8, it's 3/4.
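
For reference, the shift arithmetic:

    x - (x >>> 2) == x - x/4 == 3/4 of x   (what the code checks)
    x - (x >>> 3) == x - x/8 == 7/8 of x   (what the comment claims)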

In concat(), it'd be helpful to comment that it not only concatenates but also leaves one additional slot. Also... do you think it might be worth optimizing for the case where there is a single buffer that can simply be returned? If, when this happens, that buffer tends to be exactly full, then maybe when we allocate new buffers we could leave one additional slot so that this case happens more often.

For readability's sake, can you re-order the methods grow, ensureBufferCapacity, addBuffer, growBuffer, and upgradeToBitSet to be in this sequence (or thereabouts), as that is the sequence of who calls whom? I find it much easier to read code top to bottom than bottom up :-) Likewise, build() could be defined before the private utility methods it calls.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for the catches, I played with many different options and my comments went out of sync with the code. :)

> In ensureBufferCapacity, when buffers.isEmpty, I think the first buffer should have a minimum size of 64 (or 32?), not 1. This will avoid a possible slow start of small buffers when numDocs is 0 or 1. At least I saw this while setting a breakpoint in some spatial tests, seeing the first two buffers both of size one and the 3rd of size two, etc.

Fair enough.

> do you think it might be worth optimizing for the case that there is one buffer that can simply be returned?

Good question. I believe this would only help on small segments, but this sounds easy so maybe we should do it.

I'll do the reorderings.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@dsmiley Here is a new patch that should address your comments. Numbers are the same as those reported in my previous comment.

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I reviewed the patch... just one issue I see: concat() now has a comment that it might return a re-used buffer but I don't see that it does? (or am I wrong?) Otherwise, looks good.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@dsmiley This is because of the following block of code. If the largest buffer is large enough to hold all docs then we reuse it. I tried to generalize your suggestion to reuse the buffer if there is a single buffer.

+    int[] docs = largestBuffer.array;
+    if (docs.length < totalLength + 1) {
+      docs = Arrays.copyOf(docs, totalLength + 1);
+    }

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Ooooh, I'm blind sorry ;-P This is even better than my original idea since the array can be re-used if there is more than one buffer in some circumstances. Cool. +1 to commit; nice Adrien!

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit bcc4e8709e8de3ae9688304be32a2b4b39bc0c03 in lucene-solr's branch refs/heads/branch_6x from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bcc4e87

LUCENE-7258: Speed up DocIdSetBuilder allocation.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 052fb97f82862dfa62f0e37f572523ba619be4ea in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=052fb97

LUCENE-7258: Speed up DocIdSetBuilder allocation.