Adrien Grand (@jpountz) (migrated from JIRA)
Thanks for opening this one @tokee. I agree that the current encoding is not great; it was mostly a minor evolution of the previous dense format so that things would not work too badly with sparse data. We should probably re-think it entirely. I know Robert suggested we should try to encode things more like postings, e.g. with skip data.
Unfortunately this lookup-table cannot be generated upfront when the writing of values is purely streaming
Since this lookup table would be loaded into memory, we would typically write it into the meta file (extension: dvm) rather than into the data file (extension: dvd) so I don't think it would be an issue in practice.
It would be far better to generate the rank-structure at index time and store it immediately before the bitset (this is possible with streaming as each block is fully resolved before flushing), but of course that would require a change to the codec.
+1 FWIW I don't have any objections to modifying the current format for doc values, I think it's much needed. Happy to help if you need guidance.
Toke Eskildsen (@tokee) (migrated from JIRA)
Thank you for the suggestions, @jpountz. I am not at home in this part of the code base, so guidance is much appreciated. I will look into how postings work.
Putting the block index in the separate meta file sounds like the right solution. What about putting the rank structure in there too? That way the existing data file would keep its format, and if a setup chooses not to use the lookup structures (time will show whether they make sense for any setups), their inactive existence will not affect disk caching of the data file. One downside I see is that it would require all the rank structures to be kept in memory instead of being temporarily loaded as part of accessing a DENSE block.
On the other hand, offloading the support structures to the meta file ties in very well with a two-stage approach where the first stage is a query-time improvement for version 7 and the second is a codec change for version 8. I must admit to having a selfish reason for trying to get version 7 to perform better: our index takes 50 CPU-years / 3 realtime months to regenerate, so we would like not to do so.
What is the contract for the slicer? Is seeking strictly forward, so that a binary search for SPARSE can only be done by loading the full block temporarily onto the heap? It would be nice to also have an improvement for SPARSE, if possible.
Adrien Grand (@jpountz) (migrated from JIRA)
What about also putting the rank structure in there too?
In general the rule is that if you plan to have something in memory all the time then it should go to the meta file, and otherwise to the data file. We just need to be careful to not load too much stuff in memory in order to keep the index memory-efficient. Holding the block index in memory (one long every 65k docs) might be reasonable. I'm less sure about the long[32], though if I understand things correctly, having it in memory wouldn't help much compared to putting it at the beginning of the DENSE blocks and loading it only when we know we'll need at least one doc from this block.
For what it's worth, the codec API makes it very easy to deal with backward compatibility, so there would be no problem with completely changing the default doc-value format in a minor release. It doesn't have to wait for 8.0.
What is the contract for the slicer?
It may seek to any legal offset (positive and less than the file length), doesn't need to be strictly forward, could be backward.
David Smiley (@dsmiley) (migrated from JIRA)
Welcome back to Lucene dev @tokee :)
Toke Eskildsen (@tokee) (migrated from JIRA)
Putting the rank structure with the data would have the benefit that it would only be stored for the DENSE blocks: putting it in meta would mean either unused entries for non-DENSE blocks or some sort of sparse representation, adding yet another complicated layer. And yes, if the rank is always used (which would be my guess), there would probably be very little performance difference between having it in memory and having it together with the data that needs to be read anyway.
As for the rank structure itself, it is a trade-off between space and lookups.

From an overhead/performance perspective, the 2560-bit version looks good to me. Unfortunately it does not align well with 65536 bits, so it's a bit messy compared to the very clean 2048-bit one. ...I am probably over-thinking it here. The difference between iterative and rank-based lookups will hopefully be apparent either way.
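To make the space/lookup trade-off concrete, here is a minimal illustrative sketch of the rank idea in Java - not the proposed encoding: the interval of 8 longs per entry and the plain int[] are arbitrary choices for readability, where a packed variant like the long[32] mentioned above would be more compact.

```java
// Illustrative rank cache over one DENSE block's long[1024] bitmap (65536
// bits). prefixCount[g] holds the number of set bits before group g of
// RANK_INTERVAL longs, so rank(index) costs one array read plus at most
// RANK_INTERVAL popcounts instead of scanning up to 1024 longs.
final class DenseRankSketch {
  private static final int RANK_INTERVAL = 8; // 8 longs = 512 bits per entry

  private final long[] bitmap;
  private final int[] prefixCount;

  DenseRankSketch(long[] bitmap) {
    this.bitmap = bitmap;
    this.prefixCount = new int[bitmap.length / RANK_INTERVAL];
    int setBits = 0;
    for (int i = 0; i < bitmap.length; i++) {
      if (i % RANK_INTERVAL == 0) {
        prefixCount[i / RANK_INTERVAL] = setBits;
      }
      setBits += Long.bitCount(bitmap[i]);
    }
  }

  /** Number of set bits in [0, index): the ordinal offset within the block. */
  int rank(int index) {
    int word = index >>> 6; // the long containing bit 'index'
    int group = word / RANK_INTERVAL;
    int result = prefixCount[group];
    for (int i = group * RANK_INTERVAL; i < word; i++) {
      result += Long.bitCount(bitmap[i]); // at most RANK_INTERVAL - 1 words
    }
    // plus the set bits below 'index' within its own long
    return result + Long.bitCount(bitmap[word] & ((1L << (index & 63)) - 1));
  }
}
```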
For what it's worth, the codec API makes it very easy to deal with backward compatibility, so there would be no problem with completely changing the default doc-value format in a minor release. It doesn't have to wait for 8.0.
This surprises me. A bit of a dangerous thing, is it not? There would be no temporarily switching back to a known stable sub-version of Solr if a new release within the same major version turns out to have severe problems.
Toke Eskildsen (@tokee) (migrated from JIRA)
@dsmiley thank you. Let's hope my activeness sticks this time. Why I have been inactive is probably not suitable for Jira :P
Adrien Grand (@jpountz) (migrated from JIRA)
No temporarily switching back to a known stable sub-version of Solr if a new release within the same major version turns out to have severe problems.
This is something that is not guaranteed today, and we actually broke this on many minor releases.
Toke Eskildsen (@tokee) (migrated from JIRA)
An unforeseen slump in firefighting at work means that I found time for this issue.
I implemented the block-lookup structure and the DENSE rank cache for Solr trunk (see attached patch). The patch also seems to apply cleanly to Solr 7.3.1. It uses a singleton to hold the caches, spams stdout with stats on cache creation and in general integrates poorly, but caching-wise it should work well enough.

The implementation does not use a sparse structure to hold the rank cache, making its size linear in the number of documents. Using a sparse representation, this overhead can be near-linear in the number of DENSE blocks instead. Extending this to change the index format would automatically make the rank representation linear in the number of DENSE blocks.

It forcibly enables itself for most IndexedDISI usages by Lucene70DocValuesProducer, meaning that this patch is query-time only: see the stdout log for the first-lookup-after-segment-open cache-building time overhead.
Looking up DocValues is (of course) done throughout Lucene/Solr, so I would expect the impact - if any - to be visible across a range of functions. Requesting large result sets with fields that are stored as DocValues is our primary pain point, but I guess that stats, facets & exports can also benefit.
The perfect fit for this patch is an index with large (millions of documents) segments and fields that are mostly DENSE (defined for < 100% and > 6% of all documents).
I re-ran the test from the JIRA description and got the improvements below. I must stress that our setup is special, with a segment of 300M documents, and that the fields with the largest improvements are DENSE, so for nearly everyone else the benefits will be much lower. I will of course run tests on other indexes later, but I prioritized getting a patch out.
Field | Type | Docs with value | Docs w/ val % | Vanilla speed in docs/sec | Cached speed in docs/sec | Cached / vanilla |
---|---|---|---|---|---|---|
url | String | 307,171,504 | 100% | 12,500 | 20,000 | 160% |
content_type_ext | String | 224,375,378 | 73% | 360 | 14,000 | 3800% |
author | String | 1,506,365 | 0.5% | 1,100 | 25,000 | 2200% |
crawl_date | DatePoint | 307,171,498 | ~100% | 90 | 90 | 100% |
content_text_length | IntPoint | 285,800,212 | 93% | 410 | 22,000 | 5300% |
content_length | IntPoint | 307,016,816 | 99.9% | 100 | 90 | 90% |
crawl_year | IntPoint | 307,171,498 | ~100% | 14,500 | 30,000 | 200% |
last_modified | DatePoint | 6,835,065 | 2.2% | 570 | 30,000 | 5200% |
source_file_offset | LongPoint | 307,171,504 | 100% | 28,000 | 28,000 | 100% |
It is a mystery to me why crawl_date and content_length did not improve, especially considering that content_text_length is so similar to content_length. I will dig further into that.

One of the benefits of jumping directly to blocks & inside blocks is that the disk cache is not stressed by the intermediate steps. This makes synthetic tests even worse than usual. Nevertheless, there is a performance-probe skeleton to play with at TestIndexedDISI.testCacheSpeed.
Kick the tires and let me know your thoughts. Non-responsiveness by me will probably be due to vacation.
Toke Eskildsen (@tokee) (migrated from JIRA)
I investigated further and found out that our fields crawl_date and content_length were both DENSE as well as represented as "blocks of different bits per value" (see Lucene70DocValuesProducer.java for details). It turns out that the same iteration principle is used as in IndexedDISI, so it was possible to apply a variation of the same solution, namely jump tables.

The old retrieval speed for these values was 100 docs/s. With the jump tables applied this increased to 30K docs/s, bringing all our fields up to Solr 4 speed again. Again: we use segments with 300M docs - a 300x speed-up will not be a common case.
I have updated the patch and added the temporary test TestDocValues#testNumericRetrievalSpeed, which demonstrates how retrieval speed for DENSE longs of varying bits per value goes down as index size goes up, and how it is largely unaffected when using jump tables. An interesting observation here is that without the jump tables, merging an index down to 1 segment means that retrieval of DV-fields becomes slower.
The current patch is still search-time only. The Lucene70 codec packs the content really well & with minimal memory overhead and as the jump-tables can be represented as data in the segment files with (qualified guess) minimal search-performance overhead, it seems possible to arrive at a solution that is both space-efficient & fast for large segments.
The patch applies to Lucene/Solr 8 and the caching is enabled by default. It can be used directly, although it does spam stdout. The switch IndexedDISICacheFactory.DEBUG controls the spamming.
Toke Eskildsen (@tokee) (migrated from JIRA)
Patch backported to 7.4, for anyone who wants to try it there.
Toke Eskildsen (@tokee) (migrated from JIRA)
Patch backported to 7.3, for anyone who wants to try it there.
Toke Eskildsen (@tokee) (migrated from JIRA)
The patch implements the jump list suggested in #8640.
Toke Eskildsen (@tokee) (migrated from JIRA)
The master patch is now updated with a sparse representation of the DENSE rank data (the number of indirections is getting pretty high here - I wonder if some shortcutting is possible?). This means all the caching structures are now compact, and a check against one of our web archive shards gave these statistics:
Index: 890G / 307M docs
Fields: Total=59 (blocks: ALL=107374, SPARSE=33427, DENSE=111803, EMPTY=5119), vBPV(long)=4
Overhead: 33198KB, 12 seconds start up
I expect to run some comparative tests on our full web archive pretty soon, but the results will mostly be relevant for other people with large (100GB+) segments. Due to reasons it is not easy for me to run tests on our lesser-scale indexes, but I'll see if I can figure out a way.
Adrien Grand (@jpountz) (migrated from JIRA)
The above numbers are very impressive! Do you have a sense of how much is contributed by the lookup table for block offsets, and how much is contributed by the additional rank data-structure for DENSE blocks?
The current patch is still search-time only. The Lucene70 codec packs the content really well & with minimal memory overhead and as the jump-tables can be represented as data in the segment files with (qualified guess) minimal search-performance overhead, it seems possible to arrive at a solution that is both space-efficient & fast for large segments.
+1 to compute these data-structures at index time and have little search-time memory overhead. That should come for free for the rank data-structure of DENSE blocks which we should be able to encode in the header of DENSE blocks. Regarding the lookup table, I guess we'll need to turn it into a skip list, but that shouldn't be a major problem either?
Toke Eskildsen (@tokee) (migrated from JIRA)
My previous stats script was slightly off. I isolated the block / DENSE structures and got:
Index: 890GB / 51 fields / 307M docs
Blocks: Total 238971 (blocks: ALL=102693, SPARSE=28739, DENSE=102421, EMPTY=5118) / 1866KB
DENSE-Rank: 28118KB (raw DENSE data: 102421*8KB = 819368KB)
BPV(long): 7 / 654KB
Total: 30653KB, 13 seconds start up
DENSE is by far the biggest consumer of cache space in our setup. Interestingly enough, the vBPV caching was the one that gave us by far the biggest benefit, for the few long fields that we have.
I looked at skip lists, both in the MultiLevelSkipListWriter and on Wikipedia (https://en.wikipedia.org/wiki/Skip_list). As I understand it, they are essentially a rank+select implementation that allows for varying-size skips and works well with a linked list that can be modified. The varying size and mutability do not seem to be used/relevant for Lucene.

What I don't really understand is the benefit of a skip list's multi-level approach in this case. How would a skip list be better than the current direct lookup in the array of longs representing offset+bitcount? If the point is to save further memory, the block offsets could be stored for every 4th block or so, just as skip lists do. But the current overhead of 2MB for a rather large segment does not seem problematic to me, and it does mean that 0 superfluous blocks need to be seeked.
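For reference, the direct lookup being defended here amounts to something like the following sketch (hypothetical names and two parallel arrays for readability; the patch itself packs offset+bitcount into a single long[] as described above):

```java
// Hypothetical sketch of the per-block jump entries. Block b covers docIDs
// [b << 16, (b + 1) << 16); storing each block's start offset plus the number
// of set docIDs before it makes advancing to any target docID an O(1) array
// read instead of a block-by-block walk.
final class BlockJumpTableSketch {
  private final long[] blockOffset; // where block b starts in the slice
  private final int[] ordinalBase;  // set docIDs (value ordinals) before block b

  BlockJumpTableSketch(long[] blockOffset, int[] ordinalBase) {
    this.blockOffset = blockOffset;
    this.ordinalBase = ordinalBase;
  }

  long offsetOf(int targetDocID) {
    return blockOffset[targetDocID >>> 16];
  }

  int ordinalsBefore(int targetDocID) {
    return ordinalBase[targetDocID >>> 16];
  }
}
```

At one entry per 65536 docIDs, even a 300M-document segment needs fewer than 5000 entries per field, which is why the ~2MB overhead mentioned above stays modest.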
New point: I would very much like to make this issue two-step.
As the performance regression gets worse linearly with segment size, it seems plausible that the people who will benefit the most from the patch are also people for whom a full re-index is not trivial. From my point of view, search-time caching should be present for existing segments, independently of codec changes for future segments. The current patch needs polishing, but is function-wise ready and does exactly this.
As the search-time cache is non-destructive, rolling this as step 1 would be a conservative update with easy rollback. Step 2 is of course to change the codec to embed the caching structures, if they prove their worth.
Toke Eskildsen (@tokee) (migrated from JIRA)
The logic for selecting the sparse representation when packing the DENSE structure was inverted, resulting in needless memory overhead. Fixed with the latest patch.
Toke Eskildsen (@tokee) (migrated from JIRA)
Some independent verification/debunking of this performance issue would be highly appreciated.
The patch speeds up DocValues retrieval, and the larger the segments, the greater the benefit. An index with 50M+ documents, using DocValues when returning documents (preferably with one or more int/long/date fields), would be ideal. Checking that it does not hurt performance for smaller indexes also seems important.
I'll port the patch to any Lucene/Solr 7.x-version if it helps.
Adrien Grand (@jpountz) (migrated from JIRA)
Indeed I mentioned skip lists because this is the natural implementation for a forward-only iterator as it requires very little memory. You are right though that given the size of the blocks, memory usage of a long[] is reasonable in spite of the linear dependency on maxDoc, especially if we only index every N-th block.
it seems plausible that the people that will benefit the most from the patch are also people where a full re-index is not trivial
Reindexing is not necessary, users can use IndexUpgrader if they want to upgrade to the latest index formats. I'd rather work directly on a new doc values format with faster advancing.
Checking that it does not hurt performance for smaller indexes also seems important.
I'm curious about the impact on faceting or sorting on the entire index too. Since this patch adds more conditions to speed up advancing by large intervals, I'd expect it to also slow down usage patterns that consume most documents a bit? Hopefully by very little.
Toke Eskildsen (@tokee) (migrated from JIRA)
Regarding overhead:

The advance/advanceExact for blocks in IndexedDISI still has the old check for whether the current block is the right one: https://github.com/tokee/lucene-solr/blob/lucene8374/lucene/core/src/java/org/apache/lucene/codecs/lucene70/IndexedDISI.java#L176-L219 The difference between skipping to the next block using the old blockEnd and skipping by looking up the offset in the long[] is tiny, but I guess the case "the wanted block is the block immediately after the current one" could be micro-optimized.

advanceWithinBlock and advanceExactWithinBlock are only modified for DENSE. They always involve a call to rankSkip. Looking at the code, I see that it stupidly does not take advantage of the current IndexedDISI state and always does the somewhat costly rank calculations and seek, even when it would be faster to just go ahead the old way. I will fix that. Thanks for bringing it to my attention.

The variable bits per value (vBPV) structures used by numerics are equivalent to the DISI blocks: if the current vBPV block is the right one, there is no overhead. Getting a subsequent vBPV block offset is a simple lookup in a long[]. As with DISI blocks, it could be optimized for the vBPV block immediately following the current one, and it might be worth it if the offset cache is read from index data and not from memory. It is a very simple optimization, so I'll just do it.
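In sketch form, the vBPV jump described here is equally simple (hypothetical code; the real reader resolves the offset against the underlying slice, and the block size of 16384 values matches the numeric blocking mentioned elsewhere in this issue):

```java
// Hypothetical sketch of the vBPV jump-table. Numeric values are stored in
// blocks of 16384 values, each block with its own bits-per-value, so block
// starts are normally found by walking the blocks sequentially; one long of
// start offset per block turns that into a direct lookup.
final class VBPVJumpTableSketch {
  private static final int BLOCK_SHIFT = 14; // 2^14 = 16384 values per block
  private final long[] blockOffset;

  VBPVJumpTableSketch(long[] blockOffset) {
    this.blockOffset = blockOffset;
  }

  /** Start offset of the block containing the given value ordinal. */
  long offsetFor(long valueOrdinal) {
    return blockOffset[(int) (valueOrdinal >>> BLOCK_SHIFT)];
  }
}
```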
I'll think about search-time vs. index-time a bit and come back to you on that.
Toke Eskildsen (@tokee) (migrated from JIRA)
Added a patch against master with the performance optimization for sequential access discussed earlier in this JIRA. The patch also fixes a critical bug for Lucene70NormsProducer in the previous patch, where the IndexedDISICache could in some cases be shared between fields.
Toke Eskildsen (@tokee) (migrated from JIRA)
Fixes an edge-case bug where the last 16384 values of a numeric DocValues field all had the value 0. Adds a temporary Solr hack for easy toggling of the LUCENE-8374 caches, for easy performance comparisons.
The caches to use can be controlled by adding lucene8374=<caches> to Solr REST calls, where <caches> is all, some or none of [norm, block, dense, vbpv, debug], e.g. ...select?q=foo&wt=json&lucene8374=block,debug. They can (of course) also be controlled when using Lucene directly, through the class IndexedDISICacheFactory.
Tim Underwood (@tpunder) (migrated from JIRA)
I just tested this out and am seeing a really good Solr faceting performance increase! One of the facet-heavy queries that I've been using for testing on SOLR-12878 and SOLR-12882 went from ~88 queries per second to ~190 queries per second!
Index Stats:
Total Docs: 83,846,867 (~1 million parent docs and the rest are child documents)
Index Size (on disk): 30.53 GB
Query Stats:
The query I'm using performs JSON facets against the child documents on 41 IntPoint fields and makes up one of the queries used to generate this page: https://www.opticatonline.com/search?bv=18220&region=usa . The facets are shown on the left side of the page. The "Part Type" (single-valued) and "Assembly Group" (multi-valued) fields are populated for every document. The "Linkage Criteria" make up the rest of the 41 fields and are sparsely populated with data (if at all).
I can share more details if interested.
When I get a chance I'll test this on my other large index that has less overall documents (8 million total) but they are all densely populated with multi-valued string fields that are used for faceting.
Toke Eskildsen (@tokee) (migrated from JIRA)
That is great news, @tpunder. Thank you very much for testing!
I was pleasantly surprised by your relatively large performance increase, as my own tests had shown a more modest increase for faceting. Visiting your site, I see that your faceting is much richer than my simple facets, which would explain it. I guess that my base response time (query match + logistics) is relatively large - I should try to make a test with more facets (and one with fewer) to probe that.
I doubt that you will see much difference with your 8M docs index, but I would love to be proven wrong on that.
Could you tell me how much of your index data are memory-cached?
Anyway, your test verifies that the patch is relevant for other real-world setups, so I'll try and move forward with it. Thanks again.
Toke Eskildsen (@tokee) (migrated from JIRA)
Ported to Lucene/Solr 7.5.
Toke Eskildsen (@tokee) (migrated from JIRA)
Ported to Lucene/Solr 7.3.
David Smiley (@dsmiley) (migrated from JIRA)
Toke, the JIRA "fix version" should reflect the Git branches you commit to, or plan to commit to. And 7.5 and 7.3 are not branches you are allowed to commit to since this is not a bug fix. Once you do commit to branch_7x, you can put a fix version of "7.6" if 7.6 release branch hasn't been created yet. If the release branch has been created, you can do 7.7 even if 7.7 is never ultimately released. We don't have a "trunk" but we do have a "master". BTW I don't bother assigning "master" to fix version if I'm also going to commit to another branch, since it's implied. In other words, it's implied that if the fix version is 7.6 that the feature/fix will be fixed in 8 as well.
Toke Eskildsen (@tokee) (migrated from JIRA)
@dsmiley the idea for making patches for older versions was to make it easier to measure its effect on existing setups: Testing an existing Solr 7.3 vs. a patched Solr 7.3 is much cleaner than testing a Solr 7.3 vs. patched master.
If that collides with the established workflow here, can you suggest how I can support easy testing?
David Smiley (@dsmiley) (migrated from JIRA)
Feel free to post version-specific patches and/or create feature branches if it suits you. I'm just telling you how to use the JIRA "fix version" field; that's all. I know you're new to community development here, so I'm just trying to help out.
Tim Underwood (@tpunder) (migrated from JIRA)
@tokee You are correct: I have not seen any noticeable performance differences on my index with 8M docs. However, I've been looking into converting it into parent/child docs, which would expand it to ~300 million total docs, so I suspect this patch would help performance in that case.
I've included some updated numbers from re-running my tests and making use of the lucene8374 url parameter to enable/disable the caches.
I've been running my informal benchmarking on my laptop which has 32GB of RAM. Activity Monitor reports 7.5GB of "Cached Files" and I see very little disk activity when running my tests so I suspect everything needed for faceting is in memory.
Index Size: 35.41 GB
Total Documents: 84,159,576
Documents matching query: 3,447
lucene8374 Caches | Requests per second |
---|---|
All Disabled | 88/second |
All Enabled | 266/second |
Note: I'm using Apache Bench (ab) for this test with a static query, concurrency of 10, and 5,000 total requests. This is really just testing the performance of calculating the facets since the set of matching documents should be cached by Solr.
Index Size: 35.41 GB
Total Documents: 84,159,576
Documents Matching Query: 73,241,182
lucene8374 Caches | Time for Single Request |
---|---|
All Disabled | ~117 seconds |
All Enabled | ~117 seconds |
Note: This test is so slow that I only run one query at a time. This is NOT an actual use case for me. I was just curious if there was any performance difference.
Toke Eskildsen (@tokee) (migrated from JIRA)
@dsmiley I think I got it now, thanks. I was confused by the upload dialog, where the Fix Version is available: I thought I should set it to reflect the concrete upload (e.g. a 7.3 patch for testing) rather than the overall issue. I will adjust it to 7.6, with master implied. I will also consider making a feature branch: if this is to be part of the codec instead of search-time only, I'll probably need assistance to make it happen. Thank you for your help.
@tpunder Thank you for the follow-up. I would expect a parent/child structure with 300M documents to fit extremely well with the patch. Your new information supports that there are non-trivial gains even when everything is disk-cached (yay!) and that small result sets benefit the most (expected, but nice to verify).
As for the "Faceting on most of the index", I guess all the time is spend on calculating the facets and that any speed-up for the retrieval of the facet values themselves are insignificant compared to the overall time. While the patch is also relevant for the calculation phase (to determine the ordinals for the facet values for the documents), a match for nearly all documents means a sequential iteration with extremely small gaps where the jump tables are not used.
It is nice to see that there is no penalty for having the optimizations enabled, though strictly speaking one needs to compare unpatched vs. patched rather than patch disabled/enabled to be sure that the if (patch_enabled) check does not in itself make a difference (and to be sure I did not mess up elsewhere). But measuring tiny differences between individual runs is of course quite hard.
Tim Underwood (@tpunder) (migrated from JIRA)
For what it's worth my original testing was with unpatched vs patched.
I did some quick YourKit CPU sampling on my 2 tests with caches enabled/disabled. I've excluded "[Wall Time] java.lang.Thread.sleep(long)" which is why the "CPU (What-If)" tab is active.
I should also note that all of my testing (both caches enabled and disabled) has been done with 3 other patches applied (in addition to this one): SOLR-12875 (ArrayIndexOutOfBoundsException in JSON Facets), SOLR-12878 (large performance issue in FacetFieldProcessorByHashDV) and SOLR-12882 (lambda memory allocation removal in FacetFieldProcessorByHashDV.collectValFirstPhase).
Note: Image thumbnails do not seem to be working right so I've included the file name so you can find the right image attachment that goes with the smaller resized image.
image-2018-10-24-07-30-06-663.png
image-2018-10-24-07-30-56-962.png
start-2018-10-24_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
start-2018-10-24-1_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
Toke Eskildsen (@tokee) (migrated from JIRA)
@tpunder Profiling, nice! Must. Resist. Urge. To. Look. At. MonotonicDocValues :)
The differences between the profiles for "Faceting on most of the index" seem like jitter to me. Good to verify that.

"Faceting on data for a single vehicle" has multiple interesting measurements:

- FieldCacheImpl#Cache.get takes 68,485ms for cached and only 4,792ms for non-cached. The LUCENE-8374 caches are lazy-created, so I would expect the start-up overhead for them to be paid under SparseNumericDocValues.advanceExact or similar, but not as part of requesting the cache. I wonder why there is such a difference?
- SparseNumericDocValues.advanceExact is reduced by a factor of 6 with caching. This factor should scale with the number of documents in each segment: if you try the 300M parent/child structure and the number of segments stays about the same, I would expect the factor to increase to 15-20.
- IndexedDISI.advanceBlock + IndexedDISI.rankSkip for cached is 46,540ms, while IndexedDISI.advanceBlock for non-cached is 79,185ms, telling me that the BLOCK cache is a noticeable win in itself. Glad to see that, as my artificial test showed a lesser win.
- IndexedDISI.advanceExactWithinBlock is not visible in any of the profiles. Could it be that practically all your documents have values defined for the fields used for faceting?

In the current patch, statistics for LUCENE-8374 are written to stdout, which Solr collects in the logfile solr-XXXX-console.log. If possible, I would like to see the content of that file, as it contains a break-down of the DocValues fields, thanks.
Tim Underwood (@tpunder) (migrated from JIRA)
@tokee I've attached the logs for both the single-vehicle case and the entire-index case. For each case I started up the server, ran a single request (either the single-vehicle or entire-index query) and then shut down the server.
I haven't had a chance to dig into the CPU sampling results or your comments about them yet. But I will :) . If I find some time I'll run some non-sampled CPU profiling and maybe some memory allocation profiling too :) . Speeding up anything related to faceting would be a win for most of my use cases.
Toke Eskildsen (@tokee) (migrated from JIRA)
@tpunder It might be that the sampling is the reason advanceExactWithinBlock does not show up. Don't fret about it - I'll try some profiling experiments myself to see what's up with that.

Your logs are very informative, thanks:

Segments: 41
Fields: Total=41, vBPV(long)=30 (blocks: 128095)
DISI-blocks: ALL=41, DENSE=76793, SPARSE=123, EMPTY=0
Overhead: 20.98 MB, 412 ms startup
The patch introduces jumps for DISI-blocks, DENSE and vBPV. As most of your fields have this exact combination, your index is very receptive to the optimizations. Good for you, and a caveat that this level of speed-up is not to be expected generally.
That is a lot of numerics for a front end that uses string facets. I guess you represent the facet entries as IDs and look up the string representations from another source? Is this because you found string faceting to be too slow? I tried looking at DocValues string resolving, but there did not seem to be any easy gains like the one for numerics.
Tim Underwood (@tpunder) (migrated from JIRA)
Yes, for that index I think almost everything[1] is indexed as Int IDs and then the entities they represent are looked up and converted to strings before being displayed on the front end. I don't think I ever considered or tried String fields for those since the Int IDs are a natural fit. We also have a few cases where we attempt to apply translations to things like the Part Type dropdown, so we need to know the ID anyway (e.g. https://www.opticatonline.com/search?bv=18220&region=usa&lang=es-MX).
My other index[2] (with ~8 million docs) makes more use of String fields, but that is mostly due to not using parent/child docs and needing to make use of facet prefix filtering to match values for a specific vehicle. For example, a value might look like "5411/1004" where "5411" represents the id of the vehicle I'm filtering to and "1004" represents the type of part. If I ever convert that index to parent/child docs then I could convert a lot of those fields to ints.
[1] - The Brands are actually indexed as their 4-character ID string (e.g. "BBHK" for the brand "Bosch")
[2] - I don't think I have any good non-login-protected examples of this index. This one has a very limited view of the data (if you have a non-North-American country selected): https://propartsnet.opticatonline.com/search?ltt=pc&ltid=5411&lcc=DK&cc=DK&lang=da&bn=100019 . It works very similarly to the www.opticatonline.com site, except it uses different underlying data for the non-US/MX/CA countries.
Toke Eskildsen (@tokee) (migrated from JIRA)
We've used the patch in production for a few months and found zero issues. @tpunder was kind enough to provide independent verification of its impact, for a case where the jump-tables are activated as well as one where they aren't (to check for performance regression). To me this seems solid enough to move forward.
I understand the recommendation from @jpountz to bake this into the DocValues codec. From my perspective, it makes sense to also make a search-time version for the current codec, which is what the current patch does. I have a few reasons for that.

Unless there is direct opposition to adding the search-time patch to Lucene/Solr master (with a back-port to 7), I will go ahead with this process. For that I need some help.
David Smiley (@dsmiley) (migrated from JIRA)
On #1: I have a stronger preference for removing it entirely, as I imagine the option would be less self-contained (i.e. yes, it would add clutter).

Thanks for the contribution Toke! The sooner this gets committed, the sooner Jenkins starts beating on it.
Uwe Schindler (@uschindler) (migrated from JIRA)
Instead of printing messages to stdout, you may add statistics using the Lucene-internal logging of IndexWriter. But this should still be reduced to important stuff that might be helpful when debugging the indexing chain, and for end users who want to see merges. This could be an option, but I do not know how verbose it is now.
If you look around other codecs, most debugging output is just commented out.
Tim Underwood (@tpunder) (migrated from JIRA)
@tokee Here is a delayed follow up on your FieldCacheImpl#Cache.get observations:
At first I was a little confused about why the FieldCache was showing up at all, since I have docValues enabled on almost everything in order to avoid the uninverting. However, looking at the Solr cache stats page shows the root field showing up in the field cache. That makes sense, since I don't have docValues=true specified for it, and also since I'm requesting the "uniqueBlock(root)" count for each of my facet fields (since I only care how many parent documents match, not how many children).
Anyway, as to why it shows up in the CPU sampling as taking so much time: my best guess is that it has something to do with the synchronized blocks in FieldCacheImpl#Cache.get. As an experiment (which ignores the weak keys) I swapped out the WeakHashMap<> (which uses nested HashMaps) for a ConcurrentHashMap with nested ConcurrentHashMaps, in order to allow me to get rid of the synchronized blocks. After doing that, FieldCacheImpl#Cache.get disappeared from the CPU sampling. There may have been a minor performance increase, but it certainly wasn't close to the 68,485ms that showed up in the original profiling. So it might have just been an artifact of the interaction between the CPU sampling and the synchronized blocks.
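In sketch form, the experiment looks something like this (deliberately generic key types and a pluggable loader; the real FieldCacheImpl keys, value creation and weak-key semantics differ):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

// Sketch of swapping the synchronized nested-map lookup for lock-free
// ConcurrentHashMaps. computeIfAbsent provides get-or-create semantics
// without synchronized blocks, but as noted above this ignores the weak
// keys that let entries disappear when a reader is closed.
final class UnsyncedCacheSketch<R, K, V> {
  private final ConcurrentHashMap<R, ConcurrentHashMap<K, V>> cache = new ConcurrentHashMap<>();
  private final BiFunction<R, K, V> loader;

  UnsyncedCacheSketch(BiFunction<R, K, V> loader) {
    this.loader = loader;
  }

  V get(R readerKey, K entryKey) {
    return cache.computeIfAbsent(readerKey, r -> new ConcurrentHashMap<>())
                .computeIfAbsent(entryKey, k -> loader.apply(readerKey, entryKey));
  }
}
```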
Perhaps I'll go back and play with it some more and try swapping in Guava's Cache in order to make the weak keys work properly. Or maybe I'll try enabling docValues on my root field to see what that does.
Toke Eskildsen (@tokee) (migrated from JIRA)
Thank you, @dsmiley & @uschindler. The jump-tables are field-oriented, so the amount of output is currently #segments * #DocValue_fields * #reopens = verbose. Much too fine-grained for what Uwe describes. I'll remove it all.
Same goes for the options for enabling & disabling the caches. Should it be relevant at a later point, that part is quite easy to re-introduce.
Toke Eskildsen (@tokee) (migrated from JIRA)
Cleaned patch (no output, no on/off switching) as suggested by @uschindler and @dsmiley. The patch applies to master as of 2018-11-07 and passes ant precommit, ant test and ant test-forbidden-apis. The core code is the same as LUCENE-8374.patch, so anyone wanting to play with turning caching on/off should use that instead.

This patch is what I hope to commit to master. Or rather: this is the first version of what I hope to commit, as I would be grateful if anyone would take a look at it and tell me if I messed up somewhere. I am not too worried about functionality, and Jenkins helps there. It is mostly architecturally that I am on shaky ground.
Lucene/Solr QA (migrated from JIRA)
+1 overall

Vote | Subsystem | Runtime | Comment |
---|---|---|---|
 | Prechecks | | |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
 | master Compile Tests | | |
+1 | compile | 0m 32s | master passed |
 | Patch Compile Tests | | |
+1 | compile | 0m 30s | the patch passed |
+1 | javac | 0m 30s | the patch passed |
+1 | Release audit (RAT) | 0m 30s | the patch passed |
+1 | Check forbidden APIs | 0m 30s | the patch passed |
+1 | Validate source patterns | 0m 30s | the patch passed |
 | Other Tests | | |
+1 | unit | 13m 36s | core in the patch passed. |
 | Total | 16m 55s | |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8374 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947230/LUCENE-8374.patch |
Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / 1b084db |
ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
Default Java | 1.8.0_191 |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/120/testReport/ |
modules | C: lucene/core U: lucene/core |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/120/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
Toke Eskildsen (@tokee) (migrated from JIRA)
I have cleaned up the previous patch (removed dead code, commented-out lines etc.). Reviewing 2000+ patch lines is a daunting task, so instead of trying to get anyone to commit to that, let me present the different parts - if anything sounds odd, don't hesitate to comment. I plan to push it to master in the near future.
2 new utility classes are used for creating and accessing compact representations of sparse arrays of whole numbers:

- RankBitSet.java uses a rank function to provide O(1) rank (the number of set bits) for a given index in a bitmap. It is hard to grok, but there is not much active code and it was used previously in SOLR-5894, which we ran in production for 2 years.
- LongCompressor is basically glue around the greatest-common-divisor code from Lucene70DocValuesConsumer and RankBitSet.java. There's a monkey test as well as an explicit sparse-oriented test for it. The code seems solid to me.

3 new classes are used to hold the generated jump-tables:

- IndexedDISICacheFactory.java is a simple pool holding all the jump-tables for a given segment. Each Lucene70DocValuesProducer.java creates one of these when it itself is created and consequently releases it when the segment is discarded. The jump-table structures could also have been stored along with general caching, but as they are always-on and as it makes little sense to try to adjust size or other parameters for them, I chose to make them integral to Lucene70DocValuesProducer.java.
- IndexedDISICache.java builds & holds the jump-tables and index-tables for block jumps in IndexedDISI as well as the rank for DENSE blocks. Both representations are in principle fairly simple, but it is a bit tricky to review all the shifts & masks.
- VaryingBPVJumpTable is a simple jump-table for numerics with varying bits per value. Building it is very similar to looking up a value in the last vBPV-block.

2 classes are changed to use the caches:

- Lucene70NormsProducer.java only uses the IndexedDISICache and the changes are quite trivial.
- Lucene70DocValuesProducer.java has the same straightforward use of IndexedDISICache as the norms producer, but also uses the VaryingBPVJumpTable where relevant. To avoid redundant code, the inner helper class VaryingBPVReader handles the general task of reading from a vBPV-slice in a cache-aware manner.

Lucene/Solr QA (migrated from JIRA)
+1 overall

Vote | Subsystem | Runtime | Comment |
---|---|---|---|
 | Prechecks | | |
+1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
 | master Compile Tests | | |
+1 | compile | 1m 26s | master passed |
 | Patch Compile Tests | | |
+1 | compile | 0m 19s | the patch passed |
+1 | javac | 0m 19s | the patch passed |
+1 | Release audit (RAT) | 0m 19s | the patch passed |
+1 | Check forbidden APIs | 0m 19s | the patch passed |
+1 | Validate source patterns | 0m 19s | the patch passed |
 | Other Tests | | |
+1 | unit | 30m 26s | core in the patch passed. |
 | Total | 37m 1s | |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8374 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947631/LUCENE-8374.patch |
Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / 42ee966 |
ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
Default Java | 1.8.0_191 |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/121/testReport/ |
modules | C: lucene lucene/core U: lucene |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/121/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
Adrien Grand (@jpountz) (migrated from JIRA)
I'd really like to bake it into the codec and avoid computing the cache dynamically. If someone really needs this feature with the current codec, they could make a custom build that applies your patch?
As far as reviewing is concerned, maybe it would help to split the improvements for skipping over blocks and skipping within blocks into two different patches (or commits)? That might also help dig into test failures or performance issues in the future, since git bisect would point to a smaller commit.
Toke Eskildsen (@tokee) (migrated from JIRA)
@jpountz I do not understand your resistance to providing an out-of-the-box performance improvement for Lucene-based search engines. I do not see index upgrades as trivial exercises, and I have previously encountered quite a lot of resistance to trying patched versions of Lucene/Solr. As a developer it is probably second nature to you. For a lot of users, not so much.
Your patch-splitting suggestion is interesting. The three parts [DISI-blocks, DENSE and variable bits per value] could reasonably easily be split into [DISI-blocks, DENSE] and [variable bits per value]. Or they could, not so easily, be split into all 3 parts. They would be additive though (as opposed to independent), as they share a lot of structure. If splitting helps acceptance, I'll try to commit that way. I'm also game to make a split patch, if someone commits to doing a review.
Note: I am aware that Lucene/Solr 7.6 is in pre-release and that larger patches should not be committed to master until 7.6 has been released.
Adrien Grand (@jpountz) (migrated from JIRA)
I do not understand your resistance against providing an out-of-the-box performance improvement to Lucene-based search engines.
I have nothing against making performance better, quite the opposite! But as often there are trade-offs and this patch spends significant CPU and I/O when opening the reader in order to build a cache while the same information could be computed at index time. This might be fine for static readers, but it doesn't sound like a net win otherwise.
I do not see index-upgrades as trivial exercises and I have previously encountered quite a lot of resistance against trying patched versions of Lucene/Solr.
Sure, I agree patched versions have lots of drawbacks. I was suggesting this workaround in the case upgrading was a no-go given that this is more acceptable to me than adding such caching to the codec.
If splitting helps acceptance, I'll try and commit that way.
Thanks. I'm mostly interested in splitting from the perspective of tracking regressions. For instance it will be easier to track which change a slowdown/speedup relates to in Mike's nightly benchmarks.
I'm also game to make a split patch, if someone commits to do a review.
I'll take care of reviews.
Note: I am aware that Lucene/Solr 7.6 is in pre-release and that larger patches should not be committed to master until 7.6 has been released.
I don't think this is an issue. Nick already cut branch_7_6, so the 7.6 release shouldn't impact work that we are doing on the master and 7.x branches?
Toke Eskildsen (@tokee) (migrated from JIRA)
this patch spends significant CPU and I/O when opening the reader
This patch spends the IO equivalent of making a single worst-case search matching 1/16384th of the documents in the newly opened segment, sorting on a numeric DocValues field, for each used DocValues field. I am hard pressed to imagine a setup where that IO penalty wouldn't pay off very quickly.
As for CPU, it does perform a compression of the DENSE data structure in order to lower the memory overhead. In Tim's setup I could not get the timing for his smaller segments, as they took less than 1ms. For his largest segment (19M documents) it took a total of 23ms across 41 DocValues fields. For Tim, the startup overhead (a total of 412ms across all segments & fields) was paid off in less than a second of operation.
Based on this, I disagree with your assertion of "significant" performance overhead at search time. It might turn out to be significant when applied broadly or tested in depth, but at this point neither theory nor observations suggest so.
For instance it will be easier to track which change a slowdown/speedup relates to in Mike's nightly benchmarks.
As discussed on Mike's GitHub space, his nightly benchmarks are not really tracking this part (large-jump access of DocValues) of Lucene. If they were, an alarm would have been raised on the switch to the iterative API.
It is quite understandable that such tests are missing, as large indexes require a lot of computing resources and/or time. How to quickly catch such regressions in the future, without spending a lump sum on hardware, is a challenge that I have no ready answers for.
In its current form, Mike's benchmarks seems fine for tracking LUCENE-8374 regressions in the areas where there is no expectation of a speed-up, which is an important part of testing the value of an attempted optimization.
I'll take care of reviews.
Thank you, I appreciate that.
As for the wait-for-7.6, I just based it on the note in https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/ajax/%3CCABM%2Bu9JU%3DQFshN_juicOohU8dZQBtZ0Nc7Zs79_Wt9wZRjTVqA%40mail.gmail.com%3E about big changes. I am happy to classify this patch as a non-big change.
Toke Eskildsen (@tokee) (migrated from JIRA)
As suggested, I have divided the patch into 4 parts:

1. IndexedDISI block skips (every 65536 docIDs, low impact)
2. IndexedDISI DENSE block rank (high impact)
3. Jump tables for numerics with variable bits per value (vBPV)
4. Lucene70NormsProducer
Collectively this is the same as the full LUCENE-8374 patch. The split is for easier review and to add the changes as 4 commits, for easier bisection in search of errors or other surprising behaviour.
I will let this sit here for a few days. If nothing happens, I plan to push to master to let Jenkins play with it.
ASF subversion and git services (migrated from JIRA)
Commit 58a7a8ada5cebeb261060c56cd6d0a9446478bf6 in lucene-solr's branch refs/heads/master from @tokee https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=58a7a8a
LUCENE-8374 part 1/4: Reduce reads for sparse DocValues
Offset and index jump-table for IndexedDISI blocks.
The Lucene70DocValuesProducer has the internal classes SparseNumericDocValues and BaseSortedSetDocValues (sparse code path), which again use IndexedDISI to handle the docID -> value-ordinal lookup. The value-ordinal is the index of the docID assuming an abstract, tightly packed, monotonically increasing list of docIDs: if the docIDs with corresponding values are [0, 4, 1432], their value-ordinals will be [0, 1, 2].

Outer blocks

The lookup structure of IndexedDISI consists of blocks of 2^16 values (65536), where each block can be either ALL, DENSE (2^12 to 2^16 values) or SPARSE (< 2^12 values ~= 6%). Consequently blocks vary quite a lot in size and ordinal-resolving strategy.

When a sparse Numeric DocValue is needed, the code first locates the block containing the wanted docID flag. It does so by iterating blocks one-by-one until it reaches the needed one, where each iteration requires a lookup in the underlying IndexSlice. For a common memory-mapped index, this translates to either a cached request or a read operation. If a segment has 6M documents, the worst case is 91 lookups. In our web archive, our segments have ~300M values: a worst case of 4577 lookups!

One obvious solution is to use a lookup-table for blocks: a long[] array with an entry for each block. For 6M documents, that is < 1KB and would allow for direct jumping (a single lookup) in all instances. Unfortunately this lookup-table cannot be generated upfront when the writing of values is purely streaming. It can be appended to the end of the stream before it is closed, but without knowing the position of the lookup-table the reader cannot seek to it.

One strategy for creating such a lookup-table would be to generate it during reads and cache it for the next lookup. This does not fit directly into how IndexedDISI currently works (it is created anew for each invocation), but could probably be added with a little work. An advantage of this is that it does not change the underlying format and thus could be used with existing indexes.

The lookup structure inside each block

If ALL of the 2^16 values are defined, the structure is empty and the ordinal is simply the requested docID with some modulo and multiply math. Nothing to improve there.

If the block is DENSE (2^12 to 2^16 values are defined), a bitmap is used and the number of set bits up to the wanted index (the docID modulo the block origo) is counted. That bitmap is a long[1024], meaning that the worst case is to look up and count all set bits for 1024 longs!

One known solution to this is to use a rank structure (https://en.wikipedia.org/wiki/Succinct_data_structure). I implemented it for a related project (https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java) and with that, the rank overhead for a DENSE block would be long[32] and would ensure a maximum of 9 lookups. It is not trivial to build the rank structure, and caching it (assuming all blocks are dense) for 6M documents would require 22 KB (3.17% overhead). It would be far better to generate the rank structure at index time and store it immediately before the bitset (this is possible with streaming, as each block is fully resolved before flushing), but of course that would require a change to the codec.

If SPARSE (< 2^12 values ~= 6%) values are defined, the docIDs are simply in the form of a list. As a comment in the code suggests, a binary search through these would be faster, although that would mean seeking backwards. If that is not acceptable, I don't have any immediate idea for avoiding the full iteration.

I propose implementing query-time caching of both block jumps and inner-block lookups for DENSE (using rank) as the first improvement, and an index-time DENSE-rank structure as a future improvement. As query-time caching is likely to be too costly for rapidly-changing indexes, it should probably be an opt-in in solrconfig.xml.

Some real-world observations

This analysis was triggered by massive (10x) slowdowns of both simple querying and large exports from our web archive index after upgrading from Solr 4.10 to 7.3.1. The query-matching itself takes ½-2 seconds, but returning the top-10 documents takes 5-20 seconds (~50 non-stored DocValues fields), up from ½-2 seconds in total in Solr 4.10 (which used more of a mix of stored vs. DocValues fields, so the numbers might not be directly comparable).

Measuring with VisualVM points to NIOFSIndexInput.readInternal as the hotspot. We ran some tests with simple queries on a single 307,171,504-document segment with different single-value DocValued fields in fl, and got the speeds shown in the "Vanilla" column of the table earlier in this thread.

Note how both url and source_file_offset are very fast and also have a value for all documents. Contrary to this, content_type_ext is very slow and crawl_date is extremely slow; as they both have values for nearly all documents, I presume they are using IndexedDISI#DENSE. last_modified is also quite slow and presumably uses IndexedDISI#SPARSE.

The only mystery is crawl_year, which is also present in nearly all documents but is very fast. I have no explanation for that one yet.

I hope to take a stab at this around August 2018, but no promises.
Migrated from LUCENE-8374 by Toke Eskildsen (@tokee), 6 votes, resolved Dec 11 2018.
Attachments: entire_index_logs.txt, image-2018-10-24-07-30-06-663.png, image-2018-10-24-07-30-56-962.png, LUCENE-8374_branch_7_3.patch, LUCENE-8374_branch_7_3.patch.20181005, LUCENE-8374_branch_7_4.patch, LUCENE-8374_branch_7_5.patch, LUCENE-8374_part_1.patch, LUCENE-8374_part_2.patch, LUCENE-8374_part_3.patch, LUCENE-8374_part_4.patch, LUCENE-8374.patch (versions: 7), single_vehicle_logs.txt, start-2018-10-24_snapshot_Users_tim_Snapshots-_YourKit_Java_Profiler_201702-b75-_64-bit.png, start-2018-10-24-1_snapshot_Users_tim_Snapshots-_YourKit_Java_Profiler_201702-b75-_64-bit.png
Linked issues: #8640, #9631