
Reduce reads for sparse DocValues [LUCENE-8374] #9421

Status: Closed (asfimport closed this issue 5 years ago)

asfimport commented 6 years ago

The Lucene70DocValuesProducer has the internal classes SparseNumericDocValues and BaseSortedSetDocValues (sparse code path), which in turn use IndexedDISI to handle the docID -> value-ordinal lookup. The value-ordinal is the index of the docID assuming an abstract tightly packed monotonically increasing list of docIDs: If the docIDs with corresponding values are [0, 4, 1432], their value-ordinals will be [0, 1, 2].
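
To make the mapping concrete, here is a tiny self-contained illustration (plain Java for this write-up, not Lucene code): the value-ordinal of a docID is simply its position in the sorted list of docIDs that have a value.

```java
import java.util.Arrays;

public class OrdinalExample {
  public static void main(String[] args) {
    // Sorted docIDs that carry a value; their positions are the value-ordinals.
    int[] docIdsWithValues = {0, 4, 1432};
    int ordinal = Arrays.binarySearch(docIdsWithValues, 1432);
    System.out.println("value-ordinal of doc 1432: " + ordinal); // prints 2
  }
}
```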

Outer blocks

The lookup structure of IndexedDISI consists of blocks of 2^16 values (65536), where each block can be either ALL, DENSE (2^12 to 2^16 values) or SPARSE (< 2^12 values ~= 6%). Consequently blocks vary quite a lot in size and ordinal resolving strategy.

When a sparse Numeric DocValue is needed, the code first locates the block containing the flag for the wanted docID. It does so by iterating blocks one-by-one until it reaches the needed one, where each iteration requires a lookup in the underlying IndexSlice. For a common memory-mapped index, this translates to either a cached request or a read operation. If a segment has 6M documents, the worst case is 91 lookups. In our web archive, our segments have ~300M values: a worst case of 4577 lookups!

One obvious solution is to use a lookup-table for blocks: A long[]-array with an entry for each block. For 6M documents, that is < 1KB and would allow for direct jumping (a single lookup) in all instances. Unfortunately this lookup-table cannot be generated upfront when the writing of values is purely streaming. It can be appended to the end of the stream before it is closed, but without knowing the position of the lookup-table the reader cannot seek to it.
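
A minimal sketch of the direct-jump idea, assuming one offset entry per 65536-docID block (the class and method names are made up for illustration):

```java
// Hypothetical per-block jump table: one entry per 65536-docID block.
public final class BlockJumpTable {
  private final long[] blockOffsets; // blockOffsets[i] = slice offset of block i

  public BlockJumpTable(long[] blockOffsets) {
    this.blockOffsets = blockOffsets;
  }

  /** A single array read replaces the iteration over all preceding blocks. */
  public long offsetForDoc(int docID) {
    return blockOffsets[docID >>> 16]; // block id = docID / 65536
  }
}
```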

One strategy for creating such a lookup-table would be to generate it during reads and cache it for next lookup. This does not fit directly into how IndexedDISI currently works (it is created anew for each invocation), but could probably be added with a little work. An advantage to this is that this does not change the underlying format and thus could be used with existing indexes.

The lookup structure inside each block

If ALL of the 2^16 values are defined, the structure is empty and the ordinal is simply the requested docID with some modulo and multiply math. Nothing to improve there.

If the block is DENSE (2^12 to 2^16 values are defined), a bitmap is used and the number of set bits up to the wanted index (the docID relative to the block origin) is counted. That bitmap is a long[1024], meaning that the worst case is to look up and count the set bits of 1024 longs!
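
For concreteness, a sketch of that worst case as a hypothetical helper (not the actual IndexedDISI code): the in-block ordinal is the number of set bits below the wanted position in the bitmap.

```java
// Naive in-block rank: may touch up to 1024 longs for a single lookup.
static int naiveInBlockOrdinal(long[] bitmap, int indexInBlock) { // bitmap.length == 1024
  int word = indexInBlock >>> 6; // which long holds the wanted bit
  int ordinal = 0;
  for (int i = 0; i < word; i++) { // worst case: ~1023 iterations
    ordinal += Long.bitCount(bitmap[i]);
  }
  // Add the set bits below the wanted position inside the final long.
  ordinal += Long.bitCount(bitmap[word] & ((1L << (indexInBlock & 63)) - 1));
  return ordinal;
}
```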

One known solution to this is to use a rank structure (https://en.wikipedia.org/wiki/Succinct_data_structure). I implemented one (https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java) for a related project; with that, the rank-overhead for a DENSE block would be long[32] and would ensure a maximum of 9 lookups. It is not trivial to build the rank-structure, and caching it (assuming all blocks are dense) for 6M documents would require 22 KB (3.17% overhead). It would be far better to generate the rank-structure at index time and store it immediately before the bitset (this is possible with streaming as each block is fully resolved before flushing), but of course that would require a change to the codec.
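
A hedged sketch of the rank idea (the 2048-bit span size and the int[] representation are choices made here for readability; the linked RankCache packs its counts differently):

```java
// Build one pre-summed count per 2048-bit span; 32 spans cover a 65536-bit block.
static int[] buildRank(long[] bitmap) { // bitmap.length == 1024
  int[] rank = new int[32];
  int sum = 0;
  for (int span = 0; span < 32; span++) {
    rank[span] = sum; // set bits in all spans before this one
    for (int w = span * 32; w < (span + 1) * 32; w++) {
      sum += Long.bitCount(bitmap[w]);
    }
  }
  return rank;
}

// Lookup: one rank entry plus at most the 32 longs of the final span,
// instead of counting bits in up to 1024 longs.
static int rankedInBlockOrdinal(long[] bitmap, int[] rank, int indexInBlock) {
  int word = indexInBlock >>> 6;
  int span = indexInBlock >>> 11; // 2048 bits per span
  int ordinal = rank[span];
  for (int w = span * 32; w < word; w++) {
    ordinal += Long.bitCount(bitmap[w]);
  }
  ordinal += Long.bitCount(bitmap[word] & ((1L << (indexInBlock & 63)) - 1));
  return ordinal;
}
```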

If SPARSE (< 2^12 values ~= 6%) are defined, the docIDs are simply in the form of a list. As a comment in the code suggests, a binary search through these would be faster, although that would mean seeking backwards. If that is not acceptable, I don't have any immediate idea for avoiding the full iteration.
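
For reference, the binary search itself is trivial once the block's docID list is available in memory (a hypothetical helper, not the streaming on-disk form):

```java
import java.util.Arrays;

// O(log n) probes through a SPARSE block's sorted docID list instead of a
// linear scan of up to 2^12 entries.
static int sparseInBlockOrdinal(int[] docIDs, int wantedDoc) {
  int idx = Arrays.binarySearch(docIDs, wantedDoc);
  return idx >= 0 ? idx : -1; // -1: the doc has no value in this block
}
```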

I propose implementing query-time caching of both block-jumps and inner-block lookups for DENSE (using rank) as first improvement and an index-time DENSE-rank structure for future improvement. As query-time caching is likely to be too costly for rapidly-changing indexes, it should probably be an opt-in in solrconfig.xml.

Some real-world observations

This analysis was triggered by massive (10x) slowdown problems with both simple querying and large exports from our web archive index after upgrading from Solr 4.10 to 7.3.1. The query-matching itself takes ½-2 seconds, but returning the top-10 documents takes 5-20 seconds (~50 non-stored DocValues fields), up from ½-2 seconds in total with Solr 4.10 (more of a mix of stored vs. DocValues, so it might not be directly comparable).

Measuring with VisualVM points to NIOFSIndexInput.readInternal as the hotspot. We ran some tests with simple queries on a single 307,171,504-document segment with different single-value DocValues fields in fl and got:

| Field | Type | Docs with value | Docs w/ val % | Speed in docs/sec |
|-------|------|-----------------|---------------|-------------------|
| url | String | 307,171,504 | 100% | 12,500 |
| content_type_ext | String | 224,375,378 | 73% | 360 |
| author | String | 1,506,365 | 0.5% | 1,100 |
| crawl_date | DatePoint | 307,171,498 | ~100% | 90 |
| content_text_length | IntPoint | 285,800,212 | 93% | 410 |
| content_length | IntPoint | 307,016,816 | 99.9% | 100 |
| crawl_year | IntPoint | 307,171,498 | ~100% | 14,500 |
| last_modified | DatePoint | 6,835,065 | 2.2% | 570 |
| source_file_offset | LongPoint | 307,171,504 | 100% | 28,000 |

Note how both url and source_file_offset are very fast and also have a value for all documents. In contrast, content_type_ext is very slow and crawl_date is extremely slow; as both have values for nearly all documents, I presume they are using IndexedDISI#DENSE. last_modified is also quite slow and presumably uses IndexedDISI#SPARSE.

The only mystery is crawl_year which is also present in nearly all documents, but is very fast. I have no explanation for that one yet.

I hope to take a stab at this around August 2018, but no promises.


Migrated from LUCENE-8374 by Toke Eskildsen (@tokee), 6 votes, resolved Dec 11 2018. Attachments: entire_index_logs.txt, image-2018-10-24-07-30-06-663.png, image-2018-10-24-07-30-56-962.png, LUCENE-8374_branch_7_3.patch, LUCENE-8374_branch_7_3.patch.20181005, LUCENE-8374_branch_7_4.patch, LUCENE-8374_branch_7_5.patch, LUCENE-8374_part_1.patch, LUCENE-8374_part_2.patch, LUCENE-8374_part_3.patch, LUCENE-8374_part_4.patch, LUCENE-8374.patch (versions: 7), single_vehicle_logs.txt, start-2018-10-24_snapshot_Users_tim_Snapshots-_YourKit_Java_Profiler_201702-b75-_64-bit.png, start-2018-10-24-1_snapshot_Users_tim_Snapshots-_YourKit_Java_Profiler_201702-b75-_64-bit.png

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for opening this one @tokee. I agree that the current encoding is not great; it was mostly a minor evolution of the previous dense format in order for things to work not too badly with sparse data, and we should probably re-think it entirely. I know Robert suggested we should try to encode things more like postings, e.g. with skip data.

Unfortunately this lookup-table cannot be generated upfront when the writing of values is purely streaming

Since this lookup table would be loaded into memory, we would typically write it into the meta file (extension: dvm) rather than into the data file (extension: dvd) so I don't think it would be an issue in practice.

It would be far better to generate the rank-structure at index time and store it immediately before the bitset (this is possible with streaming as each block is fully resolved before flushing), but of course that would require a change to the codec.

+1 FWIW I don't have any objections to modifying the current format for doc values, I think it's much needed. Happy to help if you need guidance.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Thank you for the suggestions, @jpountz. I am not at home in this part of the code base, so guidance is much appreciated. I will look into how postings work.

Putting the block-index in the separate meta file sounds like the right solution. What about also putting the rank structure in there too? That way the existing data file would keep its format and if a setup chooses not to use the lookup structures (time will show if that makes sense for any setups), its inactive existence will not affect disk caching of the data file. One downside I see is that it would require all the rank structures to be kept in-memory instead of being temporarily loaded as part of accessing a DENSE block.

On the other hand, offloading the support structures to the meta file ties very well into a two-stage approach where the first stage is query-time improvement for version 7 and the second is a codec-change for version 8. I must admit to having a selfish reason for trying to get version 7 to perform better: Our index takes 50 CPU-years / 3 realtime months to regenerate, so we would like not to do so.

What is the contract for the slicer? Is seeking strictly forward, so that a binary search for SPARSE can only be done by loading the full block temporarily onto the heap? It would be nice to also have an improvement for SPARSE, if possible.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

What about also putting the rank structure in there too?

In general the rule is that if you plan to have something in memory all the time then it should go to the meta file, and otherwise to the data file. We just need to be careful to not load too much stuff in memory in order to keep the index memory-efficient. Holding the block index in memory (one long every 65k docs) might be reasonable. I'm less sure about the long[32], though if I understand things correctly, having it in memory wouldn't help much compared to putting it at the beginning of the DENSE blocks and loading it only when we know we'll need at least one doc from this block.

For what it's worth, the codec API makes it very easy to deal with backward compatibility, so there would be no problem with completely changing the default doc-value format in a minor release. It doesn't have to wait for 8.0.

What is the contract for the slicer?

It may seek to any legal offset (positive and less than the file length), doesn't need to be strictly forward, could be backward.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Welcome back to Lucene dev @tokee :)

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Putting the rank-structure with the data would have the benefit that it would only be stored for the DENSE blocks: Putting it in meta would mean either unused entries for non-DENSE blocks or some sort of sparse representation, adding yet another complicated layer. And yes, if the rank is always used (which would be my guess), there would probably be very little performance difference between having it in-memory or together with the data that needs to be read anyway.

As for the rank structure itself, it is a trade-off between space and lookups.

From an overhead/performance perspective, the 2560 bit version looks good to me. Unfortunately it does not align well with 65536 bits, so it's a bit messy compared to the very clean 2048 bit one. ...I am probably over-thinking it here. The difference between iterative and rank-based lookups will hopefully be apparent either way.

For what it's worth, the codec API makes it very easy to deal with backward compatibility, so there would be no problem with completely changing the default doc-value format in a minor release. It doesn't have to wait for 8.0.

This surprises me. A bit of a dangerous thing, is it not? It would prevent temporarily switching back to a known stable sub-version of Solr if a new release within the same major version turns out to have severe problems.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

@dsmiley thank you. Let's hope my activeness sticks this time. Why I have been inactive is probably not suitable for Jira :P

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

It would prevent temporarily switching back to a known stable sub-version of Solr if a new release within the same major version turns out to have severe problems.

This is something that is not guaranteed today, and we actually broke this on many minor releases.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Proof of concept

An unforeseen slump in firefighting at work means that I found time for this issue.

I implemented the block-lookup structure and the DENSE rank cache for Solr trunk (see attached patch). The patch also seems to apply cleanly to Solr 7.3.1. It uses a singleton to hold the caches, spams stdout with stats on cache creation and in general integrates poorly, but caching-wise it should work well enough.

The implementation does not use a sparse structure to hold the rank cache, making its size linear in the number of documents. Using a sparse representation, this overhead can instead be made near-linear in the number of DENSE blocks. Extending to change the index format would automatically make the rank representation linear in the number of DENSE blocks.

It forcibly enables itself for most IndexedDISI-usages by Lucene70DocValuesProducer, meaning that this patch is query-time only: See the stdout-log for the first-lookup-after-segment-open cache-building time overhead.

Implications

Looking up DocValues is (of course) used throughout Lucene/Solr, so I would expect the impact - if any - to be visible across a range of functions. Requesting large result sets with fields that are stored as DocValues is our prime point of hurt, but I guess that stats, facets & exports can also benefit.

The perfect fit for this patch is an index with large (millions of documents) segments with fields that are mostly DENSE (defined for <100% and > 6% of all documents).

Real-world observations

I re-ran the test from the JIRA-description and got the improvements below. I must stress that our setup is special with a segment of 300M documents, and that the fields with the largest improvements are DENSE, so for nearly everyone else, the benefits will be much lower. I will of course run tests on other indexes later, but I prioritized getting a patch out.

| Field | Type | Docs with value | Docs w/ val % | Vanilla speed (docs/sec) | Cached speed (docs/sec) | Cached / vanilla |
|-------|------|-----------------|---------------|--------------------------|--------------------------|------------------|
| url | String | 307,171,504 | 100% | 12,500 | 20,000 | 160% |
| content_type_ext | String | 224,375,378 | 73% | 360 | 14,000 | 3800% |
| author | String | 1,506,365 | 0.5% | 1,100 | 25,000 | 2200% |
| crawl_date | DatePoint | 307,171,498 | ~100% | 90 | 90 | 100% |
| content_text_length | IntPoint | 285,800,212 | 93% | 410 | 22,000 | 5300% |
| content_length | IntPoint | 307,016,816 | 99.9% | 100 | 90 | 90% |
| crawl_year | IntPoint | 307,171,498 | ~100% | 14,500 | 30,000 | 200% |
| last_modified | DatePoint | 6,835,065 | 2.2% | 570 | 30,000 | 5200% |
| source_file_offset | LongPoint | 307,171,504 | 100% | 28,000 | 28,000 | 100% |

It is a mystery to me why crawl_date and content_length did not improve, especially considering that content_text_length is so equal to content_length. I will dig further into that.

One of the benefits of direct jumping to blocks & inside blocks is that the disk-cache is not stressed by the intermediate steps. This makes synthetic tests even worse than usual. Nevertheless, there is a performance probe skeleton to play with at TestIndexedDISI.testCacheSpeed.

Kick the tires and let me know your thoughts. Non-responsiveness by me will probably be due to vacation.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I investigated further and found out that our fields crawl_date and content_length were both DENSE as well as represented as "blocks of different bits per value" (see Lucene70DocValuesProducer.java for details). It turns out that the same iteration principle is used as in IndexedDISI, so it was possible to apply a variation of the same solution; namely jump tables.
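
The variation might look like the following sketch (a hypothetical class; the 16384-values-per-block figure matches the block size mentioned in a later comment, the rest of the names are assumptions):

```java
// Hypothetical jump table for "blocks of different bits per value" (vBPV).
// Blocks have irregular byte sizes, so their offsets cannot be computed,
// only recorded and looked up.
public final class VBPVJumpTable {
  private static final int VALUES_PER_BLOCK_SHIFT = 14; // 16384 values per block
  private final long[] blockOffsets; // start offset of each vBPV block

  public VBPVJumpTable(long[] blockOffsets) {
    this.blockOffsets = blockOffsets;
  }

  public long offsetForValue(long valueIndex) {
    return blockOffsets[(int) (valueIndex >>> VALUES_PER_BLOCK_SHIFT)];
  }
}
```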

The old retrieval speed for these values was 100 docs/s. With the jump tables applied, this increased to 30K docs/s, bringing all our fields up to Solr 4 speed again. Again: we use segments with 300M docs - a 300x speed-up will not be a common case.

I have updated the patch and added the temporary test TestDocValues#testNumericRetrievalSpeed, which demonstrates how retrieval speed for DENSE longs of varying bits per value goes down as index size goes up, and how it is largely unaffected when using jump tables. An interesting observation here is that without the jump tables, merging an index down to 1 segment means that retrieval of DV-fields becomes slower.

The current patch is still search-time only. The Lucene70 codec packs the content really well & with minimal memory overhead and as the jump-tables can be represented as data in the segment files with (qualified guess) minimal search-performance overhead, it seems possible to arrive at a solution that is both space-efficient & fast for large segments.

The patch applies to Lucene/Solr 8 and the caching is enabled by default. It can be used directly, although it does spam stdout. The switch IndexedDISICacheFactory.DEBUG controls the spamming.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Patch backported to 7.4, for anyone who wants to try it there.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Patch backported to 7.3, for anyone who wants to try it there.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

The patch implements the jump list suggested in #8640.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

The master-patch is now updated with sparse representation of the DENSE rank-data (the number of indirections is getting pretty high here - I wonder if some shortcutting is possible?). This means all the caching structures are now compact and a check against one of our web archive shards gave these statistics:

Index: 890G / 307M docs
Fields: Total=59 (blocks: ALL=107374, SPARSE=33427, DENSE=111803, EMPTY=5119), vBPV(long)=4
Overhead: 33198KB, 12 seconds start up

I expect to run some comparative tests on our full web archive pretty soon, but the results will mostly be relevant for other people with large (100GB+) segments. Due to reasons it is not easy for me to run tests on our lesser-scale indexes, but I'll see if I can figure out a way.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The above numbers are very impressive! Do you have a sense of how much is contributed by the lookup table for block offsets, and how much is contributed by the additional rank data-structure for DENSE blocks?

The current patch is still search-time only. The Lucene70 codec packs the content really well & with minimal memory overhead and as the jump-tables can be represented as data in the segment files with (qualified guess) minimal search-performance overhead, it seems possible to arrive at a solution that is both space-efficient & fast for large segments.

+1 to compute these data-structures at index time and have little search-time memory overhead. That should come for free for the rank data-structure of DENSE blocks which we should be able to encode in the header of DENSE blocks. Regarding the lookup table, I guess we'll need to turn it into a skip list, but that shouldn't be a major problem either?

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

My previous stats script was slightly off. I isolated the block / DENSE-structures and got

Index: 890GB / 51 fields / 307M docs
Blocks: Total 238971 (ALL=102693, SPARSE=28739, DENSE=102421, EMPTY=5118) / 1866KB
DENSE-Rank: 28118KB (raw DENSE data: 102421*8KB = 819368KB)
BPV(long): 7 / 654KB
Total: 30653KB, 13 seconds start up

DENSE is by far the biggest consumer of cache-space in our setup. Interestingly enough, the vBPV-caching was the one that gave us by far the biggest benefit, for the few long fields that we have.

I looked at skip lists, both in the MultiLevelSkipListWriter and on Wikipedia (https://en.wikipedia.org/wiki/Skip_list). As I understand it, they are essentially a rank+select implementation that allows for varying-size skips and works well with a linked list that can be modified. The varying-size and mutability do not seem to be used/relevant for Lucene.

What I don't really understand is the benefit of a skip list's multi-level approach in this case. How would a skip list be better than the current direct lookup in the array of longs representing offset+bitcount? If the point is to save further memory, the block-offsets could be stored for every 4th block or so, just as skip lists do. But the current overhead of 2MB for a rather large segment does not seem problematic to me, and it does mean that zero superfluous blocks need to be seeked.

New point: I would very much like to make this issue two-step.

As the performance regression gets worse linearly with segment size, it seems plausible that the people that will benefit the most from the patch are also people where a full re-index is not trivial. From my point of view, search-time caching should be present for existing segments, independently of codec-changes to future segments. The current patch needs polishing, but is function-wise ready and does exactly this.

As the search-time cache is non-destructive, rolling this as step 1 would be a conservative update with easy rollback. Step 2 is of course to change the codec to embed the caching structures, if they prove their worth.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

The logic for selecting sparse representation when packing the DENSE structure was switched, resulting in needless memory overhead. Fixed with latest patch.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Some independent verification/debunking of this performance issue would be highly appreciated.

The patch speeds up DocValues-retrieval and the larger the segments, the more the benefit. An index with 50M+ documents, using DocValues when returning documents (preferably with one or more int/long/date-fields), would be ideal. Checking that it does not hurt performance for smaller indexes also seems important.

I'll port the patch to any Lucene/Solr 7.x-version if it helps.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Indeed I mentioned skip lists because this is the natural implementation for a forward-only iterator as it requires very little memory. You are right though that given the size of the blocks, memory usage of a long[] is reasonable in spite of the linear dependency on maxDoc, especially if we only index every N-th block.

it seems plausible that the people that will benefit the most from the patch are also people where a full re-index is not trivial

Reindexing is not necessary, users can use IndexUpgrader if they want to upgrade to the latest index formats. I'd rather work directly on a new doc values format with faster advancing.

Checking that it does not hurt performance for smaller indexes also seems important.

I'm curious about the impact on faceting or sorting on the entire index too. Since this patch adds more conditions to speed up advancing by large intervals, I'd expect it to also slow down usage patterns that consume most documents a bit? Hopefully by very little.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Regarding overhead:

The advance/advanceExact for blocks in IndexedDISI still have the old check for whether the current block is the right one: https://github.com/tokee/lucene-solr/blob/lucene8374/lucene/core/src/java/org/apache/lucene/codecs/lucene70/IndexedDISI.java#L176-L219 The difference between skipping to the next block using the old blockEnd and skipping by looking up the offset in the long[] is tiny, but I guess the case "the wanted block is the block immediately after the current one" could be micro-optimized.

advanceWithinBlock and advanceExactWithinBlock are only modified for DENSE. They always involve a call to rankSkip. Looking at the code, I see that it stupidly does not take advantage of the current IndexedDISI state and always does the somewhat costly rank-calculations and seek, even when it would be faster to just go ahead the old way. I will fix that. Thanks for bringing it to my attention.

The variable bits per value (vBPV) structures used by numerics are equivalent to the DISI-blocks: If the current vBPV-block is the right one, there is no overhead. Getting a subsequent vBPV-block offset is a simple lookup in a long[]. As with DISI-blocks it could be optimized for the vBPV-block immediately following the current one, and it might be worth it if the offset-cache is read from index data and not from memory. It is a very simple optimization, so I'll just do it.
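
Such a shortcut could be as simple as the following sketch (assumed names, not the patch's actual code):

```java
// Stay on the cheap existing path unless the target block is a real jump away.
static long resolveBlockOffset(int currentBlock, long nextBlockOffset,
                               int targetBlock, long[] jumpTable) {
  if (targetBlock == currentBlock + 1) {
    // Sequential case: the offset of the following block is already known,
    // so no jump-table read is needed.
    return nextBlockOffset;
  }
  // Large jump: a single table lookup.
  return jumpTable[targetBlock];
}
```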

I'll think about search-time vs. index-time a bit and come back to you on that.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Added patch against master with the performance optimization for sequential access discussed earlier in this JIRA. The patch also fixes a critical bug for Lucene70NormsProducer in the previous patch, where the IndexedDISICache could in some cases be shared between fields.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Fixes edge-case bug when the last 16384 values for a numeric DocValues-field all had value 0. Adds a temporary Solr-hack for easy toggling of lucene-8374 caches for easy performance comparisons.

The caches to use can be controlled by adding lucene8374=<caches> to the Solr REST-calls, where <caches> is all, some or none of [norm, block, dense, vbpv, debug], e.g. ...select?q=foo&wt=json&lucene8374=block,debug. They can (of course) also be controlled when using Lucene directly through the class IndexedDISICacheFactory.

asfimport commented 6 years ago

Tim Underwood (@tpunder) (migrated from JIRA)

I just tested this out and am seeing a really good Solr faceting performance increase! One of the facet-heavy queries that I've been using for testing on SOLR-12878 and SOLR-12882 went from ~88 queries per second to ~190 queries per second!

Index Stats:

Total Docs: 83,846,867 (~1 million parent docs and the rest are child documents)

Index Size (on disk): 30.53 GB

Query Stats:

The query I'm using performs JSON facets against the child documents on 41 IntPoint fields and makes up one of the queries used to generate this page: https://www.opticatonline.com/search?bv=18220&region=usa . The facets are shown on the left side of the page. The "Part Type" (single-valued) and "Assembly Group" (multi-valued) facets are populated for every document. The "Linkage Criteria" make up the rest of the 41 fields and are sparsely populated with data (if at all).

I can share more details if interested.

When I get a chance I'll test this on my other large index that has less overall documents (8 million total) but they are all densely populated with multi-valued string fields that are used for faceting.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

That is great news, @tpunder. Thank you very much for testing!

I was pleasantly surprised by your relatively large performance increase, as my own tests had a more modest increase for faceting. Visiting your site, I see that your faceting is much richer than my simple facets, which would explain it. I guess that my base response time (query match + logistics) is relatively large - I should try to make a test with more facets (and one with fewer) to probe that.

I doubt that you will see much difference with your 8M docs index, but I would love to be proven wrong on that.

Could you tell me how much of your index data are memory-cached?

Anyway, your test verifies that the patch is relevant for other real-world setups, so I'll try and move forward with it. Thanks again.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Ported to Lucene/Solr 7.5.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Ported to Lucene/Solr 7.3.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Toke, the JIRA "fix version" should reflect the Git branches you commit to, or plan to commit to. And 7.5 and 7.3 are not branches you are allowed to commit to since this is not a bug fix. Once you do commit to branch_7x, you can put a fix version of "7.6" if 7.6 release branch hasn't been created yet. If the release branch has been created, you can do 7.7 even if 7.7 is never ultimately released. We don't have a "trunk" but we do have a "master". BTW I don't bother assigning "master" to fix version if I'm also going to commit to another branch, since it's implied. In other words, it's implied that if the fix version is 7.6 that the feature/fix will be fixed in 8 as well.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

@dsmiley the idea for making patches for older versions was to make it easier to measure its effect on existing setups: Testing an existing Solr 7.3 vs. a patched Solr 7.3 is much cleaner than testing a Solr 7.3 vs. patched master.

If that collides with the established workflow here, can you suggest how I can support easy testing?

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Feel free to post version-specific patches and/or create feature branches if it suits you. I'm just telling you how to use the JIRA "fix version" field; that's all. I know you're new to community development here so I'm just trying to help out.

asfimport commented 6 years ago

Tim Underwood (@tpunder) (migrated from JIRA)

@tokee You are correct: I have not seen any noticeable performance differences on my index with 8M docs. However, I've been looking into converting it into parent/child docs, which would expand it to ~300 million total docs, so I suspect this patch would help performance in that case.

I've included some updated numbers from re-running my tests and making use of the lucene8374 url parameter to enable/disable the caches.

I've been running my informal benchmarking on my laptop which has 32GB of RAM.  Activity Monitor reports 7.5GB of "Cached Files" and I see very little disk activity when running my tests so I suspect everything needed for faceting is in memory.

Test 1 - Faceting on data for a single vehicle (same test I previously ran)

Index Size: 35.41 GB

Total Documents: 84,159,576

Documents matching query: 3,447

| lucene8374 Caches | Requests per second |
|-------------------|---------------------|
| All Disabled | 88/second |
| All Enabled | 266/second |

Note: I'm using Apache Bench (ab) for this test with a static query, concurrency of 10, and 5,000 total requests.  This is really just testing the performance of calculating the facets since the set of matching documents should be cached by Solr.

Test 2 - Faceting on most of the index

Index Size: 35.41 GB

Total Documents: 84,159,576

Documents Matching Query: 73,241,182

| lucene8374 Caches | Time for Single Request |
|-------------------|-------------------------|
| All Disabled | ~117 seconds |
| All Enabled | ~117 seconds |

Note: This test is so slow that I only run one query at a time. This is NOT an actual use case for me. I was just curious if there was any performance difference.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

@dsmiley I think I got it now, thanks. I was confused by the upload-dialog, where the Fix version is available: I thought I should set it to reflect the concrete upload (e.g. a 7.3 patch for testing) rather than the overall issue. I will adjust it to 7.6 with master implied. I will also consider making a feature branch: If this is to be part of the codec instead of search-time only, I'll probably need assistance to make it happen. Thank you for your help.

 

@tpunder Thank you for the follow-up. I would expect a parent/child structure with 300M documents to fit extremely well with the patch. Your new information supports that there are non-trivial gains even when everything is disk cached (yay!) and that small result sets benefit the most (expected, but nice to verify).

As for the "Faceting on most of the index" test, I guess all the time is spent on calculating the facets, and any speed-up for the retrieval of the facet values themselves is insignificant compared to the overall time. While the patch is also relevant for the calculation phase (to determine the ordinals for the facet values for the documents), a match for nearly all documents means a sequential iteration with extremely small gaps, where the jump tables are not used.

It is nice to see that there is no penalty for having the optimizations enabled, though strictly speaking one needs to compare unpatched vs. patched rather than patch disabled/enabled to be sure that the if (patch_enabled) ... -check does not in itself make a difference (and to be sure I did not mess up elsewhere). But measuring tiny differences between individual runs is of course quite hard.

asfimport commented 6 years ago

Tim Underwood (@tpunder) (migrated from JIRA)

For what it's worth, my original testing was with unpatched vs. patched.

I did some quick YourKit CPU sampling on my 2 tests with caches enabled/disabled. I've excluded "[Wall Time] java.lang.Thread.sleep(long)", which is why the "CPU (What-If)" tab is active.

I should also note that all of my testing (both caches enabled and disabled) has been done with 3 other patches applied (in addition to this one):  SOLR-12875 (ArrayIndexOutOfBoundsException in JSON Facets), SOLR-12878 (large performance issue in FacetFieldProcessorByHashDV) and SOLR-12882 (lambda memory allocation removal in FacetFieldProcessorByHashDV.collectValFirstPhase).

Note: Image thumbnails do not seem to be working right so I've included the file name so you can find the right image attachment that goes with the smaller resized image.

Test 1 - Faceting on data for a single vehicle

Caches Enabled:

image-2018-10-24-07-30-06-663.png

Caches Disabled:

image-2018-10-24-07-30-56-962.png

Test 2 - Faceting on most of the index

Caches Enabled:

start-2018-10-24_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png

Caches Disabled:

start-2018-10-24-1_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

@tpunder Profiling, nice! Must. Resist. Urge. To. Look. At. MonotonicDocValues :)

The differences between the profiles for "Faceting on most of the index" seem like jitter to me. Good to verify that.

"Faceting on data for a single vehicle" has multiple interesting measurements:

In the current patch, statistics for LUCENE-8374 are written to stdout, which Solr collects in the logfile solr-XXXX-console.log. If possible, I would like to see the content of that file as it contains a break-down of the DocValues-fields, thanks.

asfimport commented 6 years ago

Tim Underwood (@tpunder) (migrated from JIRA)

@tokee I've attached the logs for both the single vehicle case and the entire index case. For each case I started up the server, ran a single request (either the single vehicle or entire index query) and then shut down the server.

I haven't had a chance to dig into the CPU sampling results or your comments about them yet. But I will :). If I find some time I'll run some non-sampled CPU profiling and maybe some memory allocation profiling too :). Speeding up anything related to faceting would be a win for most of my use cases.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

@tpunder it might be the sampling that causes advanceExactWithinBlock not to show up. Don't fret about it - I'll try some profiling experiments myself to see what's up with that.

Your logs are very informative, thanks:

Segments: 41
Fields: Total=41, vBPV(long)=30 (blocks: 128095)
DISI-blocks: ALL=41, DENSE=76793, SPARSE=123, EMPTY=0
Overhead: 20.98 MB, 412 ms startup

The patch introduces jumps for DISI-blocks, DENSE and vBPV. As most of your fields have this exact combination, your index is very receptive to the optimizations. Good for you, and a caveat that this level of speed-up is not to be expected generally.

There are a lot of numerics for a front-end that uses string facets. I guess you represent the facet entries as IDs and look up the string representations from another source? Is this because you found string faceting to be too slow? I tried looking at DocValues string resolving, but there did not seem to be any easy gains like the one for numerics.

asfimport commented 6 years ago

Tim Underwood (@tpunder) (migrated from JIRA)

Yes, for that index I think almost everything[1] is indexed as Int IDs and then the entities they represent are looked up and converted to strings before being displayed on the front end.  I don't think I ever considered or tried String fields for those since the Int IDs are a natural fit.  We also have a few cases where we attempt to apply translations to things like the Part Type dropdown so we need to know the ID anyways (e.g. https://www.opticatonline.com/search?bv=18220&region=usa&lang=es-MX).

My other index[2] (with \~8 million docs) makes more use of String fields but that is mostly due to not using parent/child docs and needing to make use of facet prefix filtering to match values for a specific vehicle.  For example a value might look like "5411/1004" where "5411" represents the id of the vehicle I'm filtered to and "1004" represents the type of part.  If I ever convert that index to parent/child docs then I could convert a lot of those fields to ints.

 

[1] - The Brands are actually indexed as their 4 character ID string (e.g. "BBHK" for the brand "Bosch")

[2] - I don't think I have any good non-login protected examples of this index.  This one has a very limited view on the data (if you have a non North American country selected):  https://propartsnet.opticatonline.com/search?ltt=pc&ltid=5411&lcc=DK&cc=DK&lang=da&bn=100019 . It works very similar to the www.opticatonline.com site except uses different underlying data for the non US/MX/CA countries.

asfimport commented 6 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

We've used the patch in production for a few months and found zero issues. @tpunder was kind enough to provide independent verification of its impact, for a case where the jump-tables are activated as well as one where they aren't (to check for performance regression). To me this seems solid enough to move forward.

I understand the recommendation from @jpountz to bake this into the DocValues codec. From my perspective, it makes sense to also make a search-time version for the current codec, which is what the current patch does. I have a few reasons for that:

  1. We (the Royal Danish Library) really need it. Well, we don't, as we already use the patch, but figuratively we do: We have 70TB+ of indexes which are logistically hard to run through an index upgrader. I imagine other people with large indexes face similar challenges. By supporting search-time jump-tables the upgrade becomes plug'n'play for those.
  2. It is practically free. The patch needs a clean-up, but unless I really messed up, this should be a minor thing.
  3. My previous “let's improve Solr performance”-project died the legacy-code death by becoming unwieldy; hard to review and hard to port. I would very much like to avoid a repeat and so I push for the minimum viable product approach.

Unless there is direct opposition to adding the search-time patch to Lucene/Solr master (with a back-port to 7), I will go ahead with this process. For that I need some help.

  1. What to do about messaging? The current patch spews statistics on stdout whenever a new segment is accessed. The Lucene codebase seems very silent, so should I just remove all that? Or can I tie into some sort of “enable debugging messages”-system?
  2. Currently the different jump-tables can be turned off and on on a global basis. This is extremely useful for evaluating the effect, but not something people would normally have a use for. I could just leave it as it does no functional harm, but my suspicion is that it would be seen as clutter in the code base?
  3. Review after I have fixed 1+2? It would be nice, but I know that it is also a heavy thing. I guess I'll leave it as a wish for some time and then go ahead with a merge if no one bites.

asfimport commented 5 years ago

David Smiley (@dsmiley) (migrated from JIRA)

  1. Use of stdout needs to either be removed, commented out, or guarded by a static final boolean if you really want it to stay.
  2. Similar to #1, but I have a stronger preference for removing it entirely, as I imagine the option would be less self-contained (i.e. yes, it would add clutter)

Thanks for the contribution Toke! The sooner this gets committed, the sooner Jenkins starts beating on it.

asfimport commented 5 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Instead of printing messages to stdout, you may add statistics using the Lucene-internal logging of IndexWriter. But this should still be reduced to important stuff that might be helpful during debugging the indexing chain and for end users who want to see merges. This could be an option, but I do not know how verbose it is now.

If you look around other codecs, most debugging output is just commented out.

asfimport commented 5 years ago

Tim Underwood (@tpunder) (migrated from JIRA)

@tokee Here is a delayed follow-up on your FieldCacheImpl#Cache.get observations:

At first I was a little confused why the FieldCache was showing up at all since I have docValues enabled on almost everything in order to avoid the uninverting.  However looking at the Solr cache stats page shows the root field showing up in the field cache.  That makes sense since I don't have docValues=true specified for it and also since I'm requesting the "uniqueBlock(root)" count for each of my facet fields (since I only care how many parent documents match and not how many children).

Anyway, as to why it shows up in the CPU sampling as taking so much time, my best guess is that it has something to do with the synchronized blocks in FieldCacheImpl#Cache.get. As an experiment (which ignores the weak keys) I swapped out the WeakHashMap<> (which uses nested HashMaps) for a ConcurrentHashMap with nested ConcurrentHashMaps, in order to get rid of the synchronized blocks. After doing that, FieldCacheImpl#Cache.get disappeared from the CPU sampling. There may have been a minor performance increase, but it certainly wasn't close to the 68,485ms that showed up in the original profiling. So it might have just been an artifact of the interaction between the CPU sampling and the synchronized blocks.
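
The experiment could look roughly like this sketch (it ignores the weak-key semantics of the real FieldCacheImpl, as noted above; the class and method names are made up):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Replacement for a synchronized nested-map cache: computeIfAbsent gives
// atomic get-or-create without explicit synchronized blocks.
final class NestedCache<K1, K2, V> {
  private final ConcurrentMap<K1, ConcurrentMap<K2, V>> outer = new ConcurrentHashMap<>();

  V get(K1 readerKey, K2 fieldKey, Function<K2, V> loader) {
    return outer.computeIfAbsent(readerKey, k -> new ConcurrentHashMap<>())
                .computeIfAbsent(fieldKey, loader);
  }
}
```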

Perhaps I'll go back and play with it some more and try swapping in Guava's Cache in order to make the weak keys work properly. Or maybe I'll try enabling docValues on my root field to see what that does.

asfimport commented 5 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Thank you, @dsmiley & @uschindler. The jump-tables are field-oriented, so the amount of output is currently #segments * #DocValue_fields * #reopens = verbose. Much too fine-grained from what Uwe describes. I'll remove it all.

Same goes for the options for enabling & disabling the caches. Should it be relevant at a later point, that part is quite easy to re-introduce.

asfimport commented 5 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Cleaned patch (no output, no on/off switching) as suggested by @uschindler and @dsmiley. The patch applies to master as per 2018-11-07 and passes ant precommit, ant test and ant test-forbidden-apis. The core code is the same as LUCENE-8374.patch so anyone wanting to play with turning caching on/off should use that instead.

This patch is what I hope to commit to master. Or rather: This is the first version of what I hope to commit, as I would be grateful if anyone would take a look at it and tell me if I messed up somewhere. I am not too worried about functionality and Jenkins helps there. It is mostly architecturally where I am on shaky ground.

asfimport commented 5 years ago

Lucene/Solr QA (migrated from JIRA)

+1 overall

| Vote | Subsystem | Runtime | Comment |
|------|-----------|---------|---------|
| | Prechecks | | |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| | master Compile Tests | | |
| +1 | compile | 0m 32s | master passed |
| | Patch Compile Tests | | |
| +1 | compile | 0m 30s | the patch passed |
| +1 | javac | 0m 30s | the patch passed |
| +1 | Release audit (RAT) | 0m 30s | the patch passed |
| +1 | Check forbidden APIs | 0m 30s | the patch passed |
| +1 | Validate source patterns | 0m 30s | the patch passed |
| | Other Tests | | |
| +1 | unit | 13m 36s | core in the patch passed. |
| | | 16m 55s | |

| Subsystem | Report/Notes |
|-----------|--------------|
| JIRA Issue | LUCENE-8374 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947230/LUCENE-8374.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 1b084db |
| ant version | Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_191 |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/120/testReport/ |
| modules | C: lucene/core U: lucene/core |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/120/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

asfimport commented 5 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I have cleaned up the previous patch (removed dead code, out-commented lines etc.). Reviewing 2000+ patch-lines is a daunting task, so instead of trying to get anyone to commit to that, let me present the different parts - if anything sounds odd, don't hesitate to comment. I plan to push it to master in the near future.

2 new utility classes are used for creating and accessing compact representations of sparse arrays of whole numbers:

3 new classes are used to hold the generated jump-tables:

2 classes are changed to use the caches:

asfimport commented 5 years ago

Lucene/Solr QA (migrated from JIRA)

+1 overall

| Vote | Subsystem | Runtime | Comment |
|------|-----------|---------|---------|
| | Prechecks | | |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| | master Compile Tests | | |
| +1 | compile | 1m 26s | master passed |
| | Patch Compile Tests | | |
| +1 | compile | 0m 19s | the patch passed |
| +1 | javac | 0m 19s | the patch passed |
| +1 | Release audit (RAT) | 0m 19s | the patch passed |
| +1 | Check forbidden APIs | 0m 19s | the patch passed |
| +1 | Validate source patterns | 0m 19s | the patch passed |
| | Other Tests | | |
| +1 | unit | 30m 26s | core in the patch passed. |
| | | 37m 1s | |

| Subsystem | Report/Notes |
|-----------|--------------|
| JIRA Issue | LUCENE-8374 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947631/LUCENE-8374.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 42ee966 |
| ant version | Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | 1.8.0_191 |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/121/testReport/ |
| modules | C: lucene lucene/core U: lucene |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/121/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'd really like to bake it into the codec and avoid computing the cache dynamically. If someone really needs this feature with the current codec, they could make a custom build that applies your patch?

As far as reviewing is concerned, maybe it would help to split the improvements to skipping within blocks and skipping over blocks into two different patches (or commits)? That might help dig into test failures or performance issues in the future as well, since git bisect would point to a smaller commit.

asfimport commented 5 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

@jpountz I do not understand your resistance against providing an out-of-the-box performance improvement to Lucene-based search engines. I do not see index-upgrades as trivial exercises and I have previously encountered quite a lot of resistance against trying patched versions of Lucene/Solr. As a developer it is probably second nature to you. For a lot of users, not so much.

Your patch-splitting suggestion is interesting. The three parts [DISI-blocks, dense and variable bits per value] could reasonably easily be split into [DISI-blocks, dense] and [variable bits per value]. Or they could not-so-easily be split into all 3 parts. They would be additive though (as opposed to independent), as they share a lot of structure. If splitting helps acceptance, I'll try and commit that way. I'm also game to make a split patch, if someone commits to do a review.

Note: I am aware that Lucene/Solr 7.6 is in pre-release and that larger patches should not be committed to master until 7.6 has been released.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I do not understand your resistance against providing an out-of-the-box performance improvement to Lucene-based search engines.

I have nothing against making performance better, quite the opposite! But as often there are trade-offs and this patch spends significant CPU and I/O when opening the reader in order to build a cache while the same information could be computed at index time. This might be fine for static readers, but it doesn't sound like a net win otherwise.

I do not see index-upgrades as trivial exercises and I have previously encountered quite a lot of resistance against trying patched versions of Lucene/Solr.

Sure, I agree patched versions have lots of drawbacks. I was suggesting this workaround in case upgrading was a no-go, given that this is more acceptable to me than adding such caching to the codec.

If splitting helps acceptance, I'll try and commit that way.

Thanks. I'm mostly interested in splitting from the perspective of tracking regressions. For instance it will be easier to track which change a slowdown/speedup relates to in Mike's nightly benchmarks.

I'm also game to make a split patch, if someone commits to do a review.

I'll take care of reviews.

Note: I am aware that Lucene/Solr 7.6 is in pre-release and that larger patches should not be committed to master until 7.6 has been released.

I don't think this is an issue. Nick already cut branch_7_6, so the 7.6 release shouldn't impact work that we are doing on the master and 7.x branches?

asfimport commented 5 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

this patch spends significant CPU and I/O when opening the reader

This patch spends the IO equivalent of making a single worst-case search matching 1/16384th of the documents in the newly opened segment, sorting on a numeric DocValues field, for each used DocValued field. I am hard pressed to imagine a setup where that IO-penalty wouldn't pay off very quickly.

As for CPU, it does perform a compression of the DENSE data structure to lower the memory overhead. In Tim's setup, I could not get the timing for his smaller segments as they took less than 1ms. For his largest segment (19M documents) it took a total of 23ms across 41 DocValues fields. For Tim, the startup overhead (a total of 412ms across all segments & fields) was paid off in less than a second of operation.

Based on this, I disagree on your assertion of "significant" performance overhead for search-time. It might turn out to be significant when applied broadly or tested in depth, but at this point neither theory nor observations suggests so.

For instance it will be easier to track which change a slowdown/speedup relates to in Mike's nightly benchmarks.

As discussed on Mike's GitHub space, his nightly benchmarks are not really tracking this part (large-jump access of DocValues) of Lucene. If they were, an alarm would have been raised on the switch to the iterative API.

It is quite understandable that such tests are missing, as large indexes requires a lot of computing resources and/or time. How to quickly catch such regressions in the future, without spending a lump sum on hardware, is a challenge that I have no ready answers for.

In its current form, Mike's benchmarks seems fine for tracking LUCENE-8374 regressions in the areas where there is no expectation of a speed-up, which is an important part of testing the value of an attempted optimization.

I'll take care of reviews.

Thank you, I appreciate that.

As for the wait-for-7.6, I just based it on the note in https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/ajax/%3CCABM%2Bu9JU%3DQFshN_juicOohU8dZQBtZ0Nc7Zs79_Wt9wZRjTVqA%40mail.gmail.com%3E about big changes. I am happy to classify this patch as a non-big change.

asfimport commented 5 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

As suggested, I have divided the patch into 4 parts:

  1. IndexedDISI block skips (every 65536 docIDs, low impact)
  2. IndexedDISI DENSE block (high impact)
  3. Varying Bits Per Value (high impact)
  4. The 3 above enabled for Lucene70NormsProducer

Collectively this is the same as the full LUCENE-8374 patch. The split is for easier review and to add the changes as 4 commits, for easier bisection in search of errors or other surprising behaviour.

I will let this sit here for a few days. If nothing happens, I plan to push to master to let Jenkins play with it.

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 58a7a8ada5cebeb261060c56cd6d0a9446478bf6 in lucene-solr's branch refs/heads/master from @tokee https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=58a7a8a

LUCENE-8374 part 1/4: Reduce reads for sparse DocValues

Offset and index jump-table for IndexedDISI blocks.