Ben Manes (migrated from JIRA)
In retrospect, a separate QueryCache should be implemented. LRUQueryCache declares in its contract that methods like onHit, onQueryEviction, etc. are executed under the global lock. This means implementations may rely on this exclusive read/write access to data structures, a requirement that cannot be supported efficiently. There are external usages of these hooks, such as in Elasticsearch, which would need to be reviewed. A deprecation and migration would be a safer approach for a transition.
Adrien Grand (@jpountz) (migrated from JIRA)
Hi Ben,
This looks like a good summary of the issues that affect our cache. I'm quite impressed by how thorough this list is!
> In retrospect, a separate QueryCache should be implemented. LRUQueryCache declares in its contract that methods like onHit, onQueryEviction, etc. are executed under the global lock. This means implementations may rely on this exclusive read/write access to data structures, a requirement that cannot be supported efficiently.
If we can build a better cache then we will find a way to transition our users to it, I wouldn't worry about the migration path yet. We'll figure something out.
> Since the developers are experts on search, not caching, it seems justified to evaluate if an off-the-shelf library would be more helpful in terms of developer time, code complexity, and performance.
We want lucene-core to be dependency-free, so we couldn't add caffeine as a dependency of lucene-core. However other options include having it as a dependency of a module that would expose a different cache implementation, reuse some of its ideas in the current cache implementation or fork the code that we need.
> It appears that the cache's overhead can make it just as much of a liability as a benefit, causing various workarounds and complexity.
FYI when I implemented this cache, I went for simplicity in terms of locking, so there is certainly room for improvement. One thing that is not immediately obvious and makes implementing a query cache for Lucene a bit tricky is that it needs to be able to efficiently evict all cache entries for a given segment. This is the reason why the current implementation uses two levels of maps instead of a single map that would take (Query, CacheHelper.Key) pairs as keys.
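A minimal sketch of the two-level shape described above (an illustration with hypothetical names, not the actual Lucene code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the outer map is keyed by the segment's cache key, so
// evicting everything for a closed segment removes a single outer entry
// instead of scanning all (Query, CacheHelper.Key) pairs.
final class TwoLevelLayout {
  // segment cache key -> (query -> cached doc id set)
  private final Map<Object, Map<Object, Object>> perSegment = new HashMap<>();

  void onSegmentClosed(Object segmentCacheKey) {
    perSegment.remove(segmentCacheKey); // O(1) on the outer map
  }
}
```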
Ben Manes (migrated from JIRA)
On the train ride to work, I started to play with stubbing out an implementation to better understand what an implementation could look like. For now I'm just untangling things in my head due to lack of familiarity and not expecting anything to be adopted.
> We want lucene-core to be dependency-free, so we couldn't add caffeine as a dependency of lucene-core.
I am certainly fine with that and worry about it if I can offer something promising. In addition to the options you mentioned, we could shade / shadow the dependency to an internal package name.
> One thing that is not obvious immediately and makes implementing a query cache for Lucene a bit tricky is that it needs to be able to efficiently evict all cache entries for a given segment.
Thank you. I was trying to understand the LeafCache and was still under the impression that it was unnecessary complexity. Can you explain why caching per segment is needed? It certainly makes things a lot harder, since the segments grow as you cache queries at the segment level.
Is this so that when updates occur all of the related cached queries are invalidated, to avoid stale responses? If so, would some versioning / generation field be applicable to maintain a single-level cache? In that model the generation id is part of the key, so a simple increment prevents any of the prior content from being fetched. This is common in remote caches (e.g. memcached) and, if doable here, we could maintain an index to proactively remove those stale entries.
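As a rough illustration of that generation-id idea (a hedged sketch using Caffeine; all names are hypothetical, and the proactive-removal index is omitted so stale entries simply age out):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a generation id is folded into the cache key, so
// bumping a segment's generation makes all of its prior entries unreachable.
final class GenerationalQueryCache<Q, V> {
  record Key<Q>(Object segment, long generation, Q query) {}

  private final ConcurrentHashMap<Object, AtomicLong> generations = new ConcurrentHashMap<>();
  private final Cache<Key<Q>, V> cache = Caffeine.newBuilder().maximumSize(10_000).build();

  private long generationOf(Object segment) {
    return generations.computeIfAbsent(segment, s -> new AtomicLong()).get();
  }

  V get(Object segment, Q query) {
    return cache.getIfPresent(new Key<>(segment, generationOf(segment), query));
  }

  void put(Object segment, Q query, V value) {
    cache.put(new Key<>(segment, generationOf(segment), query), value);
  }

  // Invalidate a segment by advancing its generation; old entries age out.
  void invalidate(Object segment) {
    generations.computeIfAbsent(segment, s -> new AtomicLong()).incrementAndGet();
  }
}
```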
Adrien Grand (@jpountz) (migrated from JIRA)
This is not due to invalidation but to how Lucene groups data into segments, which get regularly merged together into fewer, bigger segments. When segments get merged away, they are closed, which triggers a callback on the cache telling it that it may remove all entries that are about these segments, since they will never be used again. Before Lucene introduced a query cache, Elasticsearch used to have its own query cache that was based on Guava, used (Query, CacheHelper.Key) pairs as keys, and evicted all entries for a segment by iterating over all cached entries and removing those that were about this segment. It triggered some interesting behaviors when closing top-level readers, which in turn closes all their segments in sequence, which in turn iterates over all remaining cached entries. So if you want to cache Q queries and have S segments, then you may have up to QxS entries in your cache, and thus closing the reader runs in O(QxS^2); we were seeing users whose clusters would take ages to close indices because of this. One could make the argument that it is not required to evict those entries and that we could wait for them to get evicted naturally, but I don't like the idea of spending some of the JVM memory on unused data.
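To put rough numbers on the O(QxS^2) behavior (illustrative figures, not from the thread): with S = 100 segments and Q = 1,000 cached queries, the cache may hold up to QxS = 100,000 entries. Closing the top-level reader closes the segments one at a time, and each close scans the entries that remain, so the total work is bounded by S scans of up to QxS entries each, i.e. on the order of QxS^2 = 10,000,000 entry visits for a single reader close.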
Ben Manes (migrated from JIRA)
Interesting. Off the cuff...
It sounds like we'd want to orchestrate it such that a write to level-2 communicates to level-1 that the instance was modified, e.g. replace(key, v1, v1). That would trigger Guava/Caffeine to re-weigh the entry and perhaps trigger an eviction. If that write-back failed, e.g. the entry was removed or became v2, then the caller would have to manually call the entry's eviction logic (e.g. if closable). The L1 would be bounded to evict segments and L2 unbounded, which matches the current implementation. The coordination would need to be handled, but shouldn't be overly tricky, if I understand correctly.
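A hedged sketch of that coordination, assuming Caffeine's weighted Cache API (class and field names are hypothetical):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the re-weigh trick: L1 is weighted by the number of
// entries in each per-segment L2 map. After mutating an L2 map, a same-value
// replace on L1 forces Caffeine to run the weigher again, which may trigger
// the eviction of a whole segment.
final class TwoLevelSketch {
  private final Cache<Object, Map<Object, Object>> l1 = Caffeine.newBuilder()
      .maximumWeight(100_000)
      .<Object, Map<Object, Object>>weigher((segment, queries) -> queries.size())
      .build();

  void add(Object segment, Object query, Object docs) {
    Map<Object, Object> perSegment = l1.get(segment, s -> new ConcurrentHashMap<>());
    perSegment.put(query, docs);
    // Write back the same value; a successful replace re-weighs the entry.
    if (!l1.asMap().replace(segment, perSegment, perSegment)) {
      // The segment was evicted or replaced concurrently; per the discussion
      // above, the caller must run the entry's eviction logic itself.
    }
  }
}
```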
Ben Manes (migrated from JIRA)
Attached a very rough sketch of what this could look like. A cache hit would be lock-free and a miss would be performed under a per-segment computeIfAbsent. A cheap write-back computation would cause the segment to be re-weighed, perhaps triggering an eviction. A lot of LRUQueryCache needs to be ported over, but I think that is straightforward. It may look a lot like the current cache in the end, but benefit from having concurrent data structures to work off of.
Let me know if you think this is the right direction.
Ben Manes (migrated from JIRA)
Why would it be O(QxS^2) and not O(Q) entries to remove?
My initial impression would be to also have the key be a (Query, CacheKey) pair and let the eviction policy drop the individual entries. When a new entry is added, we would maintain a separate index, CacheKey => Set<Key>, to remove entries when invalidating the CacheKey. This secondary index would be maintained using computations and be a weak-key reference cache, which lets us ignore a few subtle races.
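A minimal sketch of that single-level design (hypothetical names; the weak-key handling and the subtle races it papers over are elided):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: (query, segment cache key) pairs are the cache keys,
// and a secondary index tracks which composite keys belong to each segment so
// that closing a segment invalidates only that segment's entries.
final class SingleLevelSketch<Q, V> {
  record Key<Q>(Q query, Object segmentKey) {}

  private final Cache<Key<Q>, V> cache =
      Caffeine.newBuilder().maximumSize(10_000).build();
  private final ConcurrentHashMap<Object, Set<Key<Q>>> index = new ConcurrentHashMap<>();

  void put(Q query, Object segmentKey, V value) {
    Key<Q> key = new Key<>(query, segmentKey);
    index.computeIfAbsent(segmentKey, k -> ConcurrentHashMap.newKeySet()).add(key);
    cache.put(key, value);
  }

  // Called when a segment is closed: work proportional to that segment's
  // cached entries, rather than a scan over the entire cache.
  void onSegmentClosed(Object segmentKey) {
    Set<Key<Q>> keys = index.remove(segmentKey);
    if (keys != null) {
      cache.invalidateAll(keys);
    }
  }
}
```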
The caching policy already includes a frequency-based admission filter on top of LRU, similar in spirit to your scheme. However, in a two-layer model it would track the frequency of the leaf caches rather than of the query entries, which negates its usefulness. I think that, thanks to richer computations and better write concurrency, we could make a simpler one-layer cache work efficiently.
Do you have benchmark scenarios that I could run? If so, I might create both versions so that we could benchmark to see which is preferable.
Adrien Grand (@jpountz) (migrated from JIRA)
> The L1 would be bounded to evict segments and L2 unbounded, which matches the current implementation.
I don't think it matches the current implementation. When going over the maximum weight, your patch would remove all cache entries from the least-recently-used segment, while the current implementation removes all cache entries for the least-recently-used query? For the record, this part of the current impl is not great as it makes eviction run in linear time with the number of segments, but I couldn't find any other way that wouldn't introduce worse issues.
> Why would it be O(QxS^2) and not O(Q) entries to remove?
If we are talking about the size of the cache, then it would be about O(QxS) assuming that queries are cached on every segment. Here I was more commenting on sequentially removing entries for all segments, one segment at a time and the overall runtime of doing this with a cache that uses (Query,CacheKey) pairs as keys.
> Do you have benchmark scenarios that I could run?
Unfortunately I don't.
Ben Manes (migrated from JIRA)
Hmm... okay, so the behavior is certainly too complex for my initial runs at the problem.
A few ideas to consider, based on evolving the cache to help alleviate some of the challenges:

- Use tryLock on read by borrowing the trick of recording the access into a striped, lossy ring buffer and replaying the events under the lock (sketched below). This lets you use a ConcurrentHashMap for lock-free reads. Instead of contending on a lock to perform tiny LRU reordering work, you schedule and perform a batch under the lock to modify the non-threadsafe data structures. The striping and lossy behavior mitigates hot spots under high load.
- Use striped locks when computing a query's results (e.g. var lock = locks[key.hashCode() % locks.length]).

This would mitigate some of the problems but also add complexity, whereas I originally hoped to reduce that burden.
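A rough sketch of the lossy read-buffer idea from the first bullet (a simplified single-buffer illustration; names and sizes are hypothetical):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical sketch of a lossy read buffer: reads record the accessed entry
// into a ring buffer and, once it fills, one thread tryLocks and replays the
// batch against the non-thread-safe LRU ordering. Losing an access is fine;
// it only slightly degrades the recency information.
final class ReadBuffer<E> {
  private static final int SIZE = 16; // power of two for cheap masking
  private final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
  private final AtomicLong writes = new AtomicLong();
  private final ReentrantLock lock = new ReentrantLock();

  void record(E entry, Consumer<E> reorder) {
    long index = writes.getAndIncrement();
    buffer.set((int) (index & (SIZE - 1)), entry); // lossy: may overwrite
    if (((index & (SIZE - 1)) == SIZE - 1) && lock.tryLock()) {
      try {
        for (int i = 0; i < SIZE; i++) {
          E e = buffer.getAndSet(i, null);
          if (e != null) {
            reorder.accept(e); // e.g. move the entry to the LRU's tail
          }
        }
      } finally {
        lock.unlock();
      }
    }
  }
}
```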
Adrien Grand (@jpountz) (migrated from JIRA)
These sound like good ideas to me.
Ben Manes (migrated from JIRA)
I added per-segment ring buffers for recording the query hits. When a buffer is over half full, tryLock is used to optimistically drain it back onto the LRU. When an entry is added via putIfAbsent, the buffers are drained prior to the eviction. The maps were changed to ConcurrentHashMap so that reads can be lock-free when successful. This should improve read throughput, but it is no longer a strict LRU like your tests expect. In the face of concurrency that shouldn't be expected, so the stats + LRU ordering tests are intentionally broken.
This design is similar to Guava's cache, which forked the legacy ConcurrentHashMap to add per-segment features. When I proposed how to add LRU, I stubbed the code using ConcurrentLinkedQueue read buffers. Unfortunately my intent to replace that with ring buffers never garnered interest, even though it improved performance 25x. This still suffers from hot spots since hot entries land in the same segment, but it is similarly the least invasive fix given the existing code.
I also added a simple striped locking scheme for computing the query results. This should allow only a single thread to compute the results for a given query and avoid stampede effects. A subsequent thread would block waiting for the first to finish caching the results.
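A hedged sketch of that stampede protection (hypothetical names; re-checking the cache after acquiring the lock is left to the caller):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the striped locking: threads computing the same query
// hash serialize on one of N locks, so only the first performs the expensive
// work while later threads block and should then find the cached result.
final class StripedComputation {
  private final ReentrantLock[] locks = new ReentrantLock[64];

  StripedComputation() {
    for (int i = 0; i < locks.length; i++) {
      locks[i] = new ReentrantLock();
    }
  }

  <V> V compute(Object key, Callable<V> loader) throws Exception {
    ReentrantLock lock = locks[Math.floorMod(key.hashCode(), locks.length)];
    lock.lock();
    try {
      // The caller should re-check the cache here before invoking the loader.
      return loader.call();
    } finally {
      lock.unlock();
    }
  }
}
```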
How does this look?
Ben Manes (migrated from JIRA)
I tried running the luceneutil benchmark against this change, rebased on master. The benchmark is pretty noisy and I'm not sure how the cache interacts with it, but these were the results:
```
Task                         QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff
Respell 184.34 (32.4%) 167.09 (34.9%) -9.4% ( -57% - 85%)
Fuzzy1 213.08 (15.1%) 202.41 (15.4%) -5.0% ( -30% - 30%)
BrowseMonthSSDVFacets 1789.91 (10.4%) 1759.04 (11.6%) -1.7% ( -21% - 22%)
LowTerm 3172.83 (11.3%) 3149.06 (11.1%) -0.7% ( -20% - 24%)
LowSloppyPhrase 510.21 (12.6%) 505.35 (5.4%) -1.0% ( -16% - 19%)
OrHighLow 911.22 (11.4%) 907.11 (8.7%) -0.5% ( -18% - 22%)
MedSpanNear 639.59 (14.3%) 637.37 (11.8%) -0.3% ( -23% - 29%)
HighTermMonthSort 1410.18 (14.8%) 1414.44 (17.8%) 0.3% ( -28% - 38%)
OrHighHigh 282.72 (18.9%) 283.90 (27.8%) 0.4% ( -38% - 58%)
AndHighLow 1811.44 (16.5%) 1826.13 (8.5%) 0.8% ( -20% - 30%)
LowPhrase 830.24 (12.8%) 837.28 (9.7%) 0.8% ( -19% - 26%)
BrowseDayOfYearSSDVFacets 1538.60 (9.5%) 1552.58 (11.5%) 0.9% ( -18% - 24%)
HighTerm 1010.87 (11.3%) 1020.64 (9.9%) 1.0% ( -18% - 24%)
MedPhrase 571.41 (11.5%) 579.31 (7.3%) 1.4% ( -15% - 22%)
MedSloppyPhrase 417.12 (21.1%) 423.51 (21.7%) 1.5% ( -34% - 56%)
LowSpanNear 746.19 (18.1%) 758.25 (12.9%) 1.6% ( -24% - 39%)
Wildcard 184.23 (29.0%) 187.63 (29.3%) 1.8% ( -43% - 84%)
BrowseDateTaxoFacets 2747.64 (15.6%) 2804.34 (14.5%) 2.1% ( -24% - 38%)
BrowseDayOfYearTaxoFacets 6748.62 (7.1%) 6900.47 (6.1%) 2.3% ( -10% - 16%)
AndHighHigh 608.66 (11.9%) 622.76 (16.0%) 2.3% ( -22% - 34%)
AndHighMed 1974.49 (14.2%) 2031.35 (10.3%) 2.9% ( -18% - 31%)
Fuzzy2 19.26 (69.9%) 19.84 (54.5%) 3.0% ( -71% - 423%)
MedTerm 2809.96 (9.1%) 2900.82 (10.7%) 3.2% ( -15% - 25%)
HighIntervalsOrdered 253.46 (37.7%) 261.87 (43.4%) 3.3% ( -56% - 135%)
BrowseMonthTaxoFacets 6838.39 (8.4%) 7109.56 (8.4%) 4.0% ( -11% - 22%)
HighSloppyPhrase 379.10 (20.3%) 395.81 (20.4%) 4.4% ( -30% - 56%)
HighTermDayOfYearSort 498.94 (15.7%) 527.78 (13.5%) 5.8% ( -20% - 41%)
PKLookup 158.51 (27.8%) 169.54 (12.9%) 7.0% ( -26% - 65%)
Prefix3 168.46 (38.7%) 180.95 (36.4%) 7.4% ( -48% - 134%)
HighPhrase 260.05 (34.0%) 279.62 (20.5%) 7.5% ( -35% - 94%)
IntNRQ 598.33 (33.7%) 651.97 (33.9%) 9.0% ( -43% - 115%)
OrHighMed 378.56 (32.7%) 427.55 (16.9%) 12.9% ( -27% - 93%)
HighSpanNear 217.85 (37.3%) 249.79 (36.3%) 14.7% ( -42% - 140%)
```
LRUQueryCache appears to play a central role in Lucene's performance. There are many issues discussing its performance, such as #8290, #8292, #9075, #9260, and #10045. It appears that the cache's overhead can make it just as much of a liability as a benefit, causing various workarounds and complexity.
When reviewing the discussions and code, the following issues are concerning.
It seems that more and more items skip being cached because of concurrency and hit-rate concerns, causing special-case fixes based on knowledge of the external code flows. Since the developers are experts on search, not caching, it seems justified to evaluate if an off-the-shelf library would be more helpful in terms of developer time, code complexity, and performance. Solr has already introduced Caffeine in SOLR-8241 and SOLR-13817.
The proposal is to replace the internals of LRUQueryCache so that external usages are not affected in terms of the API. However, as in SolrCache, a difference is that Caffeine only bounds by either the number of entries or an accumulated size (e.g. bytes), but not by both constraints. This is likely an acceptable divergence in how the configuration is honored.

cc @sigram, @dsmiley
Migrated from LUCENE-9038 by Ben Manes, updated Nov 26 2019
Attachments: cache.patch (versions: 2), CaffeineQueryCache.java