Use merge-sort when draining recency queues to maintain stricter ordering
+ Maintains strict LRU ordering (vs. grouping entries with like recencies).
+ Closer to what users expect when testing against LHM.
- Increases the draining penalty by replacing an O(n) operation with an O(n lg n) one.
- Should not make a statistically significant difference to the hit rate.
- Requires a wrapper with a timestamp as the element on the recency queue.
A recency queue is maintained per segment and provides a FIFO of pending read
reorder operations. Draining applies all of the operations in each queue in
turn, but disregards the ordering with respect to the other queues. The result
is that the global ordering keeps more recent entries at the tail, so that a
cold entry will be chosen for eviction. A stricter ordering incurs a higher
penalty but will probably not provide a noticeable improvement to the hit rate.
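The queue-by-queue drain described above can be sketched as follows. This is a minimal illustration with hypothetical names (`drainAll`, plain `String` keys), not the library's actual code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class DrainSketch {
    // Drains each segment's FIFO in turn: order within a queue is kept,
    // but the interleaving across queues is lost.
    static List<String> drainAll(List<Queue<String>> recencyQueues) {
        List<String> applied = new ArrayList<>();
        for (Queue<String> queue : recencyQueues) {
            String key;
            while ((key = queue.poll()) != null) {
                applied.add(key); // move this entry to the tail of the LRU list
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        Queue<String> segment0 = new ArrayDeque<>(Arrays.asList("A", "C"));
        Queue<String> segment1 = new ArrayDeque<>(Arrays.asList("B"));
        // True access order was A, B, C; the drain yields A, C, B because
        // segment0 is emptied before segment1 is consulted.
        System.out.println(drainAll(Arrays.asList(segment0, segment1)));
    }
}
```

The example shows why entries end up grouped by queue rather than in strict global recency order.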
The LIRS algorithm was intended to resolve the concerns over a weaker recency
ordering by accounting for frequency. However, since that work has stalled and
users are sometimes surprised by their own tests, it may be worth making this
change. The JavaDoc is intentionally vague about the page replacement policy to
avoid setting expectations other than that a good balance of
performance/concurrency/hit rate was chosen.
In the current configuration the worst case number of elements to drain is
1,024 [16x64], which would mean that the penalty increases by 10x
[log2(1024) = 10]. I am open to this change if others prefer a stricter recency
constraint.
Original comment by Ben.Manes@gmail.com
on 26 Jun 2010 at 9:55
Original comment by Ben.Manes@gmail.com
on 27 Jun 2010 at 12:21
Correction: because each recency queue is ordered, we can do better than
O(n lg n) time. A recursive merge sort will be O(n lg k), where k is the number
of queues and n is the number of elements. For the 16x64 configuration, that is
a 4x penalty [log2(16) = 4].
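One way to realize an O(n lg k) merge is a heap over the k queue heads, since each per-segment queue is already ordered by timestamp. This is a sketch with illustrative names (`Recency`, `mergeDrain`), not the implementation in the library:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {
    // A buffered reorder operation stamped at enqueue time.
    record Recency(long timestamp, String key) {}

    // k-way merge: the heap holds one queue per slot, keyed by the
    // timestamp of its head element; each poll/re-add costs O(lg k).
    static List<String> mergeDrain(List<Deque<Recency>> queues) {
        PriorityQueue<Deque<Recency>> heap = new PriorityQueue<>(
            Comparator.comparingLong((Deque<Recency> q) -> q.peekFirst().timestamp()));
        for (Deque<Recency> q : queues) {
            if (!q.isEmpty()) heap.add(q);
        }
        List<String> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            Deque<Recency> q = heap.poll();   // queue with the oldest head
            ordered.add(q.pollFirst().key()); // apply the oldest reorder first
            if (!q.isEmpty()) heap.add(q);    // restore heap order
        }
        return ordered;
    }

    public static void main(String[] args) {
        Deque<Recency> q1 = new ArrayDeque<>(List.of(
            new Recency(1, "A"), new Recency(4, "D")));
        Deque<Recency> q2 = new ArrayDeque<>(List.of(
            new Recency(2, "B"), new Recency(3, "C")));
        System.out.println(mergeDrain(List.of(q1, q2))); // [A, B, C, D]
    }
}
```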
We can improve this further by moving the timestamp from the proposed recency
wrapper onto the entry itself. This allows us to avoid unnecessary reorderings;
for example, the access sequence ABABA can be reduced to BA, since only each
entry's most recent access matters. This reduces the amount of work in the
penalty cycle at the expense of maintaining a lastAccessTime field per list
node. One benefit is that we can reuse that field to support a timeToIdle
expiration policy.
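The ABABA-to-BA reduction could work as follows: a buffered reorder is stale if its timestamp no longer matches the entry's lastAccessTime, because a newer access has superseded it. A sketch under that assumption (`Entry`, `Recency`, and `drain` are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;

public class DedupSketch {
    static final class Entry {
        final String key;
        long lastAccessTime; // stamped on every read
        Entry(String key) { this.key = key; }
    }

    record Recency(long timestamp, Entry entry) {}

    // Applies only the newest buffered reorder per entry; anything
    // older than the entry's lastAccessTime is stale and skipped.
    static List<String> drain(List<Recency> buffered) {
        List<String> applied = new ArrayList<>();
        for (Recency r : buffered) {
            if (r.timestamp() == r.entry().lastAccessTime) {
                applied.add(r.entry().key);
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        Entry a = new Entry("A"), b = new Entry("B");
        List<Recency> buffer = new ArrayList<>();
        long clock = 0;
        for (Entry e : List.of(a, b, a, b, a)) { // access pattern ABABA
            e.lastAccessTime = ++clock;          // stamp the entry on read
            buffer.add(new Recency(clock, e));   // buffer the reorder
        }
        System.out.println(drain(buffer));       // [B, A]
    }
}
```

Only two reorders out of five survive, which is the reduction in penalty-cycle work described above.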
With these observations the added penalty of maintaining a strict LRU is very
small. I'm strongly leaning now towards making this change.
Original comment by Ben.Manes@gmail.com
on 29 Jun 2010 at 6:15
Quick question: have you had a chance to use JConsole or VisualVM to get an
idea of where the memory overhead is coming from? I have been sick this week,
so I am a bit behind :).
Original comment by zells...@gmail.com
on 2 Jul 2010 at 1:49
I haven't had time to do more than brainstorm ideas, so I haven't done a real
analysis. The most likely cause is the recency queues, so an improved draining
process should alleviate the problem.
Original comment by Ben.Manes@gmail.com
on 2 Jul 2010 at 1:58
Randomly select the recency queue to append to on access.
+ Evenly spreads appends to avoid "hot" queues (due to "hot" entries).
+ Avoids CAS contention by removing a single bottleneck point.
- May be expensive to select the queue (random vs. thread-local selection counter).
- Reads are already fast, so not worth the effort.
This is primarily to document an observation. The previously mentioned insight
is that, since the recency queues are ordered, they can be merged back into a
strict recency order before being applied to the page replacement policy. This
means that there is no need to associate a read with a particular queue, as the
global ordering is maintained regardless.
Currently recency queues are associated with a segment, with 16 being the
default. A cache usually has a 10-20% "hot" working set that shouldn't be
discarded. This means that a bad spread could result in a single queue being
pounded on, which would cause contention on its tail (CAS spins). By not using
hashing to determine the queue, a more even spread could be achieved, reducing
the performance degradation caused by overworking a subset of the queues. As
random functions are fairly expensive, a thread-local counter could trivially
spread the operations. The queues could then be disconnected from the segments
and their number increased by a read concurrency setting.
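The thread-local counter idea could look like the following sketch. `QUEUE_COUNT` and `selectQueue` are hypothetical names for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class QueueSelectSketch {
    static final int QUEUE_COUNT = 16; // power of two for cheap masking

    // Each thread keeps its own counter, so selection requires no
    // shared state and no expensive random number generation.
    static final ThreadLocal<AtomicInteger> COUNTER =
        ThreadLocal.withInitial(AtomicInteger::new);

    // Round-robin per thread: successive reads from one thread go to
    // successive queues, spreading hot-entry traffic evenly.
    static int selectQueue() {
        return COUNTER.get().getAndIncrement() & (QUEUE_COUNT - 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.println("read " + i + " -> queue " + selectQueue());
        }
    }
}
```

Even if two threads repeatedly read the same hot entry, their appends walk across all 16 queues rather than hammering one tail.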
This enhancement is impractical because we use a hash spreader to evenly
distribute entries in the table. For a cache of any moderate size the hot
entries would reside in different segments, thereby evenly distributing the
work across the queues. A very small cache, e.g. 16 entries, with the same
general characteristics (~20% hot) may experience the symptom. However, this
would require a considerable amount of thrashing on a CAS, which experience has
shown to rarely be a performance bottleneck (e.g. AtomicInteger counters). Such
a hypothetical pathological case could be solved by a more appropriate data
structure.
This thought experiment is mostly to show that the usage of recency queues and
their ordering characteristics allows us to easily scale the data structure. At
the extreme the adoption of thread-local queues would allow infinite scaling of
the algorithm by avoiding any shared state.
Original comment by Ben.Manes@gmail.com
on 3 Jul 2010 at 8:37
I added the scaffolding for Caliper (Google's benchmarking tool). I'm planning
on fleshing out a read/write benchmark. I'd appreciate hearing any ideas on
what types of workloads to benchmark.
Original comment by Ben.Manes@gmail.com
on 5 Jul 2010 at 4:11
There are a number of scenarios discussed in
http://ehcache.org/Ehcache1_7_Terracotta311_Benchmarks.pdf. The obvious
scenarios:
1) Read-heavy (~80/20)
2) Mixed (50/50)
3) Write-heavy (~30/70)
I don't know if the size of the cached objects makes much of a difference, as
long as GC does not thrash. Will try and look into using JConsole to track
memory overhead vs. pure speed.
Original comment by zells...@gmail.com
on 5 Jul 2010 at 4:20
It would probably be easier to evaluate it with code by hooking up an
instrumentation agent and writing a benchmark to output the comparison.
(http://sizeof.sourceforge.net ;
http://code.google.com/p/java-allocation-instrumenter).
Original comment by Ben.Manes@gmail.com
on 9 Jul 2010 at 1:57
Some of these are done. What's open is:
1) Remove segment lock
This adds some complexity, but is doable. It requires retrying a write if it
determines that it lost the mutation, a la CAS. For those using
j.u.c.ConcurrentHashMap this is worse. For those using Cliff Click's lock-free
variant this may be better (Cassandra). It all depends on how write-heavy the
usage is, but since caches are read-centric I doubt this would help much. It
does partially explain why Cassandra is seeing lower write performance vs. the
second-chance version, as locks are used in this implementation.
2) Replace segment recency queue with AtomicReferenceArray-based circular
buffer.
Not worth the fuss. The queue is now chosen by the thread id rather than being
associated with the segment. This avoids CAS contention by more evenly
spreading hot reads.
3) Determine the best threshold value.
Need benchmarks. Lowest priority and I don't have the time.
4) Use merge-sort when draining recency queues to maintain stricter ordering
I tried this (attached). It adds 2-3us to the write penalty. Thoughts?
5) LIRS
I'd really like to get back to this. I think it would offer the most value.
I've had zero time, though...
Original comment by Ben.Manes@gmail.com
on 25 Oct 2010 at 3:55
Original comment by Ben.Manes@gmail.com
on 25 Oct 2010 at 3:56
Attachments:
Fixed in r475. In conclusion, the following changes were made:
The number of recency operations that can be buffered was reduced from 64 to 16
per queue. The buffered tasks are now weak references, to allow for garbage
collection if, due to concurrent behavior, a stale entry is still queued
despite having been removed or evicted.
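The weak-reference buffering could be sketched as below, assuming a drain that simply skips cleared references (`Entry` and `drain` are illustrative names, not the library's API):

```java
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Queue;

public class WeakBufferSketch {
    static final class Entry {
        final String key;
        Entry(String key) { this.key = key; }
    }

    // Drains the buffer, skipping references whose entries were
    // collected after removal/eviction; returns the number applied.
    static int drain(Queue<WeakReference<Entry>> buffer) {
        int applied = 0;
        WeakReference<Entry> ref;
        while ((ref = buffer.poll()) != null) {
            Entry entry = ref.get();
            if (entry != null) { // null means the GC already reclaimed it
                applied++;       // reorder the still-live entry
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        Queue<WeakReference<Entry>> buffer = new ArrayDeque<>();
        Entry live = new Entry("live");
        buffer.add(new WeakReference<>(live));
        buffer.add(new WeakReference<>(null)); // stands in for a cleared reference
        System.out.println(drain(buffer));     // 1
    }
}
```

Holding the entry only weakly means a queued-but-dead entry cannot pin its memory until the next drain.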
The recency queues are now selected by the thread id to allow for a better
distribution. This ensures that threads do not contend on a queue due to
thrashing on a hot entry. This improved spread also allowed for the reduction
of the recency threshold and more proactive draining behavior.
To avoid surprises when compared to LinkedHashMap, and to provide a pure LRU
policy, the recency queues are now drained in strict LRU order. This is done by
performing an O(n lg k) merge of the queues, assigning each recency an ordering
value when it is added to a queue. Due to the non-blocking nature of the
draining, the average number of elements buffered, and special care in the code
to be performant, this should not add significant overhead.
These changes should resolve all pending work on the LRU version. I will try to
revisit the LIRS policy next, which should have a noticeable performance
improvement due to providing a higher hit rate with only slightly more overhead
compared to an LRU.
Please retest with v1.1-LRU and revisit upgrading library in Cassandra.
Original comment by Ben.Manes@gmail.com
on 1 Nov 2010 at 2:55
I only see 1.0 under downloads still, should I wait for a 1.1?
Original comment by jbel...@gmail.com
on 4 Nov 2010 at 2:45
Sorry, it's been a long week. :)
I'm starting the release now.
Original comment by Ben.Manes@gmail.com
on 4 Nov 2010 at 2:56
v1.1 JDK6 is now on Maven Central and in the download section.
I'll work on the backport to JDK5 next.
Original comment by Ben.Manes@gmail.com
on 4 Nov 2010 at 3:54
Original issue reported on code.google.com by
Ben.Manes@gmail.com
on 15 Jun 2010 at 1:12