Use merge-sort when draining recency queues to maintain stricter ordering
+ Maintains strict LRU ordering (vs. grouping entries with like recencies).
+ Closer to what users expect when testing against LHM.
- Increases the draining penalty by replacing an O(n) operation with an O(n lg n) one.
- Should not make a statistically significant difference to the hit rate.
- Requires a wrapper with a timestamp as the element on the recency queue.
A recency queue is maintained per segment and provides a FIFO of pending read
reorder operations. Draining applies all of the operations in each queue in
turn, but disregards the ordering with respect to the other queues. The result
is that the global ordering keeps more recent entries at the tail, so that a
cold entry will be chosen for eviction. A stricter ordering incurs a higher
penalty but will probably not provide a noticeable improvement to the hit rate.
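The queue-by-queue drain described above can be sketched as follows. This is a minimal illustration with hypothetical names (`drainAll`, plain `String` keys), not the library's actual code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class DrainSketch {
    // Drains each segment's FIFO in turn: order within a queue is kept,
    // but the interleaving across queues is lost.
    static List<String> drainAll(List<Queue<String>> recencyQueues) {
        List<String> applied = new ArrayList<>();
        for (Queue<String> queue : recencyQueues) {
            String key;
            while ((key = queue.poll()) != null) {
                applied.add(key); // move this entry to the tail of the LRU list
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        Queue<String> segment0 = new ArrayDeque<>(Arrays.asList("A", "C"));
        Queue<String> segment1 = new ArrayDeque<>(Arrays.asList("B"));
        // True access order was A, B, C; the drain yields A, C, B because
        // segment0 is emptied before segment1 is consulted.
        System.out.println(drainAll(Arrays.asList(segment0, segment1)));
    }
}
```

The example shows why entries end up grouped by queue rather than in strict global recency order.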
The LIRS algorithm was intended to resolve the concerns over a weaker recency
ordering by accounting for frequency. However, since that work has stalled and
users are sometimes surprised by their own tests, it may be worth making this
change. The JavaDoc is intentionally vague about the page replacement policy to
avoid setting expectations other than that a good balance of
performance/concurrency/hit rate was chosen.
In the current configuration the worst case number of elements to drain is
1,024 [16x64], which would mean that the penalty increases by 10x
[log2(1024) = 10]. I am open to this change if others prefer a stricter recency
constraint.
Original comment by Ben.Manes@gmail.com
on 26 Jun 2010 at 9:55
Original comment by Ben.Manes@gmail.com
on 27 Jun 2010 at 12:21
Correction: because each recency queue is ordered, we can do better than
O(n lg n) time. A recursive merge sort will be O(n lg k), where k is the number
of queues and n is the number of elements. For the 16x64 configuration, that is
a 4x penalty [log2(16) = 4].
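One way to realize an O(n lg k) merge is a heap over the k queue heads, since each per-segment queue is already ordered by timestamp. This is a sketch with illustrative names (`Recency`, `mergeDrain`), not the implementation in the library:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {
    // A buffered reorder operation stamped at enqueue time.
    record Recency(long timestamp, String key) {}

    // k-way merge: the heap holds one queue per slot, keyed by the
    // timestamp of its head element; each poll/re-add costs O(lg k).
    static List<String> mergeDrain(List<Deque<Recency>> queues) {
        PriorityQueue<Deque<Recency>> heap = new PriorityQueue<>(
            Comparator.comparingLong((Deque<Recency> q) -> q.peekFirst().timestamp()));
        for (Deque<Recency> q : queues) {
            if (!q.isEmpty()) heap.add(q);
        }
        List<String> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            Deque<Recency> q = heap.poll();   // queue with the oldest head
            ordered.add(q.pollFirst().key()); // apply the oldest reorder first
            if (!q.isEmpty()) heap.add(q);    // restore heap order
        }
        return ordered;
    }

    public static void main(String[] args) {
        Deque<Recency> q1 = new ArrayDeque<>(List.of(
            new Recency(1, "A"), new Recency(4, "D")));
        Deque<Recency> q2 = new ArrayDeque<>(List.of(
            new Recency(2, "B"), new Recency(3, "C")));
        System.out.println(mergeDrain(List.of(q1, q2))); // [A, B, C, D]
    }
}
```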
We can improve this further by moving the timestamp from the proposed recency
wrapper onto the entry itself. This allows us to avoid unnecessary reorderings;
for example, the access sequence ABABA can be reduced to BA, since only each
entry's most recent access matters. This reduces the amount of work in the
penalty cycle at the expense of maintaining a lastAccessTime field per list
node. One benefit is that we can reuse that field to support a timeToIdle
expiration policy.
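The ABABA-to-BA reduction could work as follows: a buffered reorder is stale if its timestamp no longer matches the entry's lastAccessTime, because a newer access has superseded it. A sketch under that assumption (`Entry`, `Recency`, and `drain` are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;

public class DedupSketch {
    static final class Entry {
        final String key;
        long lastAccessTime; // stamped on every read
        Entry(String key) { this.key = key; }
    }

    record Recency(long timestamp, Entry entry) {}

    // Applies only the newest buffered reorder per entry; anything
    // older than the entry's lastAccessTime is stale and skipped.
    static List<String> drain(List<Recency> buffered) {
        List<String> applied = new ArrayList<>();
        for (Recency r : buffered) {
            if (r.timestamp() == r.entry().lastAccessTime) {
                applied.add(r.entry().key);
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        Entry a = new Entry("A"), b = new Entry("B");
        List<Recency> buffer = new ArrayList<>();
        long clock = 0;
        for (Entry e : List.of(a, b, a, b, a)) { // access pattern ABABA
            e.lastAccessTime = ++clock;          // stamp the entry on read
            buffer.add(new Recency(clock, e));   // buffer the reorder
        }
        System.out.println(drain(buffer));       // [B, A]
    }
}
```

Only two reorders out of five survive, which is the reduction in penalty-cycle work described above.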
With these observations the added penalty of maintaining a strict LRU is very
small. I'm strongly leaning now towards making this change.
Original comment by Ben.Manes@gmail.com
on 29 Jun 2010 at 6:15
Quick question: have you had a chance to use JConsole or VisualVM to get an
idea of where the memory overhead is coming from? I have been sick this week,
so I am a bit behind :).
Original comment by zells...@gmail.com
on 2 Jul 2010 at 1:49
I haven't had time to do more than brainstorm ideas, so I haven't done a real
analysis. The most likely cause is the recency queues, so an improved draining
process should alleviate the problem.
Original comment by Ben.Manes@gmail.com
on 2 Jul 2010 at 1:58
Randomly select the recency queue to append to on access.
+ Evenly spreads appends to avoid "hot" queues (due to "hot" entries).
+ Avoids CAS contention by removing a single bottleneck point.
- May be expensive to select the queue (random vs. thread-local selection counter).
- Reads are already fast, so not worth the effort.
This is primarily to document an observation. The previously mentioned insight
is that, since the recency queues are ordered, they can be merged back into a
strict recency order before being applied to the page replacement policy. This
means that there is no need to associate a read with a particular queue, as the
global ordering is maintained regardless.
Currently recency queues are associated with a segment, with 16 being the
default. A cache usually has a 10-20% "hot" working set that shouldn't be
discarded. This means that a bad spread could result in a single queue being
pounded on, which would cause contention on its tail (CAS spins). By not using
hashing to determine the queue, a more even spread could be achieved, reducing
the performance degradation caused by overworking a subset of the queues. As
random functions are fairly expensive, a thread-local counter could trivially
spread the operations. The queues could then be disconnected from the segments
and their number increased by a read concurrency setting.
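The thread-local counter idea could look like the following sketch. `QUEUE_COUNT` and `selectQueue` are hypothetical names for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class QueueSelectSketch {
    static final int QUEUE_COUNT = 16; // power of two for cheap masking

    // Each thread keeps its own counter, so selection requires no
    // shared state and no expensive random number generation.
    static final ThreadLocal<AtomicInteger> COUNTER =
        ThreadLocal.withInitial(AtomicInteger::new);

    // Round-robin per thread: successive reads from one thread go to
    // successive queues, spreading hot-entry traffic evenly.
    static int selectQueue() {
        return COUNTER.get().getAndIncrement() & (QUEUE_COUNT - 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.println("read " + i + " -> queue " + selectQueue());
        }
    }
}
```

Even if two threads repeatedly read the same hot entry, their appends walk across all 16 queues rather than hammering one tail.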
This enhancement is impractical because we use a hash spreader to evenly
distribute entries in the table. For a cache of any moderate size the hot
entries would reside in different segments, thereby evenly distributing the
work across the queues. A very small cache, e.g. 16 entries, with the same
general characteristics (~20% hot) may experience the symptom. However, this
would require a considerable amount of thrashing on a CAS, which experience has
shown to rarely be a performance bottleneck (e.g. AtomicInteger counters). Such
a hypothetical pathological case could be solved by a more appropriate data
structure.
This thought experiment is mostly to show that the usage of recency queues and
their ordering characteristics allows us to easily scale the data structure. At
the extreme the adoption of thread-local queues would allow infinite scaling of
the algorithm by avoiding any shared state.
Original comment by Ben.Manes@gmail.com
on 3 Jul 2010 at 8:37
I added the scaffolding for Caliper (Google's benchmarking tool). I'm planning
on fleshing out a read/write benchmark. I'd appreciate hearing any ideas on
what types of workloads to benchmark.
Original comment by Ben.Manes@gmail.com
on 5 Jul 2010 at 4:11
There are a number of scenarios discussed in
http://ehcache.org/Ehcache1_7_Terracotta311_Benchmarks.pdf. The obvious
scenarios:
1) Read-heavy (~80/20)
2) Mixed (50/50)
3) Write-heavy (~30/70)
I don't know if the size of the cached objects makes much of a difference, as
long as GC does not thrash. Will try and look into using JConsole to track
memory overhead vs. pure speed.
Original comment by zells...@gmail.com
on 5 Jul 2010 at 4:20
It would probably be easier to evaluate it with code by hooking up an
instrumentation agent and writing a benchmark to output the comparison.
(http://sizeof.sourceforge.net ;
http://code.google.com/p/java-allocation-instrumenter).
Original comment by Ben.Manes@gmail.com
on 9 Jul 2010 at 1:57
Some of these are done. What's open is:
1) Remove segment lock
This adds some complexity, but is doable. It requires retrying a write if it
determines that it lost the mutation, a la CAS. For those using
j.u.c.ConcurrentHashMap this is worse. For those using Cliff Click's lock-free
variant this may be better (Cassandra). It all depends on how write-heavy the
usage is, but since caches are read-centric I doubt this would help much. It
does partially explain why Cassandra is seeing lower write performance vs. the
second-chance version, as locks are used in this implementation.
2) Replace segment recency queue with AtomicReferenceArray-based circular
buffer.
Not worth the fuss. The queue is now chosen by the thread id rather than being
associated with the segment. This avoids CAS contention by more evenly
spreading hot reads.
3) Determine the best threshold value.
Need benchmarks. Lowest priority and I don't have the time.
4) Use merge-sort when draining recency queues to maintain stricter ordering
I tried this (attached). It adds 2-3us to the write penalty. Thoughts?
5) LIRS
I'd really like to get back to this. I think it would offer the most value.
I've had zero time, though...
Original comment by Ben.Manes@gmail.com
on 25 Oct 2010 at 3:55
Original comment by Ben.Manes@gmail.com
on 25 Oct 2010 at 3:56
Attachments:
Fixed in r475. In conclusion, the following changes were made:
The number of recency operations that can be buffered was reduced from 64 to 16
per queue. The buffered tasks are now weak references, to allow for garbage
collection if, due to concurrent behavior, a stale entry is still queued
despite having been removed or evicted.
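The weak-reference buffering could be sketched as below, assuming a drain that simply skips cleared references (`Entry` and `drain` are illustrative names, not the library's API):

```java
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Queue;

public class WeakBufferSketch {
    static final class Entry {
        final String key;
        Entry(String key) { this.key = key; }
    }

    // Drains the buffer, skipping references whose entries were
    // collected after removal/eviction; returns the number applied.
    static int drain(Queue<WeakReference<Entry>> buffer) {
        int applied = 0;
        WeakReference<Entry> ref;
        while ((ref = buffer.poll()) != null) {
            Entry entry = ref.get();
            if (entry != null) { // null means the GC already reclaimed it
                applied++;       // reorder the still-live entry
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        Queue<WeakReference<Entry>> buffer = new ArrayDeque<>();
        Entry live = new Entry("live");
        buffer.add(new WeakReference<>(live));
        buffer.add(new WeakReference<>(null)); // stands in for a cleared reference
        System.out.println(drain(buffer));     // 1
    }
}
```

Holding the entry only weakly means a queued-but-dead entry cannot pin its memory until the next drain.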
The recency queues are now selected by the thread id to allow for a better
distribution. This ensures that threads do not contend on a queue due to
thrashing on a hot entry. This improved spread also allowed for the reduction
of the recency threshold and more proactive draining behavior.
To avoid surprises when compared to LinkedHashMap, and to provide a pure LRU
policy, the recency queues are now drained in strict LRU order. This is done by
performing an O(n lg k) merge of the queues, assigning each recency an ordering
value when it is added to a queue. Due to the non-blocking nature of the
draining, the average number of elements buffered, and special care in the code
to be performant, this should not add significant overhead.
These changes should resolve all pending work on the LRU version. I will try to
revisit the LIRS policy next, which should have a noticeable performance
improvement due to providing a higher hit rate with only slightly more overhead
compared to an LRU.
Please retest with v1.1-LRU and revisit upgrading library in Cassandra.
Original comment by Ben.Manes@gmail.com
on 1 Nov 2010 at 2:55
I only see 1.0 under downloads still, should I wait for a 1.1?
Original comment by jbel...@gmail.com
on 4 Nov 2010 at 2:45
Sorry, it's been a long week. :)
I'm starting the release now.
Original comment by Ben.Manes@gmail.com
on 4 Nov 2010 at 2:56
v1.1 JDK6 is now on Maven Central and in the download section.
I'll work on the backport to JDK5 next.
Original comment by Ben.Manes@gmail.com
on 4 Nov 2010 at 3:54
Original issue reported on code.google.com by
Ben.Manes@gmail.com
on 15 Jun 2010 at 1:12