aktau opened this issue 1 year ago
@aktau Thank you for sharing the information! I think we can take a look at this discussion by Ben, the author of Caffeine/TinyLFU: https://github.com/1a1a11a/libCacheSim/discussions/20
I got a new laptop last week (m3 max 14-core) and ran Caffeine's benchmark using 16 threads. At close to 1B reads/s (40% of an unbounded baseline), I have to disagree with those authors that this represents a scalability bottleneck. There is also an independent hit rate analysis by a library author who is trying to adopt their algorithm with their guidance. That shows a large fluctuation where the hit rate can decrease at larger sizes. It does not appear yet to be a reliably general purpose algorithm.
Let me give my opinion as well, since I've already made quite a few attempts at adopting it.
First, I tried to implement the cache on a hash table and lock-free queues, as the authors describe. I was annoyed: s3-fifo was indeed many times faster than an lru cache implemented on a hash table with a global mutex, but it still lost to ristretto and theine, which use bp-wrapper techniques. The profiler also showed that it was the eviction policy that was taking the bulk of the time. I was only able to beat ristretto and theine after rewriting the implementation using bp-wrapper.
Actually, the big hit ratio jumps are caused by differences in the implementation of the ghost queue. In the paper the authors suggest keeping the evicted items themselves stored in the ghost queue (this is what redpanda does now, and I decided to try doing it this way too). It seems that the large hit ratio jumps are caused by the fact that the size of the ghost queue is limited by the size of the main queue in the algorithm. And with a larger cache capacity the capacity of the small queue increases => we store items longer => more items get into the main queue => the ghost queue also stores more items. In the end I came to the conclusion that storing small queue capacity + main queue capacity + ghost queue capacity = 0.1 capacity + 0.9 capacity + 0.9 capacity = 1.9 capacity worth of items is too much of a cheat. So I decided to try storing hashes of items in the ghost queue and evicting them in FIFO order. And got frustrated. Limiting the ghost queue to the size of the main queue works very badly at small cache sizes. The best I got tonight was limiting the ghost queue to the size of the entire cache, with these results: r.txt. S3-FIFO (with minor modifications) actually outperformed or was about equal to W-TinyLFU on all traces except S3 and DS1, without much fluctuation.
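For illustration, a rough single-threaded sketch of that hash-based ghost queue idea (hypothetical type and names, not otter's actual code): hashes live in a FIFO ring capped at the chosen capacity, with a counting map for membership checks.

```go
// ghostQueue remembers the hashes of recently evicted items, evicting the
// oldest hash in FIFO order once the configured capacity is reached.
type ghostQueue struct {
	hashes  []uint64       // ring buffer of hashes, oldest at head
	head    int            // index of the oldest hash
	size    int            // number of hashes currently stored
	present map[uint64]int // hash -> occurrences currently in the ring
}

func newGhostQueue(capacity int) *ghostQueue {
	return &ghostQueue{hashes: make([]uint64, capacity), present: make(map[uint64]int)}
}

func (g *ghostQueue) Add(hash uint64) {
	if g.size == len(g.hashes) {
		// Full: drop the oldest hash in FIFO order.
		old := g.hashes[g.head]
		if g.present[old]--; g.present[old] == 0 {
			delete(g.present, old)
		}
		g.head = (g.head + 1) % len(g.hashes)
		g.size--
	}
	g.hashes[(g.head+g.size)%len(g.hashes)] = hash
	g.present[hash]++
	g.size++
}

func (g *ghostQueue) Contains(hash uint64) bool { return g.present[hash] > 0 }
```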
In summary, I completely disagree about the scalability advantage of S3-FIFO over the other policies. In terms of hit ratio S3-FIFO can indeed often outperform W-TinyLFU, but it does poorly on LFU-friendly traces. I think S3-FIFO has a right to live, especially if someone can make something more adaptive out of it, but the thought of dropping everything and rewriting it all on top of W-TinyLFU is getting more and more tempting.
Though to be honest, I still don't understand what was causing the hit ratio to drop on DS1.
Thanks, @maypok86! It's always very interesting to hear an experience report.
I can't speak for any implementation other than my own, so it would be helpful if you could also compare hit rates against Caffeine in its simulator. Could you please give that a try? I expect that static W-TinyLFU will lose on recency-biased traces, as this was a known deficiency (shared with the S3-FIFO author in 2016 here). The adaptive scheme has worked very well in all of the workloads that I have tested, where the policy is competitive with the leading algorithm. I don't expect it to always win, but for a general-purpose cache, I wanted it to robustly be near the top in any user workload. There are likely areas deserving improvement, but without data, I have to wait until there is something to analyze and discuss.
If comparing just on hit rates, the best alternative policy that I have seen is LIRS2 (code, paper, video). However, LIRS (v1) is very complicated to implement correctly and debug, which is why I invested in improving TinyLFU, as it seemed to be a more promising approach that I could reasonably maintain. In their v2, it is competitive except for my adaptive stress test. Of course, in practice, one has to consider all of the other characteristics, like metadata overhead and concurrency needs. There are a lot of design tradeoffs to consider when deciding on a policy, as the best choice depends on the end goals.
I would have expected S3FIFO to support lock-free reads and blocking writes based on the description. Since the worst-case eviction time is O(n), that time under the lock could become a concern. I'd probably just evict the next item after a threshold scan limit because I would be concerned about an unknown user workload hitting the worst-case behavior (akin to GC thrashing causing long pauses). The BP-Wrapper approach amortizes those costs as it samples the read requests, and the read throughput exceeds practical needs, so the time it steals helps keep the critical section bounded and more predictable. It is more complex than using a Clock-based policy, but as an engineer, this predictability in tail latencies is worth the effort when providing a library. My very limited (and likely outdated) understanding is that Go does not have a very good suite of concurrent data structures and primitives, at least compared to Java's, so it can be much more difficult to implement efficient multi-threaded code. I don't recall the author implementing a thread-safe version, so there might be oversights, and I suppose those claims should be taken skeptically.
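To make the scan-limit idea concrete, a rough sketch (hypothetical queue and node types, not Caffeine's or any S3-FIFO implementation's actual code) of bounding the eviction search:

```go
// node and fifoQueue are minimal stand-ins for a doubly linked FIFO of entries.
type node struct {
	next, prev *node
	key        uint64
	freq       int // small saturating frequency counter
}

type fifoQueue struct {
	head, tail *node // head = newest, tail = oldest
}

const scanLimit = 20

// evictOne pops a victim from the FIFO tail, giving recently used entries a
// second chance, but gives up searching after scanLimit attempts so the time
// spent under the eviction lock stays bounded even for a worst-case workload.
func (q *fifoQueue) evictOne() *node {
	for scanned := 0; ; scanned++ {
		victim := q.popTail()
		if victim == nil {
			return nil // queue is empty
		}
		if victim.freq == 0 || scanned >= scanLimit {
			return victim // genuinely cold, or the scan budget was exhausted
		}
		victim.freq = 0
		q.pushHead(victim) // second chance: reinsert at the head
	}
}

func (q *fifoQueue) popTail() *node {
	n := q.tail
	if n == nil {
		return nil
	}
	q.tail = n.prev
	if q.tail == nil {
		q.head = nil
	} else {
		q.tail.next = nil
	}
	n.prev = nil
	return n
}

func (q *fifoQueue) pushHead(n *node) {
	n.next = q.head
	if q.head != nil {
		q.head.prev = n
	}
	q.head = n
	if q.tail == nil {
		q.tail = n
	}
}
```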
@maypok86 one thing I'm interested in: did you compare the performance of xsync map and a sharded map (throughput and GC)? According to the xsync author there is some overhead (https://github.com/puzpuzpuz/xsync/issues/102). If xsync map always outperforms, I think I can also switch to it. The current Theine implementation is based on a sharded map and I do some optimizations on it. Also, as @ben-manes suggested, you can test hit rates using Caffeine's simulator; Theine's window queue and adaptive algorithm are a little different from Caffeine's.
@Yiling-J I wanted to open a separate issue about this (and most likely about theine benchmarks in general), but that would be a very long discussion, and right now it's much more important for me to finish the work on otter. So I'll come back with that, but much later. Otherwise it would be like with the author of memorycache: I did manage to get my ideas across, but it took a huge amount of time.
@ben-manes LIRS2 looks very smart, I'll try to look in its direction, but I'm not sure it would be easy to maintain such a thing. I also tried running the caffeine simulator and the verdict is about the same: S3-FIFO was within ±2% on all traces used in otter except S3 and DS1, and otter showed about the same hit ratio. On DS1 (6,000,000 capacity), for example, the loss was quite serious:
| Policy           | Hit Rate | Hits       | Misses     | Requests   | Evictions  | Steps      | Time      |
|------------------|----------|------------|------------|------------|------------|------------|-----------|
| opt.Clairvoyant  | 61.82 %  | 27,019,597 | 16,685,382 | 43,704,979 | 10,685,382 |            | 1.080 min |
| product.Caffeine | 58.51 %  | 25,570,957 | 18,134,022 | 43,704,979 | 12,134,022 |            | 39.94 s   |
| two-queue.Qdlp   | 46.65 %  | 20,390,489 | 23,314,490 | 43,704,979 | 17,314,490 | 43,704,979 | 35.36 s   |
By the way, I was quite surprised by the difference in hit ratio between theine and caffeine on P3 (25,000 capacity): 8% vs 14%.
Yes, perhaps the main problems with S3-FIFO are the inability to adapt and the possibility of an O(n) eviction, and any attempt to fix this can lead to a worse hit ratio. And in the short term, I don't know what to do about it.
Go really still doesn't have efficient concurrent data structures, there are only implementations scattered across different libraries based on research papers or ports from other languages, but unfortunately most of them are barely supported. This is largely what motivated me to build an efficient cache library under such limited circumstances.
It looks like the ARC traces are no longer public due to an overzealous Dropbox security restriction, and unfortunately I did not use P3 regularly or recently enough to have it locally. I emailed the author to ask if they can be made available again, but if you have all of the downloads I might need you to publish them.
@ben-manes I just took the traces from ristretto benchmarks https://github.com/dgraph-io/benchmarks/tree/master/cachebench/ristretto/trace (better to download individually as the repository is very large).
Hi all, I missed the fun party! This is the author of the S3-FIFO algorithm. @maypok86 I just created an issue (https://github.com/maypok86/otter/issues/17); I believe there is a potential bug causing more objects to be evicted than necessary, and fixing it should address the fluctuation problem. However, it is true that W-TinyLFU is better than S3-FIFO on the DS1 and S3 traces. I believe this partially comes from not being adaptive, and this is one direction we are working on. In general, S3-FIFO is more suitable for web workloads (e.g., key-value cache, object cache), which follow skewed popularity with popularity decay.
On block workloads, especially the old ones (like the ARC ones, which are more than twenty years old), S3-FIFO may not outperform W-TinyLFU. Similar results can also be found in our paper https://dl.acm.org/doi/10.1145/3600006.3613147
With that being said, I believe if you need a very general cache replacement policy that can handle both block and web workloads, don't mind the complexity, don't mind the scalability, and are ok with extra metadata overhead, then W-TinyLFU is the better choice.
Hi @ben-manes, can you describe the scalability benchmark you showed in the figure in more detail? I am interested in learning how you improved LRU's scalability. We did have a prototype in which we observed much better scalability than the TinyLFU in the cachelib library.
@1a1a11a I'm certainly not Ben, but it seems that bp-wrapper and a set of wait-free lossy buffers can give even greater scalability than lock-free queues. Articles on caffeine architecture: first and second
Cachelib mimics memcached's scheme of an exclusive lock (optionally a try-lock) and a per-item time interval between policy updates (60s). That was good enough for their internal needs.
Caffeine uses dynamic striped, lock-free, lossy mpsc ring buffers to sample the request events. This differs from the author's BP-Wrapper implementation where they used an isolated system with a dedicated cache (Postgres), so they leveraged thread-local buffers as the memory usage and thread counts were known up front. That didn't make sense for a general purpose library embedded into an application which might have hundreds of instances, and each with different scalability needs. It is their design but with a different implementation approach.
The dynamic striping (the number of read buffers) increases as contention is detected, mirroring how a scalable counter is implemented. The ring buffers capture the events cheaply, and if a buffer is full the event is dropped. Since a cache is probabilistic and trying to detect a usage pattern for hot-cold entries, this lossy behavior does not meaningfully impact the hit rates. Whether one uses a lossy buffer, a time threshold, a fifo clock, etc., the insight is the same: a perfect history of the access order is not required to make good predictions.
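For a rough idea of what such a lossy buffer looks like in Go (hypothetical names and a simplified single-consumer design, not Caffeine's actual implementation), an offer simply fails instead of blocking when the buffer is full or contended:

```go
import "sync/atomic"

const readBufferSize = 128 // power of two so indices can be masked

type readBuffer[K any] struct {
	head  atomic.Uint64                     // next slot to drain (single consumer)
	tail  atomic.Uint64                     // next slot to claim (many producers)
	slots [readBufferSize]atomic.Pointer[K] // recorded keys, nil until published
}

// Offer records a read if a slot is free; otherwise the sample is dropped.
func (b *readBuffer[K]) Offer(key *K) bool {
	head, tail := b.head.Load(), b.tail.Load()
	if tail-head >= readBufferSize {
		return false // full: lose the event rather than contend
	}
	if !b.tail.CompareAndSwap(tail, tail+1) {
		return false // lost the race to another reader: also fine, it is lossy
	}
	b.slots[tail&(readBufferSize-1)].Store(key)
	return true
}

// DrainTo is called by the single maintenance thread that holds the policy lock.
func (b *readBuffer[K]) DrainTo(consume func(*K)) {
	head, tail := b.head.Load(), b.tail.Load()
	for ; head != tail; head++ {
		slot := &b.slots[head&(readBufferSize-1)]
		if key := slot.Load(); key != nil { // skip slots not yet published
			consume(key)
			slot.Store(nil)
		}
	}
	b.head.Store(head)
}
```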
The writes are also buffered using a bounded mpsc queue, which grows up to a threshold. This allows the cache to absorb most independent writes without threads contending on each other, and to batch the work to be replayed under the lock. These writes are cache misses, which indicates that the application had to perform an expensive load (10ms+). While in applications the write rate won't exceed the eviction rate, it can happen in a synthetic stress test. In those cases back-pressure is applied to the buffering, which assists in batching the work to avoid context switches or lock contention. Generally writes are limited by the hash table's write throughput, as shown in those benchmarks. The hash table offers lock-free reads, uses dynamically striped locks for writes, offers computations under the fine-grained lock, and protects against hash flooding by switching the buckets from linked lists to red-black trees when needed.
The buffering of events allows us to schedule the draining as appropriate, either after a write or when a read buffer is full. That uses a simple state machine and a try-lock to coordinate, so the eviction policy does not suffer from lock contention. As BP-Wrapper describes itself, it "(almost) eliminates lock contention for any replacement algorithm without requiring any changes to the algorithm". This allows for more exploration of algorithms and data structures because the policies do not need to be thread-safe.
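Sketched in Go (hypothetical field names, building on the lossy read buffer shown above; not Caffeine's actual code), that scheduling is roughly: record the event, then let whoever wins a try-lock replay the buffers against the policy.

```go
import (
	"sync"
	"sync/atomic"
)

const (
	drainIdle     int32 = 0
	drainRequired int32 = 1
)

type cacheShell[K any] struct {
	reads       *readBuffer[K] // lossy ring buffer from the sketch above
	writes      chan func()    // bounded: a full buffer applies back-pressure
	drainStatus atomic.Int32
	policyMu    sync.Mutex // guards the non-thread-safe eviction policy
	policy      interface{ recordAccess(*K) }
}

func (c *cacheShell[K]) afterRead(key *K) {
	// A failed offer just means a dropped sample, but it signals that the
	// buffers should be drained soon.
	if !c.reads.Offer(key) || c.drainStatus.Load() == drainRequired {
		c.tryDrain()
	}
}

func (c *cacheShell[K]) afterWrite(task func()) {
	c.writes <- task // blocks only when the bounded buffer is saturated
	c.drainStatus.Store(drainRequired)
	c.tryDrain()
}

func (c *cacheShell[K]) tryDrain() {
	if !c.policyMu.TryLock() {
		return // another thread is already performing the maintenance
	}
	defer c.policyMu.Unlock()
	c.drainStatus.Store(drainIdle)
	c.reads.DrainTo(c.policy.recordAccess)
	for {
		select {
		case task := <-c.writes:
			task() // replay the buffered write against the policy under the lock
		default:
			return
		}
	}
}
```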
The benchmark that I use has a pre-populated cache with a zipf distribution for the access requests. That creates hot spots, as some entries are used much more often than others, just like a real cache would experience. A common mistake is to use a uniform distribution, which evenly spreads the contention and would imply that random replacement is the ideal policy. A skewed distribution means that locks suffer higher contention, which exacerbates the problem, while also benefiting contention-free implementations that can better utilize the cpu cache. In the real world, 50-70M reads/s is a good bar to reach to satisfy most users, and the focus turns towards other areas like hit rate and features.
you need a very general cache replacement that can handle both block and web workloads
Yes
and don't mind the complexity
Yes. There are more pieces, each simple, which offer greater predictability. It is a tradeoff of a little more upfront work for fewer runtime surprises.
don't mind scalability
No. If we assume zero overhead (an unbounded map) then the measured per operation overhead is,
1.06 nanoseconds - 0.435 nanoseconds = 0.625 nanoseconds
A Clock-based policy would fall in the middle, e.g. 0.7 ns. At these numbers the cache would not be the scalability bottleneck and it would be bounded by something else in the system.
ok with extra metadata overhead
This seems debatable and would require being more explicit about the calculations.
If we assume a traditional ghost history based on a doubly-linked list, closed hashing, and 64-bit machine then
List node (2 x 8 bytes) + Hash Entry (2 x 8 bytes) + key hash (4 bytes) = 36 bytes
This does not account for wasted space due to the load factor, object headers, etc that might be required.
If optimized for space, we can say the lower limit must be at least the 32-bit key hash (4 bytes). As in, we increase the time complexity for find / add / delete through an array of hashes. This seems to be an unlikely implementation choice and it will be higher in practice.
Caffeine uses a CountMinSketch sized at 8 bytes x ceilingPowerOfTwo(maximumSize) to hold 4-bit counters. This means that each entry adds 8-16 bytes. The paper discusses the doorkeeper optimization (adding a bloom filter) to reduce the size of the CountMinSketch. In my experiments this allowed the total memory overhead to be reduced by 50% without meaningfully impacting the hit rate. Thus, we are looking at a range between 4-16 bytes per entry for maintaining the history metadata.
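To make the 4-bit counter idea concrete, here is a simplified Go sketch in the spirit of that design (an illustration only, not Caffeine's FrequencySketch, and the periodic aging/reset of the counters is omitted): one 8-byte word per power-of-two-rounded entry, each word packing sixteen counters that saturate at 15.

```go
type frequencySketch struct {
	table []uint64 // each word packs 16 4-bit counters
	mask  uint64
}

func newFrequencySketch(maximumSize int) *frequencySketch {
	n := 1
	for n < maximumSize {
		n <<= 1 // ceilingPowerOfTwo(maximumSize) words of 8 bytes each
	}
	return &frequencySketch{table: make([]uint64, n), mask: uint64(n - 1)}
}

// increment bumps one 4-bit counter per hash function, saturating at 15.
func (s *frequencySketch) increment(hash uint64) {
	for seed := uint64(0); seed < 4; seed++ {
		h := rehash(hash, seed)
		idx := h & s.mask
		shift := ((h >> 32) % 16) * 4 // which 4-bit counter within the word
		if (s.table[idx]>>shift)&0xF < 15 {
			s.table[idx] += 1 << shift
		}
	}
}

// frequency estimates popularity as the minimum of the selected counters.
func (s *frequencySketch) frequency(hash uint64) uint64 {
	min := uint64(15)
	for seed := uint64(0); seed < 4; seed++ {
		h := rehash(hash, seed)
		idx := h & s.mask
		shift := ((h >> 32) % 16) * 4
		if c := (s.table[idx] >> shift) & 0xF; c < min {
			min = c
		}
	}
	return min
}

// rehash derives loosely independent hash functions from one 64-bit hash.
func rehash(hash, seed uint64) uint64 {
	h := (hash + seed) * 0x9E3779B97F4A7C15
	h ^= h >> 32
	return h * 0xBF58476D1CE4E5B9
}
```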
The extra metadata overhead of maintaining the history seems to be, at best, equivalent. In practice it appears more likely that the frequency sketch will be a lower cost than a ghost list. The size of the ghost list is the unknown factor, e.g. @maypok86 uses 0.9x. If it was instead significantly smaller than the total capacity, e.g. 0.2x, then it could be favorable.
The resident per entry metadata overhead is less clear. W-TinyLFU describes only an algorithmic structure and the paper discusses the implementation choices made by Caffeine for the purpose of an evaluation. Ristretto decided to use sampled LFU with TinyLFU (doorkeeper + count-min) to reduce the metadata overhead, based on their analysis that their use-case was frequency-biased. It appears to me that S3-FIFO and W-TinyLFU would be equivalent for resident entries in their common implementations.
Ugh, I urgently need a popcorn reaction on GitHub. I'm probably not competent enough to argue about S3-FIFO and W-TinyLFU. Let me just say that if you only need find and add operations in a ghost queue, you can manage with about 18 additional bytes if the hash takes 8 bytes (half that if it takes 4 bytes), but if you need to remove arbitrary items from it when changing the algorithm, things get bad.
It's not so much an argument for or against, it's that there are a lot of design choices where those details matter. I find it frustrating when sweeping generalizations are made that conveniently leave out important caveats or alternatives that a reader should be aware of because they may weaken the author's thesis. As an engineer, I appreciate having a richer picture because I need to balance tradeoffs, support the usage long-term, and cope with future requirements. This is in conflict with an academic audience where goals are more short-term, e.g. peer review, and typically the work is discarded after publication. When the discussions are oriented towards an engineering audience then going into more breadth and depth over the facts matters so that the tradeoffs are understood. Usually that should end up with a mixture of ideas, like how you applied insights from both BP-Wrapper and S3-FIFO, to discover the right design fit for the target environment.
Hi @ben-manes, Thanks for the super detailed explanations! A few comments and questions.
I like the TinyLFU design, and it shows the second-best results in our evaluation (with a window size of 10%). I have no doubt in calling it state-of-the-art.
The figure above shows the evaluation of 6594 traces from 14 datasets (CDN, kv, and block). Whether S3-FIFO is truly better than TinyLFU is hard to say; we have traces where S3-FIFO is better (especially for web workloads), and we also have traces where S3-FIFO is worse than TinyLFU (especially on block workloads). These empirical results depend on workloads, and since we do not have a good model of workloads, I will not argue that S3-FIFO is strictly better than TinyLFU on the miss ratio.
Regarding the metadata size, as a FIFO-based algorithm, if objects in the workload have a uniform size, we can implement the FIFO queues using ring buffers, and AFAIK, several production cache implementations (e.g., TigerBeetle, VMware) do use ring buffers.
For the ghost entries, we implemented them as part of a bucket-based hash table to take advantage of the over-provisioned space. They are cleared out during hash collisions. This means we used no more than 8 bytes (4-byte timestamp, and 4-byte fingerprint) per ghost entry. Since there are a similar number of ghost entries as the entries in the cache, we can estimate that the ghost entries add no more than 8 bytes per cached object. Our sensitivity analysis shows that we can reduce the number of ghost entries to 40% of the cached entries with almost no impact. In summary, the ghost entry adds fewer than 4 bytes per object.
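In Go terms, a ghost entry of that shape would be something like the following (a paraphrase of the description above, not libCacheSim's code):

```go
// ghostEntry is the 8-byte record described above: it lives in the spare
// slots of a bucket-based hash table and is simply overwritten on collision.
type ghostEntry struct {
	fingerprint uint32 // hash fingerprint of the evicted object's key
	timestamp   uint32 // coarse insertion time, used to age entries out
}
```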
I appreciate the effort put into making Caffeine very scalable. However, the FIFO queues in S3-FIFO do not need any fancy techniques to be scalable. All we need is atomics. I implemented it in C/C++, and everything is straightforward. However, I recently realized that it is non-trivial to have scalable and efficient implementations in other languages, e.g., Java and Rust. But the scalability of S3-FIFO is fundamental to the design.
1.06 nanoseconds - 0.435 nanoseconds = 0.625 nanoseconds
Can you explain what the numbers are?
I still have doubts about how Caffeine can achieve 1000 MQPS on 16 threads. The numbers indicate that a hash look-up plus recording the request in a ring buffer takes 16 ns. But a CPU L1 access is ~1 ns, and a DRAM access takes 100 ns. I simply don't understand how this is possible. This indicates that most data accesses (including the hash table) happen in the L1 cache.
A friend of mine implemented (not optimized) FIFO in C and benchmarked its scalability on a server with 72 cores (using a pre-populated Zipf request stream). However, she was only able to get ~100 MQPS.
Happy to learn more from you! :)
Great comments, @1a1a11a. I am very happy to see more caches moving beyond LRU, so a variety of simple, easy, or more comprehensive solutions is wonderful. There is a lot of low-hanging fruit beyond the classics.
One of the differences to keep in mind is that the TinyLFU papers are closer to design patterns than algorithms. They provide a structure and techniques but only suggest implementation approaches. That means that implementations can differ; for example, cachelib uses 32-bit counters in their CountMinSketch, whereas Caffeine's is 4-bit. The variety of choices can impact metadata overhead, hit rate, throughput, and concurrency support.
I don't believe that the ghost entry's cost of the hash table and FIFO queue can be dismissed to say it only takes 8 bytes (timestamp + fingerprint). A hash table typically has a load factor of around 75%, and may be higher for a cache since the maximum capacity is known. That leaves less space for the over-provisioned table size but also adds at least a pointer for the chained entry and maybe more (e.g. hash field if cached, uses a doubly linked list for faster removals, etc). The cost of the FIFO should also be accounted for as part of the ghost's overhead. A reasonable minimum estimate may be that the FIFO holds a 4-byte fingerprint, the hash table adds 8 bytes for the chain pointer, the table over-provisioning is dismissed for simplicity, the entry costs 8 bytes for the key and timestamp, and ghost region ranges from 0.4x to 0.9x of total capacity. That is then an incremental cost of 8 to 18 bytes of metadata overhead for each additional cache entry. That's great and not much higher than the 4-16 bytes per entry for TinyLFU.
This indicates that most data accesses (including the hash table) happen in the L1 cache.
Yes, that is the ideal because this micro-benchmark tries to isolate the costs to only measuring the cache in a concurrent workload. It would show bottlenecks such as due to hardware cache coherence, data dependencies, poor branch prediction, saturation of the logic units (there are multiple ALUs for every FP unit), (lack of) SIMD, waits on the store buffer, blocking and context switches, system calls, inefficient algorithms, etc.
The numbers indicate that a hash look-up and recording the request on a ring buffer takes 16 ns. But CPU L1 access is ~1 ns, and DRAM access takes 100 ns.
A modern CPU core is superscalar (6-wide) and fuses multiple operations into a bundle (8 micro-ops). If we avoid data dependencies and use simple integer instructions, then we'll see higher CPU utilization, as this is much more work than a single instruction per cycle. At best a core could perform 48 instructions per cycle, which at 4GHz means 192 instructions per nanosecond, or 2688 instructions per nanosecond across all 14 cores.
The benchmark's overhead is thread-local work for an index increment, array lookup, loop, and operation counter. A read performs a hash table lookup, a few liveness validation checks, hashing, and maybe an addition to a ring buffer. The hashes are very inexpensive bitwise and multiplication operations. Since the ring buffers for recording a read are lossy and drained by a single thread, they fill up very quickly and additions are usually skipped. In those cases there is no blocking on a lock, no CAS, no volatile writes, etc., so at maximum load it stays close to the hash table's performance. If we introduce a cost like data dependencies (such as by using a random number generator to select the next key), we'd see this fall sharply as we no longer allow the CPU/compiler to use the hardware's full potential.
I refer to BP-Wrapper as request sampling because it sheds load (policy maintenance) by not recording the read in the ring buffer. The popular items will reenter the buffer at a higher frequency so when the policy drains the read buffer it still gets to observe the heavy hitters and the policy can make a reasonable prediction without slowing down the application.
A friend of mine implemented (not optimized) FIFO in C and benchmarked its scalability on a server with 72 cores (using a pre-populated Zipf request stream). However, she was only able to get ~100 MQPS.
There are many easy mistakes that could severely diminish throughput. For example, the table can be sized to a power of two so that hash % table.length becomes hash & (table.length - 1). A 32-bit hash may be cheaper than a 64-bit one by packing more work per hardware logic unit. An expensive algorithm like SipHash might be replaced by a very simplistic hash by using red-black tree bins to mitigate collisions.

I tried to update the otter benchmarks today and tried to replicate the caffeine benchmarks, and there seems to be a reason for the numbers in the caffeine benchmark results.
Caffeine uses 2 << 14 = 32768 capacity (cache pre-filled) and a scrambled Zipfian distribution from Yahoo.
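For context, a pre-populated Zipf benchmark of this general shape might look roughly like the sketch below (a hypothetical cache interface, plain math/rand Zipf rather than the scrambled Yahoo generator, and the keyspace multiplier and read/write split are assumptions):

```go
import (
	"math/rand"
	"testing"
)

// cache is a minimal interface standing in for whichever library is measured.
type cache interface {
	Get(key uint64) (uint64, bool)
	Set(key, value uint64)
}

// Used as, e.g.: func BenchmarkOtter(b *testing.B) { benchmarkZipf(b, newOtter(2<<14), 2<<14) }
// where newOtter is a hypothetical constructor.
func benchmarkZipf(b *testing.B, c cache, capacity int) {
	// Pre-fill so the hot keys are resident and most reads are hits.
	for i := 0; i < capacity; i++ {
		c.Set(uint64(i), uint64(i))
	}
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		r := rand.New(rand.NewSource(rand.Int63()))
		zipf := rand.NewZipf(r, 1.01, 1, uint64(capacity)*16)
		for i := 0; pb.Next(); i++ {
			key := zipf.Uint64()
			if i%4 == 0 { // roughly 25% writes, 75% reads
				c.Set(key, key)
			} else {
				c.Get(key)
			}
		}
	})
}
```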
And on such a benchmark I got fantastic speed from otter.
goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 11247033 102.2 ns/op 9780838 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 24089149 44.99 ns/op 22225554 ops/s 27 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 18038624 61.85 ns/op 16167099 ops/s 108 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 21097772 53.16 ns/op 18811359 ops/s 43 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 240418147 5.085 ns/op 196646932 ops/s 0 B/op 0 allocs/op
PASS
ok github.com/maypok86/benchmarks/perf 6.965s
But if I increase the cache capacity to 2 << 20 = 2097152 (used in the ristretto benchmarks), the results are already noticeably worse, and I suspect it would be the same with caffeine.
goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 8520392 127.9 ns/op 7819737 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 19632382 56.17 ns/op 17804392 ops/s 27 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 13077836 99.85 ns/op 10015271 ops/s 86 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 15576997 75.88 ns/op 13178764 ops/s 41 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 67229958 16.86 ns/op 59302586 ops/s 0 B/op 0 allocs/op
PASS
ok github.com/maypok86/benchmarks/perf 21.023s
I also don't understand why benchmarks are configured so that elements are not evicted at all. It doesn't seem to be very representative of real world loads.
Of course, I can attach such pictures in the readme, but I have big questions about the honesty of such benchmarks.
I also don't understand why benchmarks are configured so that elements are not evicted at all. It doesn't seem to be very representative of real world loads.
This is because it is easy to write an incorrect benchmark or to misunderstand its results. Therefore that benchmark is very narrow to answer a very specific question. In this case that is if the cache is a fundamental scalability bottleneck that may harm the application's performance. A thorough analysis should ask multiple, narrow, verifiable questions that each have their own micro-benchmark for understanding the system. In no way is that single benchmark enough, but it does answer a common question and counters the belief against LRU. That question (if the cache hit penalty is a bottleneck) and its answer does not mean that being infinitely faster is important; there is a point of diminishing returns and the answer is "yes or no" not "by how much".
Caffeine does include an eviction benchmark to measure that worst-case throughput. Otherwise the questions become much murkier and easier to misinterpret, so those benchmarks were written as needed and not included in the repository. The concern is that including them might imply that they are valid and trustworthy comparisons rather than exploratory throwaways during development. You should write more benchmarks for analyzing your cache, but you should also be conservative publicly to avoid others misinterpreting the results.
An example of trying to follow your logic and making an easy mistake was in RedHat's benchmark. It is much faster to perform a cache miss than a cache hit because there is no additional work for maintaining the eviction policy. That penalized caches with higher hit rates as an increased throughput was observed when there are more misses. They didn't realize this and misunderstood their measurements, so during a rewrite their newer implementation suffered very high contention on a cache hit and the lower hit rate made it appear to be an improvement. Your question is good and worth exploring, but be very careful due to how trivially easy it is to make an invalid comparison or an honest misunderstanding of the results.
A few related talks that I found useful,
Interesting idea about the benchmark separation.
Let me share yesterday's adventures (and conclusions). I think it might be useful to someone.
I ran the benchmarks in the same configuration as caffeine on two versions of otter:
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 11870023 100.4 ns/op 9960592 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 28630963 43.99 ns/op 22732079 ops/s 26 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 17515051 60.59 ns/op 16503227 ops/s 106 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 21355162 53.30 ns/op 18761140 ops/s 43 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 86992596 14.91 ns/op 67050184 ops/s 0 B/op 0 allocs/op
PASS
ok github.com/maypok86/otter/perf 7.752s
```go
type Node[K comparable, V any] struct {
	key   K
	value atomic.Pointer[V]
	...
}
```
And the results were much better (you've already seen them):
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 9978348 115.1 ns/op 8690168 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 25813344 48.54 ns/op 20603083 ops/s 27 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 18204561 64.09 ns/op 15602573 ops/s 113 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 19909934 55.59 ns/op 17987431 ops/s 43 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 210819736 5.589 ns/op 178916881 ops/s 1 B/op 0 allocs/op
PASS
ok github.com/maypok86/otter/perf 7.292s
So it looks like no one in Go can come close to the throughput of caffeine without additional memory overhead and pressure on the GC. Otter of course already uses a set of tricks to reduce GC pressure and extra memory usage, but such a change effectively cancels them all out.
We are running on different machines, so you would have to see what Caffeine's numbers were on yours. My latest numbers came from a brand new M3 Max macbook, where I bumped the benchmark from 8 to 16 threads. The three workload types would also be nice to see, and a comparison to the unbounded hash map as a control group. Java's ConcurrentHashMap is very fast so that might be what gives caffeine an advantage if it wins on your hardware.
I'm unsure why you'd spin lock on read, but if it is important then you might find a stamped lock useful. Basically use a version counter so reads verify their stamp to ensure a consistent read, instead of blocking each other by acquiring exclusive access.
Oh, and your spinlock is TAS whereas using TTAS is usually preferred.
The problem isn't even so much the numbers compared to caffeine, but the fact that spinlock eats up a lot of time on such benchmarks (CLHT is the hash table that otter uses).
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 10268367 102.8 ns/op 9725720 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 23940507 45.51 ns/op 21972560 ops/s 27 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 19561802 61.60 ns/op 16234376 ops/s 111 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 20381444 53.41 ns/op 18721946 ops/s 43 B/op 0 allocs/op
BenchmarkCache/Zipf_CLHT_reads=75%,writes=25%-8 280790150 4.242 ns/op 235737370 ops/s 1 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 194110579 5.995 ns/op 166793393 ops/s 1 B/op 0 allocs/op
PASS
ok github.com/maypok86/otter/perf 9.498s
With the spinlock, otter is about three times slower.
spinlock eats up a lot of time on such benchmarks
I think if you use a stamped spinlock then the optimistic read will usually succeed and the spinlock overhead will mostly disappear. But I'd also try TTAS first.
@ben-manes Why am I using a spinlock?
To safely update the Entry received from the hash table when calling Set, since there is no synchronized or anything like that in Go. And I was even able to save 4 bytes this way. Also, since Get on the cache instance needs to read the value atomically (getValue), I use the same spinlock there.
There is a problem with StampedLock: it takes much more memory, judging by the Java implementation.
And a TTAS spinlock doesn't seem to save the situation much either:
```go
type SpinLock uint32

func (sl *SpinLock) Lock() {
	acquired := false
	for !acquired {
		// Only attempt the CAS once the lock has been observed to be free.
		if atomic.LoadUint32((*uint32)(sl)) == 0 {
			acquired = atomic.CompareAndSwapUint32((*uint32)(sl), 0, 1)
		}
	}
}
```
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 10992760 104.5 ns/op 9566058 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 26589556 46.96 ns/op 21294953 ops/s 28 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 17358097 62.27 ns/op 16059811 ops/s 107 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 20358709 54.12 ns/op 18478836 ops/s 43 B/op 0 allocs/op
BenchmarkCache/Zipf_CLHT_reads=75%,writes=25%-8 267684039 4.182 ns/op 239110129 ops/s 1 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 76844876 14.15 ns/op 70652962 ops/s 0 B/op 0 allocs/op
PASS
ok github.com/maypok86/otter/perf 7.847s
There is a problem with StampedLock: it takes much more memory, judging by the Java implementation.
yeah, they tend to write more bulletproof libs since many Java devs don't understand concurrency well. I thought you might be able to borrow some of the core ideas instead of the exact implementation. Since you don't use this to support blocking, you can use a simpler sequence lock. That's basically a counter where even is read mode and odd is write mode; a reader waits until it obtains a read-mode stamp, reads the data, and validates that the stamp was unchanged. That lets you perform an optimistic read without blocking other readers, and it works well when there is a low write rate (otherwise it is better to block and be preempted than to busy-wait). This wouldn't require any more space than your current one, so it's a cheap but effective trick.
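A minimal sequence-lock sketch along those lines (an illustration only, not Caffeine's or otter's code): a reader takes a stamp, copies the fields, and retries if validation fails.

```go
import "sync/atomic"

// seqLock: the counter is even when unlocked and odd while a writer is inside.
type seqLock struct {
	seq atomic.Uint64
}

func (l *seqLock) WriteLock() {
	for {
		s := l.seq.Load()
		if s&1 == 0 && l.seq.CompareAndSwap(s, s+1) {
			return // counter is now odd: in-flight readers will retry
		}
	}
}

func (l *seqLock) WriteUnlock() { l.seq.Add(1) } // back to even

// ReadStamp returns an even stamp, spinning while a write is in progress.
func (l *seqLock) ReadStamp() uint64 {
	for {
		if s := l.seq.Load(); s&1 == 0 {
			return s
		}
	}
}

// Validate reports whether the data read since ReadStamp was consistent.
func (l *seqLock) Validate(stamp uint64) bool { return l.seq.Load() == stamp }
```

A reader would then loop: take a stamp, copy the key/value fields, and use the copy only if Validate(stamp) still holds, retrying otherwise.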
The TTAS is meant to avoid the CAS in the busy loop, so that you only perform the CAS once the lock is observed to be unlocked. That way you spin on the cache line without memory bus traffic, wait for the cache coherency invalidation, and then race to acquire. Otherwise you are always penalized by the CAS trying to make a state change. Here's a version generated by chatgpt.
```go
func (sl *SpinLock) Lock() {
	for {
		// Test: spin on plain loads while the lock is held, so we only
		// generate coherence traffic once it is observed to be free.
		for atomic.LoadUint32((*uint32)(sl)) == 1 {
		}
		// Test-and-Set: now attempt to acquire the lock.
		if atomic.CompareAndSwapUint32((*uint32)(sl), 0, 1) {
			// Successfully acquired the lock.
			return
		}
		// Another thread won the race; go back to spinning.
	}
}
```
Here is a version using a sequence lock and TTAS. It took some fighting with chatgpt so it may be wrong, but gets the idea across I hope.
Seqlock could help, but maybe I don't understand something: a seqlock allows both readers and a writer to enter the critical section at the same time, which sounds like a data race. I can already see the faces of users who get warnings from a third-party library when running tests with the -race flag or trying to debug. I guess I could gate the seqlock implementation behind a build flag with the caption, "With this flag you get 3x throughput and a data race as a gift".
The intent is that while you get a race while copying the data being read, you then validate that it was a consistent read or throw away that work and try again. That is why you pair the OptimisticRead with the ValidateRead(stamp) methods, wrapped by a retry loop until successful. Once validated, you know that the data was safely copied, so if the data is small, like pointer reads, then this is very inexpensive. It does not ensure exclusivity like a read-write lock (aka shared-exclusive lock), so you cannot perform mutations that would cause the readers to fail in their inconsistent read (like rebalancing a tree, which might cause a loopy walk). That makes this optimistic concurrency more niche, but it offers a very fast solution when it is a good fit.
I still don't understand why you would lock on reads instead of only on writes. The value can be read by an atomic load while writing is made exclusive. You could then use lock striping instead of per-entry locks by hashing to an array of mutexes. I thought Go copied the Java Memory Model so you have the guarantees needed to do this safely.
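For reference, lock striping in Go is small enough to sketch inline (a hypothetical helper, not otter's API):

```go
import "sync"

// stripedLocks maps a key's hash onto a fixed pool of mutexes, so unrelated
// keys rarely contend and no per-entry lock storage is needed.
type stripedLocks struct {
	mus [64]sync.Mutex // power of two so the index is a simple mask
}

func (s *stripedLocks) forHash(hash uint64) *sync.Mutex {
	return &s.mus[hash&uint64(len(s.mus)-1)]
}
```

A writer would do locks.forHash(h).Lock(), mutate the entry, and unlock, while readers use an atomic load of the value pointer.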
The intent is that while you get a race while copying the data being read, you then validate that it was a consistent read or throw away that work and try again.
This is clear, but the problem is that the Go data race detector (as far as I remember) triggers on any simultaneous read/write of a variable. And this may mean that users of a library with such a seqlock constantly get warnings when debugging or running tests with the detector.
The value can be read by an atomic load while writing is made exclusive.
Oh, if only it were that easy. The problem is that in Java all values are references and, of course, you can easily use atomics on them. But that's not the case in Go. Values are copies of the original passed value (if an object contains pointers, such as strings, then the structure is copied, not the bytes in the string). And there are actually three ways to use atomic operations on such types in Go.
Oh, thanks for that insight! That makes a lot of sense and certainly explains why the overuse of read/write locks in Go confused me in the past. Java has minimal tearing concerns, as value types are still in development.
I guess then maybe try the earlier TTAS spinlock that I provided in case it helps, while keeping it exclusive.
In general, TTAS doesn't help much either, but it gives more stable results than TAS, so I'll switch to it and accept defeat, because the array of mutexes didn't help either. And it looks like nothing except atomics will help, because the amount of work under the locks is too small.
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 8573086 131.4 ns/op 7610200 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 21819586 48.17 ns/op 20759453 ops/s 28 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 15848017 70.14 ns/op 14258177 ops/s 102 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 19319406 55.37 ns/op 18061154 ops/s 43 B/op 0 allocs/op
BenchmarkCache/Zipf_CLHT_reads=75%,writes=25%-8 286224150 4.246 ns/op 235541757 ops/s 1 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 86898153 12.95 ns/op 77209914 ops/s 0 B/op 0 allocs/op
PASS
ok github.com/maypok86/otter/perf 9.003s
not sure if this helps, but you could try to suppress those warnings in a hacky way.
As of my last knowledge update in January 2022, there isn't a built-in annotation or directive in the Go programming language specifically designed to instruct the race detector to ignore a particular section of code. The race detector operates at runtime and analyzes the concurrent execution of goroutines, so it doesn't have a mechanism to selectively ignore parts of the code.
However, you can achieve a similar effect by using the // +build directive along with build tags. For example, you could conditionally exclude certain files or sections of code from the build when you want to run the race detector.
Here's an example:

```go
// +build !race

package main

func main() {
	// Your code here
}
```

In this example, the file will be excluded from the build when the race build tag is present. So when you want to run the race detector, you can build your program with the -race flag:

```
go build -race
```

This way, you can isolate the parts of your code where race conditions are not relevant and exclude them from the race detector checks. Keep in mind that this approach may not be perfect, and you should carefully test and verify the absence of race conditions in critical sections of your code. Always strive to design your concurrent code to be race-free whenever possible.
Wow, that's quite a boost from seqlock...
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 10644934 110.2 ns/op 9075592 ops/s 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 23894643 46.67 ns/op 21426799 ops/s 27 B/op 1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8 17937655 60.45 ns/op 16543840 ops/s 111 B/op 1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8 20935054 53.07 ns/op 18842569 ops/s 43 B/op 0 allocs/op
BenchmarkCache/Zipf_CLHT_reads=75%,writes=25%-8 278835458 4.245 ns/op 235593697 ops/s 1 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 202141250 5.572 ns/op 179468488 ops/s 0 B/op 0 allocs/op
PASS
ok github.com/maypok86/otter/perf 8.537s
However, you can achieve a similar effect by using the // +build directive along with build tags. For example, you could conditionally exclude certain files or sections of code from the build when you want to run the race detector.
I thought about it, but I'll probably gate it behind an additional build flag, as I think most applications should be fine with 50-60 million QPS of throughput, and if that's really not enough and the cache is the issue, then the user can unlock a new level of performance.
@maypok86 Correct me if I am wrong: the problem you are trying to solve is to avoid updating a value while it is being read. Can you just create a new node for the new value, update the hash table (pointing to the new node), and put the old node onto the deletion buffer? In this way, you do not need to lock the reads. You would need a reference count to make sure the old node is not freed while it is still being read, but that would be much cheaper than CAS operations.
@1a1a11a yes, but when you create an additional node, you also have to update the eviction policy, and that costs a lot more CPU time. Also, I suspect that the allocation plus the interaction with one of the limited number of deletion buffers may noticeably slow down write operations.
@maypok86 You will need to insert the new node, but you don't need to update the eviction policy because the old node will be reclaimed soon. Am I missing something here? Or is your concern about the temporary memory usage increase? If so, you could reduce the cache size by the maximum number of concurrent writes (e.g., 16 or 64). But in my very limited understanding, the temporary memory usage increase should not be an issue thanks to the GC.
Regarding the slowdown, it really depends on your use case; if the workload is read-heavy (>80% reads), removing the lock on reads would improve performance. But if your workload is update-heavy, you are right, we may experience a slowdown. There are write-heavy workloads in the wild, but I haven't seen any update-heavy workloads.
@1a1a11a but the eviction policy also stores a reference to the node, and without an explicit removal from the eviction policy the GC will not reclaim it.
Hi @maypok86, Thank you for the reply! Can you elaborate on your point?
Hi @1a1a11a, I suspect that in proposing such a solution, you are assuming that the small and main queues simply store keys in ring buffers (otherwise I don't see how this could work at all). But there is a whole set of problems here:
Hi @maypok86, I am not assuming a ring-buffer implementation. I have read your repo, and the linked-list implementation is reasonable. I might have missed some critical background because I still do not understand why an update would affect eviction. Can you help me better understand the missing piece? Here is my mental model.
Upon an update request,
This is effectively an RCU operation and has been widely used in many systems.
Let me know if I misunderstood some parts that may make this not possible. If you are interested, I am happy to have a call.
@1a1a11a In my world, it breaks the eviction policy; here's why:
Thank you! @maypok86 I see the problem now.
In my mental model, update = insert + delete, and the new node (with the updated value) is placed at the head of the queue, but you prefer keeping the node in the old position in the queue when updating the value.
What would be the problem if the new node (with an updated value) is placed at the head of the queue (with prev pointing to null, and next pointing to the old head)?
The prev and next node of the old node remain unchanged until the old node is removed.
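In code, the RCU-style update being described might look roughly like this (hypothetical types; the atomic table swap and the deletion buffer are assumptions for illustration):

```go
import "sync/atomic"

type Node[K comparable, V any] struct {
	key   K
	value atomic.Pointer[V]
}

// hashTable is assumed to support atomically swapping the mapping for a key
// and returning the previously mapped node.
type hashTable[K comparable, V any] interface {
	Swap(key K, n *Node[K, V]) (old *Node[K, V])
}

type Cache[K comparable, V any] struct {
	table     hashTable[K, V]
	deletions chan *Node[K, V]
}

// Update publishes a new node for the key and retires the old one: readers
// that still hold the old node keep using it until the policy unlinks it.
func (c *Cache[K, V]) Update(key K, value V) {
	n := &Node[K, V]{key: key}
	n.value.Store(&value)
	if old := c.table.Swap(key, n); old != nil {
		c.deletions <- old // the eviction policy unlinks and drops it later
	}
}
```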
@1a1a11a Yes, you can do that (moreover, otter did it in a similar way for some time), but such an approach requires locking the eviction policy + time to select a removal buffer and a CAS operation to insert into it + allocation and time for the GC (an insert into the hash table also requires locking). And this is very similar to the cost of a regular insert, although you could do away with a lock per node. This is where I have to seriously think about what really matters and what is best. I would have chosen the seqlock without a doubt if it weren't for the race :(.
Just saw this pass by on Hacker News: https://blog.jasony.me/system/cache/2023/08/01/s3fifo. Seems interesting. I wonder if it outperforms (not just in hit ratio, but also CPU time to calculate whether to evict) TinyLFU.