cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

cache: experiment with manually managing memory #31

Closed petermattis closed 5 years ago

petermattis commented 5 years ago

cache.Cache currently caches entries where the value has been allocated by the normal Go allocator (i.e. each value is individually allocated). The upside to this approach is simplicity and that correctness is easily achieved by leveraging the Go GC. The downside is that this adds significant work for the Go GC. For example, consider a 16 GB cache composed of 32 KB blocks (the default block size in Pebble). The cache would contain 512K blocks. Each GC cycle needs to scan the pointers in memory, looking for unreachable data. The more individual pointers, the slower the GC cycle.

Rather than using the Go allocator for each cached block, we could manually manage the memory. The thinking here is to allocate a relatively small number of large chunks of memory and to break that memory into pieces. We'd want to do this in a way that the cache memory is hidden from the Go GC (otherwise we'd just be creating an equivalent number of pointers for the GC to follow). This can be accomplished by using mmap to allocate chunks of memory. Examples of doing this in Go are easily found. The challenge with this approach is correctness: how to determine when a chunk of memory retrieved from the cache is no longer being used. The brute force approach is to require every user of the cache to release memory that is retrieved from the cache. Or we can try to do something more subtle using finalizers: retrieving an object from the cache allocates a new object with a finalizer attached. When the finalizer runs it decrements a reference count on the entry. This partially leverages the Go GC and might be simpler to implement, though this usage of finalizers would have to be thoroughly tested.
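For concreteness, here is a minimal sketch of the mmap-plus-finalizer idea. The names (`arena`, `Handle`, `acquire`, etc.) are hypothetical, it assumes a Unix system where `syscall.Mmap` is available, and it illustrates the approach rather than proposing an implementation:

```go
package cache

import (
	"errors"
	"runtime"
	"sync/atomic"
	"syscall"
)

const blockSize = 32 << 10 // 32 KB, Pebble's default block size

// arena is a large anonymous mmap'd region carved into fixed-size blocks.
// Because the memory comes from mmap rather than the Go allocator, the GC
// never scans it.
type arena struct {
	mem  []byte
	next int
}

func newArena(size int) (*arena, error) {
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return nil, err
	}
	return &arena{mem: mem}, nil
}

// alloc hands out the next blockSize-sized piece of the arena.
func (a *arena) alloc() ([]byte, error) {
	if a.next+blockSize > len(a.mem) {
		return nil, errors.New("arena exhausted")
	}
	b := a.mem[a.next : a.next+blockSize : a.next+blockSize]
	a.next += blockSize
	return b, nil
}

// entry is a cached block with a manually managed reference count.
type entry struct {
	refs int32
	buf  []byte
}

// Handle is the small, GC-visible object returned to callers. A finalizer
// releases the entry's reference once the Handle becomes unreachable, so
// callers are not required to release blocks explicitly.
type Handle struct {
	e *entry
}

func acquire(e *entry) *Handle {
	atomic.AddInt32(&e.refs, 1)
	h := &Handle{e: e}
	runtime.SetFinalizer(h, func(h *Handle) {
		atomic.AddInt32(&h.e.refs, -1)
	})
	return h
}

func (h *Handle) Data() []byte { return h.e.buf }
```

Note that finalizers only run after a GC cycle notices the Handle is unreachable, so reference counts can stay elevated for a while; that latency is one of the things the finalizer approach would need thorough testing for.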

Before doing anything significant here, it will be useful to develop a benchmark showing the overhead of the current (Go allocator) approach for multi-gigabyte size caches.

petermattis commented 5 years ago

See https://allegro.tech/2016/03/writing-fast-cache-service-in-go.html#omitting-garbage-collector.

tbg commented 5 years ago

https://blog.dgraph.io/post/caching-in-go/ seems relevant, too:

Meanwhile, we came across Caffeine, a Java library used by Cassandra, Finagle and other database systems. It uses TinyLFU, a highly efficient cache admission policy and uses various techniques to scale and perform well as the number of threads and cores grow, while providing a close to optimal hit ratio. You can read more about how it works in this article.

Caffeine meets all the five requirements I mentioned at the beginning, so we are looking into building a Go equivalent of Caffeine, something which can serve our needs and potentially fill the gap of a concurrent, performant, memory-bounded cache in the Go language. Let us know if you’d like to contribute or if you have something similar built already!

petermattis commented 5 years ago

Definitely relevant. In my explorations of caching libraries I had also come across Caffeine. Go has nothing similar. It does have go-tinylfu, which is a Go implementation of the TinyLFU algorithm. It isn't usable as-is, but it could provide inspiration. The buffering of read and write operations in Caffeine is a super interesting idea and looks to be key to its performance: instead of performing LRU/LFU manipulations synchronously, they are buffered and applied in batches.
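A rough sketch of that buffering idea in Go (hypothetical names; not Caffeine's or Pebble's actual design): reads enqueue an event into a bounded channel, and a single goroutine drains it in batches.

```go
package cache

// accessEvent records a cache hit on a particular key.
type accessEvent struct {
	key uint64
}

// policy buffers access events so that LRU/LFU bookkeeping is applied in
// batches by a single goroutine rather than synchronously on every read.
type policy struct {
	events chan accessEvent
}

func newPolicy() *policy {
	p := &policy{events: make(chan accessEvent, 1024)}
	go p.drain()
	return p
}

// record is called on every cache hit. If the buffer is full the event is
// dropped: the access policy is only a heuristic, so losing occasional
// events is preferable to blocking the read path.
func (p *policy) record(key uint64) {
	select {
	case p.events <- accessEvent{key: key}:
	default:
	}
}

// drain collects buffered events and applies them to the (omitted) LRU/LFU
// structures in batches, amortizing synchronization costs.
func (p *policy) drain() {
	batch := make([]accessEvent, 0, 128)
	for ev := range p.events {
		batch = append(batch, ev)
	fill:
		for len(batch) < cap(batch) {
			select {
			case ev := <-p.events:
				batch = append(batch, ev)
			default:
				break fill
			}
		}
		// applyBatch (omitted) would update recency/frequency for every
		// key in the batch under a single lock acquisition.
		batch = batch[:0]
	}
}
```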

There doesn't seem to be a good concurrent hash table in Go. Sharding the cache (or even just the hash table) and protecting each shard with a lock is a possibility. The Non-Blocking Hash Table from Azul (Java) looks super interesting. I highly recommend that paper for how it uses a state-machine approach to reason about correctness. There is also a video presentation of those slides.

ben-manes commented 5 years ago

A valuable feature of a cache is to load (compute) the value atomically on a miss, and have other callers block waiting. This is sometimes referred to as memoization. It avoids cache stampedes that overwhelm the underlying resource due to duplicated work.

A lock-free hash table is nifty, but doesn't offer this ability. You definitely want lock-free reads, but fine-grained locking for writes is probably better. Java's rewritten ConcurrentHashMap does this and users have a lot of flexibility due to the various compute methods. Otherwise you have to emulate it yourself using lock striping, which is probably okay and removes any marginal gains due to lock-free writes. Since Go lacks a good concurrent hashmap, the choice is probably which design is easier to implement.
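To illustrate, here is a minimal sketch of lock striping combined with a self-populating lookup (hypothetical names; this is not Pebble's or Caffeine's actual API, and eviction is omitted):

```go
package cache

import "sync"

const numShards = 64

// entry holds one cached value; the sync.Once ensures the value is loaded
// at most once even under concurrent misses.
type entry struct {
	once  sync.Once
	value []byte
}

// shard is one lock stripe: a mutex plus the portion of the map it guards.
type shard struct {
	mu sync.Mutex
	m  map[uint64]*entry
}

type Cache struct {
	shards [numShards]shard
}

func New() *Cache {
	c := &Cache{}
	for i := range c.shards {
		c.shards[i].m = make(map[uint64]*entry)
	}
	return c
}

// GetOrLoad returns the cached value for key, invoking load at most once per
// key; concurrent callers for the same key wait for that single load.
func (c *Cache) GetOrLoad(key uint64, load func() []byte) []byte {
	s := &c.shards[key%numShards]
	s.mu.Lock()
	e, ok := s.m[key]
	if !ok {
		e = &entry{}
		s.m[key] = e
	}
	s.mu.Unlock()
	// The shard lock is held only long enough to install the entry; the
	// (possibly slow) load happens outside it.
	e.once.Do(func() { e.value = load() })
	return e.value
}
```

The per-entry sync.Once is what prevents a stampede: only the first caller runs load, while later callers for the same key block inside Do until the value is available.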

petermattis commented 5 years ago

@ben-manes That is a good point about cache stampedes, though I'm pretty sure RocksDB doesn't attempt to avoid them. I'm struggling to think of a workload that would exhibit a cache stampede problem that would last for more than a second on a storage engine like Pebble / RocksDB. That could just be a failure of imagination, though.

ben-manes commented 5 years ago

You're probably right. From an application's point of view, a self-populating cache is quite valuable. At the low level it may not be...

petermattis commented 5 years ago

I did some experimentation with the Go GC today to see what impact the cache size would have on GC times if we left the cache in Go managed memory.

My experiment was a toy program which allocated a cache of 1 TB composed of Y fixed size blocks. I varied the block size in order to simulate different cache sizes, operating under the assumption that the impact of a large cache is the number of pointers that need to be chased. In order to ensure that GC was running regularly, another goroutine sat in a loop allocating memory.
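A rough sketch of such a toy program (hypothetical; not the exact code used) is below. Running it under GODEBUG=gctrace=1 reports the per-cycle GC CPU times summarized in the table that follows; runtime.MemStats gives the aggregate GC share of CPU.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sink prevents the churn allocations from being optimized away.
var sink []byte

func main() {
	const numBlocks = 512 << 10 // number of cached blocks to simulate
	const blockSize = 32 << 10  // 32 KB

	// The "cache": one pointer per block for the GC to chase on every cycle.
	blocks := make([][]byte, numBlocks)
	for i := range blocks {
		blocks[i] = make([]byte, blockSize)
	}

	// Churn goroutine: keep allocating so the GC runs regularly.
	go func() {
		for {
			sink = make([]byte, 1<<20)
			time.Sleep(time.Millisecond)
		}
	}()

	var ms runtime.MemStats
	for {
		time.Sleep(10 * time.Second)
		runtime.ReadMemStats(&ms)
		fmt.Printf("blocks=%d gc-cpu-fraction=%.4f\n", len(blocks), ms.GCCPUFraction)
	}
}
```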

  blocks  gc-cpu(ms)
     64k          10
    128k          14
    256k          16
    512k          20
   1024k          31
   2048k          54
   4096k         110

Note that these are not pause times as the Go GC is concurrent and its stop-the-world phase is quite short. The above reflects the tax of a large cache on the CPU available to the program.

CockroachDB uses 32KB blocks. 512k 32KB blocks would fill a 16 GB cache, and the CPU tax would be <1%: roughly 20ms of GC CPU per cycle (from the table above) against 1s * num-CPUs of total CPU time. This is the best-case scenario as there is only one pointer per block and GC throughput is related to the number of pointers that need to be traversed. The real cache would need a hash table to find blocks, as well as some sort of LRU structure for eviction. So figure 3 pointers per cached block, perhaps more.

This all seems to be reasonable overhead. I'm going to do some more experimentation and make this somewhat more realistic, but for now I'm thinking that we can leave the cache in Go managed memory.

ben-manes commented 5 years ago

There is early work being done in dgraph-io/ristretto by @karlmcguire. So far it's exploring TinyLFU and concurrency, but they also want to look at allocation later. Might be worth collaborating in the future as both efforts are in exploratory mode right now.

petermattis commented 5 years ago

@ben-manes Yep. I'm keeping my eye on ristretto. I don't have any cycles to contribute to it at this time, though.

petermattis commented 5 years ago

As shown above, a large Go heap is not a significant problem in and of itself. The problem is significant churn in the heap. This shows up clearly in profiles of read-only workloads where the working set is larger than the cache size. In particular, the slice allocation in sstable.Reader.readBlock shows up as a hot-spot for GC assist activity. I think there is something relatively straightforward to do here: keep a small cache of recently deallocated cache blocks and try to reuse that memory for the next allocation.

In Pebble (and in RocksDB), the data block sizes are fairly uniform, clustering around the configured block size. The final data block in each sstable has a wider size range being randomly in the range [0-blockSize]. The index blocks are a bit different, with their sizes varying more widely. My current thinking is to target the 99% case of data blocks that are close to the configured block size and attempt reuse memory for those blocks. Non-standard size blocks would be GC'd as normal.