Compactions did jump right when disk reads jumped:
Doesn't quite seem like it would account for the performance degradation, though perhaps I'm underestimating the cost.
The start of the latency and disk i/o increase corresponds to the blue line in the memory graph reaching its peak. This is the node_exporter graph, so "cached" here would be the kernel's cache, not the rocksdb block cache.
And the "Disk Space Used" graph matches up: space usage on /mnt
is also growing linearly and its values track the "Cached" line exactly until it hits the limit at ~16GB. So this is just a case of performance slowing down once the data gets too big for memory (and the OS cache gives us a much larger cache than the rocksdb block cache alone). The read distribution is skewed, but this still means that we're going from a 0% miss rate to a non-zero miss rate when the cache fills up.
Compactions also need to read ~everything, whether the application's read distribution is skewed or not. So the reads here could be coming from compactions instead of the application (at the start of the process sstables can stay in the cache for their entire life until they're compacted away).
Good point about the reads possibly coming from compactions. The slow down might be expected. An additional experiment to run would be to change the size of the data ycsb writes.
@spencerkimball To add a simple RocksDB latency metric to see if that is the source of the slow down.
Kicking this back into your court. Not sure those latency stats will help. Let's get a new build pushed to a cluster which restarts nightly so we can see how that latency metric evolves over time.
I'll get a build running on denim, which isn't being used at the moment.
This is still being investigated, but it is highly unlikely anything will be addressed in the 1.0 time frame. Punting to 1.1.
denim has been running a 100% write workload for the past 3 days:
Performance started at ~14k ops/sec and has declined to ~12.5k ops/sec.
The Raft commit latencies (apologies for the different graph scale) show a minor increase over that time. More interesting than the minor increase is the relative stability of the different commit times across nodes. And the blips in commit times on denim 2 correspond with dips in the ops/sec numbers. I'm not sure what is causing the blips in commit times on denim 2. They seem to correspond to fewer disk I/Os.
Getting back to this and reproduced this on 3383ce322beba14f293dbc0542ae7ec8c750409f using ycsb --splits 5000 --concurrency 1000. Throughput is slowly declining and latencies are rising fairly fast:
Another interesting bit is that we appear to be leaking Go memory:
That could just be the growth in the various caches, but it seems suspicious. Unfortunately, I have heap profiling disabled on the cluster. RocksDB commit latencies are essentially flat. Still poking around.
Re-running with heap profiling enabled shows a lot of memory being allocated by Replica.propose:
. . 2784: // Add size of proposal to commandSizes map.
. . 2785: if r.mu.commandSizes != nil {
. 350.64MB 2786: r.mu.commandSizes[proposal.idKey] = proposal.command.Size()
. . 2787: }
inuse_objects shows:
. . 2785: if r.mu.commandSizes != nil {
. 109370 2786: r.mu.commandSizes[proposal.idKey] = proposal.command.Size()
. . 2787: }
We're doing ~2.5k {inserts,proposals}/second, so this seems high.
Cc @irfansharif.
Doing the math above indicates that r.mu.commandSizes is holding on to 3.3KB per proposal. That's fairly insane. And this is coming from runtime.hashGrow.
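For reference, the 3.3KB figure is just the two heap profile numbers above divided; a quick sketch of that arithmetic (only the numbers copied from the profile output are involved, nothing else assumed):

```go
package main

import "fmt"

// Back-of-the-envelope for the "3.3KB per proposal" claim: bytes attributed to
// the commandSizes assignment divided by the number of live entries, both taken
// from the heap profiles quoted above (line 2786 in the listing).
func main() {
	const (
		inuseBytes   = 350.64 * (1 << 20) // inuse_space: 350.64MB
		inuseObjects = 109370             // inuse_objects at the same line
	)
	fmt.Printf("~%.1f KB retained per proposal\n", inuseBytes/inuseObjects/1024.0)
}
```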
> Re-running with heap profiling enabled shows a lot of memory being allocated by Replica.propose

This was a silly oversight, fixed in #16283.
Here are two runs of ycsb --concurrency 1000. The first period on the graph is for --splits 5000 and the second for --splits 10000. My initial suspicion upon seeing the first run was that performance started to degrade when the initial set of 5k ranges began to split. The second run disproves that hypothesis as we start with 10k ranges and they never split, and yet performance still degrades.
My next suspicion is that performance is degrading when we hit the next threshold of RocksDB read amplification. Note that we configure RocksDB for levels L0-L6 and that we omit the bloom filters for L6 (otherwise we blow out memory usage).
The graph doesn't match up precisely. I can imagine various reasons why not (the read amplification metric is a snapshot in time). I'm going to run an experiment where I re-enable bloom filters for L6.
Enabling bloom filters for L6 didn't help. Not surprising given that I was misreading the read amplification graph and we're only up to L5.
We see an increase in disk reads as the performance declines.
Memory usage shows "cached" memory steadily rising until it plateaus. Shortly after that we start to see reads.
This is getting back to @bdarnell's earlier observation. Perhaps the working set has exceeded the memory capacity and what we're seeing is the increasing reliance on disk. I'm still surprised by this slowdown for this workload as we're doing 95% reads and the reads are skewed. I would expect nearly all of the reads to hit the block cache. Similarly, the writes are all to virgin keys and shouldn't be resulting in reads. So the reads are probably being caused by compactions. Is it just the increased disk utilization that is causing a problem?
RocksDB allows us to throttle the bandwidth used for compactions. I'm going to experiment with that.
Is it possible to add instrumentation to the block cache so we can see whether the percentage of hits decreases over time?
Same question about compactions. Seems like these metrics would be really helpful for understanding what's going on.
We do have metrics for the block cache hit rate and compactions. We should definitely add a graph for the hit rate. The compactions might not tell us much as we're probably doing compactions constantly, but the compactions themselves do not necessarily do disk reads as the reads might be hitting the OS buffer cache.
We see a decline in the block cache hit rate from ~95% to ~90%. This might be due to compactions as the block cache is only filled on the first read of a block. I need to think about this more.
I performed a rolling restart of the nodes on denim in an attempt to determine if the slowdown was due to something accumulating in the cockroach nodes. The answer appears to be no. Performance is actually slightly lower after the rolling restart.
Two new theories which I'm exploring:
The default Cassandra config limits compactions to 16 MB/sec.
Currently, we allow num-cpu threads for compactions and unlimited bandwidth. Reducing the number of threads to either 2 or 3 prevented RocksDB from keeping up with compactions, eventually leading to too many L0 sstables, which blocked writes. Limiting write bandwidth for flushes/compactions to 16 MB/sec resulted in a similar situation.
It seems like RocksDB does not provide the tunable we desire. RocksDB reserves one of the compaction threads for flushes (i.e. for writing a memtable to L0). The other threads are all used for compactions. But really we want at least 1 thread reserved for performing L0 compactions because read performance is impacted by more L0 files, and too many of them actually causes writes to stop (we can configure this, but I've set the value to 48 and still seen writes stopped).
Somewhat curious is the amount of disk writes we're performing. The cluster is doing 2.5K inserts/sec. YCSB writes a row containing 10 100-byte fields. So figure 1KB per row. That translates into 2.5 MB/sec of raw data being written. Add the 3x replication and the Raft log writes and the RocksDB WAL and we're up to 30 MB/sec across the cluster. But each disk is doing 45 MB/sec of writes for an aggregate of 270 MB/sec. That discrepancy seems too high to attribute to RocksDB write amplification. Is my math off somewhere?
The Raft log WriteBatch for a YCSB insert is 1250 bytes and the applied command WriteBatch is 1500 bytes. So the math above should really translate into 40 MB/sec of write bandwidth. We're seeing 270 MB/sec which is over 6x write amplification. I suppose that's possible. Seems high, though.
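A rough sketch of that arithmetic, using the batch sizes measured above. Counting each WriteBatch twice (once for the WAL, once for the eventual memtable flush) is my assumption for how to arrive at the ~40 MB/sec figure; the observed 270 MB/sec then works out to the "over 6x" amplification:

```go
package main

import "fmt"

// Numbers taken from the measurements in this thread; the factor of two for
// "WAL write + memtable flush" is an assumption, not something RocksDB reports.
func main() {
	const (
		insertsPerSec = 2500
		raftLogBatch  = 1250 // bytes: Raft log WriteBatch per YCSB insert
		appliedBatch  = 1500 // bytes: applied-command WriteBatch per YCSB insert
		replicas      = 3
		perDiskMBps   = 45.0 // observed per-node disk write bandwidth
		nodes         = 6
	)
	expected := float64(insertsPerSec*(raftLogBatch+appliedBatch)*replicas) * 2 / (1 << 20)
	observed := perDiskMBps * nodes
	fmt.Printf("expected ~%.0f MB/sec, observed ~%.0f MB/sec, ~%.1fx amplification\n",
		expected, observed, observed/expected)
}
```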
> That discrepancy seems too high to attribute to RocksDB write amplification. Is my math off somewhere?
I have noticed rocksdb doing a lot of compactions of the high level SSTables, leading to more compaction I/O than one would expect. Maybe we're tuning some of the compaction parameters poorly? We should try comparing to the two "preset" modes OptimizeLevelStyleCompaction and OptimizeUniversalStyleCompaction. The "universal" style is documented as reducing write amplification but increasing space usage.
> Maybe we're tuning some of the compaction parameters poorly?
Very possible. I'm diving again into the RocksDB parameter space, but first I'm going to verify the degradation is on writes. I'm adding a --write-duration flag to ycsb that will switch to read-only mode once that duration has elapsed.
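For illustration, a minimal sketch of what such a flag could look like in a load generator; only the flag name comes from this thread, and everything else (the 95/5 read/write mix, the toy loop) is a simplification rather than the actual ycsb code:

```go
package main

import (
	"flag"
	"fmt"
	"math/rand"
	"time"
)

var writeDuration = flag.Duration("write-duration", 30*time.Minute,
	"stop issuing writes after this duration (0 = write forever)")

func main() {
	flag.Parse()
	start := time.Now()
	for i := 0; i < 10; i++ { // stand-in for the worker loop
		writesAllowed := *writeDuration == 0 || time.Since(start) < *writeDuration
		if writesAllowed && rand.Float64() < 0.05 { // ~95% reads / 5% writes, as in the workload above
			fmt.Println("insert") // would issue a write against the cluster
		} else {
			fmt.Println("read") // would issue a read against the cluster
		}
	}
}
```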
Here's a 1h run where writes stop after 30m:
I'll do a longer run over night where we perform writes for 12h before shutting them off. That should indicate whether the slow down is purely a slow down in write throughput. In the meantime I'm going to be experimenting with RocksDB settings as it appears our write amplification is too high.
Here are graphs from a longer run where write ops were configured to stop after 12h:
Seems pretty clear from this that the performance degradation we're seeing is isolated to writes. This run included some minor tweaks to our RocksDB configuration. Specifically, I disabled subcompactions, bumped the write buffer size from 8 MB to 16 MB and bumped the size of L1 from 16 to 64 MB. This seems to have modestly reduced the performance degradation.
I also enabled periodic output of RocksDB compaction stats:
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 1/0 0.10 0.5 0.0 0.0 0.0 111.0 111.0 0.0 0.0 0.0 152.7 745 10245 0.073 0 0
L3 4/0 36.19 0.8 185.5 96.5 89.0 181.2 92.3 0.0 1.9 172.2 168.3 1103 4448 0.248 326M 41M
L4 51/0 478.83 1.0 513.2 102.7 410.4 506.9 96.5 2.7 4.9 188.7 186.4 785 6691 0.416 676M 32M
L5 417/0 4780.13 1.0 536.0 99.9 436.1 528.6 92.5 0.0 5.3 191.2 188.6 2870 5185 0.554 660M 19M
L6 562/0 47961.49 0.0 463.5 87.9 375.6 422.4 46.8 0.0 4.8 206.5 188.2 2298 3280 0.701 616M 110M
Sum 1035/0 53256.75 0.0 1698.1 387.1 1311.1 1750.1 439.1 2.7 15.8 177.4 182.9 9800 29849 0.328 2279M 204M
...
Cumulative compaction: 1750.15 GB write, 37.80 MB/s write, 1698.14 GB read, 36.68 MB/s read, 9800.2 seconds
See the Compaction Stats section of the RocksDB tuning guide for an explanation of the fields. I'm confused by the W-Amp column. The table indicates we have a write amplification of 15.8, but the cumulative compaction stats indicate compactions wrote 1750 GB of data even though we only have 53 GB remaining (33x amplification). Regardless, I think the write amplification is what is hurting our write performance. Per the SPARROW theorem, the only way to decrease write amplification is to increase read amplification. Unfortunately, my attempts so far at doing so have always resulted in worse performance.
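One possible resolution of that confusion (my reading of the Compaction Stats docs, so treat it as an assumption): the table's W-Amp is computed against the bytes flushed into L0 over the run, not against the data currently on disk. Both ratios fall out of the stats above:

```go
package main

import "fmt"

func main() {
	const (
		totalWrittenGB = 1750.1 // Sum row, Write(GB)
		flushedToL0GB  = 111.0  // L0 row, Write(GB), i.e. bytes flushed from memtables
		liveDataGB     = 53.0   // approximate data remaining on disk
	)
	fmt.Printf("vs bytes flushed to L0: %.1fx\n", totalWrittenGB/flushedToL0GB) // ~15.8, matches W-Amp
	fmt.Printf("vs live data:           %.1fx\n", totalWrittenGB/liveDataGB)    // ~33x, matches the estimate above
}
```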
Next experiment is to use a 32 MB write buffer as that will result in L0 and L1 being slightly closer in size which is a recommendation the RocksDB folks make somewhere.
The experimentation continued this weekend, but so far nothing has made an improvement. Increasing the write buffer size to 32 MB seemed to show a marginal improvement in performance, but nothing significant. I also tried setting bytes_per_sync to smooth out disk writes. No benefit. One interesting bit came from a run this weekend where writes were stopped after 24h:
As expected, performance recovered once we were only performing reads. But notice the absolute performance here is less than in the run above where writes were disabled after 12h. The difference is 55k ops/sec vs 48k ops/sec.
The disk metrics during the read-only portion are surprising:
We're performing only read operations and the vast majority of those operations are to a small set of keys, yet we're reading a significant amount from disk. This indicates that my earlier assumption that the reads were being caused by compactions might have been incorrect. Note that the disk metric graphs are for a single machine. So we're reading ~300 MB/sec across the 6 nodes in the cluster.
I instrumented ycsb to verify my assumption about the skewed access to keys. I simulated writing 216m keys (which is approximately how many were written over 24h) and then read 100m keys. The top 10000 keys read represented 99.9% of the reads. So something isn't making sense here. 10000 keys and their values can easily fit in the block cache. Even assuming each key is in a separate block (which is likely), we're only talking about 312MB. So why isn't the block cache hit rate 100%? And very interesting that this 312MB number so closely matches that 300 MB/sec number from the disk metrics. Could something be causing a block to be read and not cached on every operation?
My ycsb simulation mentioned in the previous message was flawed. Fixing it reveals that the top 10k keys only represent 57% of the reads, the top 100k keys represent 71% of reads and the top 1m keys represent 85% of reads. (These numbers are all for a simulated database where ycsb has inserted 216m keys.)
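A rough, self-contained version of that kind of simulation is below. It uses math/rand's Zipf generator, which is only an approximation of the ycsb zipfian distribution (its skew parameter must be >1, unlike ycsb's ~0.99), so the exact percentages will differ from the numbers above; it also scales the read count down to keep the run short:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

func main() {
	const (
		numKeys  = 216_000_000 // keys inserted over ~24h, per the measurements above
		numReads = 10_000_000  // scaled down from the 100m reads simulated above
		skew     = 1.1         // math/rand requires s > 1; ycsb's constant is ~0.99
	)
	r := rand.New(rand.NewSource(1))
	zipf := rand.NewZipf(r, skew, 1, numKeys-1)

	// Tally how often each key is read.
	counts := make(map[uint64]int)
	for i := 0; i < numReads; i++ {
		counts[zipf.Uint64()]++
	}

	// Sort per-key read counts descending and report cumulative coverage.
	freq := make([]int, 0, len(counts))
	for _, c := range counts {
		freq = append(freq, c)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(freq)))
	for _, k := range []int{10_000, 100_000, 1_000_000} {
		sum := 0
		for i := 0; i < k && i < len(freq); i++ {
			sum += freq[i]
		}
		fmt.Printf("top %7d keys account for %.1f%% of reads\n", k, 100*float64(sum)/numReads)
	}
}
```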
Based on this, I'd expect a performance boost from decreasing the block size, though doing so requires more memory. I'm going to experiment with that in order to verify that I'm starting to understand what is going on.
Looking at this from another direction, denim is configured with 6.9GB of block cache per node. This translates to ~1.35m 32KB blocks across the 6 machines. Looking at my simulated output, the top 1.35m keys represent 88% of reads which looks to be in the same ball park as the observed block cache hit rate. Reducing the block size to 4KB will allow caching 10m blocks which will capture 99.9% of reads. We'll know in 24h if this theory is correct.
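The block-count arithmetic, spelled out (cache size and node count from above; this assumes the caches hold only data blocks, ignoring index/filter blocks and per-block overhead):

```go
package main

import "fmt"

func main() {
	const (
		cacheGBPerNode = 6.9
		nodes          = 6
	)
	totalBytes := cacheGBPerNode * nodes * (1 << 30)
	for _, blockKB := range []int{32, 4} {
		blocks := totalBytes / float64(blockKB*1024)
		fmt.Printf("%2dKB blocks -> ~%.2fM cacheable blocks cluster-wide\n", blockKB, blocks/1e6)
	}
}
```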
Using a 4KB block size neither improved performance, nor the block cache hit rate. The lack of performance improvement makes sense in light of the block cache hit rate. I'm scratching my head right now.
Disk read bandwidth was reduced to ~1/4 of what it was when using a 32KB block size:
Disk read ops look similar to using a 32KB block size:
I'm suspicious that a significant part of the performance degradation is due to the increasing data size and the workload. With RocksDB, write amplification is a function of various knobs and of how much data is present. As more data gets added, more levels of RocksDB get used, where each additional level is 10x the size of the previous level.
How does the workload play into this? For both ycsb and kv, writes are uniformly distributed. So every byte written will need to get propagated down through the levels. The more data already present, the more data will get rewritten.
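A crude model of that effect, using the 64MB L1 and 10x level size ratio mentioned above. The "every byte is rewritten at least once per level it passes through" simplification is mine and ignores overlap with the larger next level, so it's a lower bound on the amplification:

```go
package main

import "fmt"

func main() {
	const (
		l1SizeMB = 64 // base level size, per the configuration above
		fanout   = 10 // size ratio between adjacent levels
	)
	for _, totalGB := range []float64{1, 10, 50, 100} {
		levels := 1
		levelMB := float64(l1SizeMB)
		capacityMB := levelMB
		for capacityMB < totalGB*1024 {
			levels++
			levelMB *= fanout
			capacityMB += levelMB
		}
		fmt.Printf("%6.0f GB of data -> ~%d levels -> every byte rewritten at least ~%d times\n",
			totalGB, levels, levels)
	}
}
```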
I still plan to verify my suspicion above that the performance degradation is due to the increasing data size and random writes, but this isn't a blocker for 1.1.
Closing this as old. One significant factor in the degradation was the growth of the timestamp cache which has been alleviated by its reimplementation.
Running ycsb --concurrency 600 --splits 5000 against denim (a 6-node cluster) shows the following throughput:
Similarly, latencies slowly climb:
Pre-splitting the ycsb table into 5000 ranges means that the number of ranges is constant over the lifetime of the test. The most interesting metrics that show an increase and could account for this performance decline are the disk metrics:
Each node is configured with the default 1/4 physical memory of cache, which in this case is 7GB. Each ycsb write is ~1KB in size. Writing at 2K/sec should generate ~7GB/hour and the graphs show we generated ~31GB when disk reads started. Are reads starting to miss in the cache? That's somewhat surprising given the skewed distribution for reads. Perhaps the system just reached a point where it is doing a significant number of background compactions continuously and those compactions are impacting foreground work.
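The growth arithmetic from that description, for reference (this ignores replication and RocksDB space overhead, so on-disk growth will be somewhat higher):

```go
package main

import "fmt"

func main() {
	const (
		rowBytes     = 1024 // ~1KB per ycsb write
		writesPerSec = 2000
	)
	gbPerHour := float64(rowBytes*writesPerSec) * 3600 / (1 << 30)
	fmt.Printf("~%.1f GB/hour of new data\n", gbPerHour)
	fmt.Printf("~31 GB at the point disk reads started -> ~%.1f hours of writes\n", 31/gbPerHour)
}
```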