petermattis closed this issue 6 years ago.
At a recent talk, a user asked about space amplification due to having to sync to disk on every write. If we're doing only sequential small writes, is there a detrimental effect where each WAL write rewrites a whole block? This shouldn't be happening in the above example due to batching, but I'm curious.
> If we're doing only sequential small writes, is there a detrimental effect where each WAL write rewrites a whole block?

Yes. Any database that provides durability needs to do this. And like other databases, we batch such syncs together, so usually you're writing more than a single block.
The following looks at where the space/write amplification is coming from, using `kv`. The schema used by `kv` is:
CREATE TABLE IF NOT EXISTS test.kv (
k BIGINT NOT NULL PRIMARY KEY,
v BYTES NOT NULL
)
I tested using the default value size, which is 1 byte. `kv` write operations are performed using `UPSERT`:
UPSERT INTO test.kv (k, v) VALUES ($1, $2)
The `UPSERT` operation translates into a write to a single key, for example `/Table/51/1/2320603018012`. Note this is the pretty form of the key; in hex it looks like `bb89fd2034702659595caa88`, which is 12 bytes long. `kv` hits the 1PC fast-path, which means only this key is being written (i.e. there is no transaction record). The raw key gets appended with 9 bytes of timestamp to give a 21-byte key. The 1-byte value gets encoded as 8 bytes due to the column ID and checksum. The encoding into the RocksDB write batch format adds an additional 5 bytes for the tag and the lengths of the key and value. The write batch also has a 12-byte header, which translates into a final write batch size of 45 bytes. So we've gone from 8 bytes of key and 1 byte of data to 45 bytes of write batch.
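As a sanity check on the arithmetic above, here's a small sketch (plain Python, not CockroachDB code) tallying the growth from logical data to write batch:

```python
# Byte counts quoted above for a single-key UPSERT.
raw_key_len = 12      # hex key bb89... is 24 hex chars = 12 bytes
timestamp_len = 9     # MVCC timestamp suffix
encoded_key = raw_key_len + timestamp_len  # 21-byte key
encoded_value = 8     # 1-byte value + column ID + checksum
logical_bytes = 8 + 1 # 8-byte BIGINT key + 1-byte value as the user sees them
batch_bytes = 45      # measured final write batch size

print(encoded_key, batch_bytes / logical_bytes)  # 21 bytes, 5.0x amplification
```

Even before Raft or the LSM get involved, the batch is already 5x the logical data.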
The command write batch then gets placed inside a `storagebase.RaftCommand` proto, which is what we encode as the Raft proposal. The encoded size of the `RaftCommand` is 428 bytes:
RaftCommand.ProposerReplica: 6
RaftCommand.ProposerLease: 56
RaftCommand.ReplicatedEvalResult: 308
RaftCommand.WriteBatch: 47
The above sums to 417 bytes. The remaining 11 bytes are consumed by `RaftCommand.MaxLeaseIndex` and the encoding overhead of the above fields.
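A quick check (my arithmetic, not repo code) that the field breakdown accounts for the 428-byte total:

```python
# Field sizes measured above for the encoded RaftCommand (bytes).
fields = {
    "ProposerReplica": 6,
    "ProposerLease": 56,
    "ReplicatedEvalResult": 308,
    "WriteBatch": 47,
}
subtotal = sum(fields.values())
overhead = 428 - subtotal  # MaxLeaseIndex plus proto tags/lengths
print(subtotal, overhead)  # 417 11
```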
The encoded `RaftCommand` is written to the Raft log along with the Raft hard state (`raftpb.HardState`). The Raft `HardState` structure and key consume 37 bytes. The total size of the write batch containing the `HardState` and the Raft log entry is 569 bytes. This gets written to the WAL. Most of this data is also written to the RocksDB memtable and eventually flushed to disk, though key compression occurs at that level, making the precise size hard to determine.
When the Raft command is subsequently committed, we apply the write batch contained in the command. But this write batch (the 45 bytes from evaluating the command) is extended with the MVCC stats and updates to the `RaftAppliedIndex` and `LeaseAppliedIndex`. The MVCC stats weigh in at 151 bytes and the applied index entries weigh in at 64 bytes. The total size of the write batch for committing the Raft command is 260 bytes. There is again another amplification for writing this data to the memtable and then out to an sstable.
All told, we're writing 829 bytes to the RocksDB WAL on every `UPSERT` operation, and approximately the same amount of data to sstables (ignoring LSM write amplification).
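Putting the two batches together gives the per-operation WAL total; a trivial sketch:

```python
# Per-UPSERT bytes written to the RocksDB WAL, per the measurements above.
raft_batch = 569    # HardState + encoded Raft log entry
apply_batch = 260   # evaluated write batch + MVCC stats + applied indexes
total = raft_batch + apply_batch
print(total)  # 829
```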
We haven't been paying much attention to the write amplification at this level recently. There might be some easy wins here. For example, the overhead of `ReplicatedEvalResult` breaks down as:
ReplicatedEvalResult.State: 133
ReplicatedEvalResult.State.Stats: 119
ReplicatedEvalResult.Delta: 119
Note that `Delta` and `State.Stats` are both `MVCCStats` protos. So we're sending `MVCCStats` twice in every proposal! The fields of `MVCCStats` are mostly `sfixed64` because that simplifies some calculations when the stats are written to disk, but for storage in the proposal we should be using varint encodings. And gogoprotobuf has the habit of outputting zero values for non-nullable fields, and I imagine most of these fields are zero.
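To see why varints would help, here's a sketch of protobuf's base-128 varint length for small non-negative values versus the fixed 8 bytes of `sfixed64` (`varint_len` is a hypothetical helper, not gogoproto code):

```python
def varint_len(n: int) -> int:
    """Bytes needed to encode a non-negative int as a protobuf varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

# Typical range-stats magnitudes: small values cost 1-3 bytes as varints,
# but always 8 bytes as sfixed64.
for v in (0, 300, 1 << 20):
    print(v, varint_len(v), "vs fixed 8")
```

The trade-off: varints make the encoded size value-dependent, which is exactly why the on-disk stats (whose own size is counted in the stats) use fixed-width encoding.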
Cc @bdarnell, @tschottdorf
`ReplicatedEvalResult.State.Stats` is always zero. @tschottdorf is there a reason that field isn't nullable?
Various opportunities:

- `ReplicaState.stats` can be nullable. There seems to be no reason for it to be non-nullable. This shrinks the proposal by 119 bytes.
- `ReplicaState.{gc,txn_span_gc}_threshold` can be nullable. There is a `TODO(tschottdorf)` pointing out some complexity which I'm missing. This shrinks the proposal by 12 bytes.
- Use a version of `MVCCStats` in the proposal where the fields have type `int64` instead of `sfixed64`. This shrinks `ReplicatedEvalResult.Delta` by 83 bytes.
- We use `sfixed64` for the fields of `MVCCStats` because the size of the `MVCCStats` key and value are included in the stats. We could change `ComputeStats` to exclude the `MVCCStats` key from inclusion, allowing for variable-size encoding. There is a bit of a migration headache here, but I don't think it is insurmountable.
- `MVCCStats` has 13 integer fields and 1 boolean field. There isn't much point in using `sfixed64` for most of these fields given that they are range-size values. Switching to `sfixed32` would cut the encoded size in half. Again, there is a migration headache. We'd probably have to create a parallel structure because `MVCCStats` are sometimes aggregated across ranges.
- `RaftCommand.ProposerLease` is almost always an epoch-based lease. We could optimize that case by making `ProposerLease` nullable and adding a `ProposerLeaseEpoch` field. This would save ~50 bytes.

Taken together, the above could save ~330 bytes. Before worrying about the migration headaches, we could see if the savings translate into performance benefits.
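A rough tally of the itemized savings (the `sfixed32` line is my assumption: halving the ~119-byte stats encoding; the other figures are quoted above):

```python
# Per-proposal savings (bytes) from the opportunities listed above.
savings = {
    "nullable ReplicaState.stats": 119,
    "nullable gc/txn_span_gc thresholds": 12,
    "int64 (varint) MVCCStats delta": 83,
    "sfixed32 MVCCStats (assumed: ~half of 119)": 119 // 2,
    "nullable ProposerLease for epoch leases": 50,
}
total_savings = sum(savings.values())
print(total_savings)  # 323 -- consistent with the ~330 estimate
```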
I think the only reason State.Stats is non-nullable is to avoid extra pointers in the code. But it definitely ought to be nullable, at least the way it's used here. In fact, I think only a few fields of ReplicaState get transferred via ReplicatedEvalResult (and that's the only place where ReplicaState is serialized as a proto), so I think we could split ReplicaState up and only put the parts in ReplicatedEvalResult that need to be there.
More:
I think we can make all of the fields in `roachpb.Lease` nullable. Then for epoch-based leases, we can clear all of the fields in `ProposerLease` except `epoch`. Downstream of Raft, `ProposerLease.epoch` is the only field that is accessed, so there are no migration concerns.
In addition to using varints in MVCCStats, most of them are zero most of the time, so making them nullable in the proposal would reduce the size a bit further.
We should change (enhance) gogoproto so that it doesn't serialize primitive fields that are zero.
Or move selected protos to proto3, which no longer has the unset-vs-zero distinction and therefore never serializes fields with zero values. (So the move has to be done carefully; note that the version of MVCCStats that uses sfixed64 must write its zero fields to disk.)
> note that the version of MVCCStats that uses sfixed64 must write its zero fields to disk
Good point. So many migration headaches.
For any write operation, we write a number of non-versioned keys:
/Local/RangeID/X/u/RaftLog/logIndex:Y
/Local/RangeID/X/u/RaftLastIndex
/Local/RangeID/X/u/RaftHardState
/Local/RangeID/X/r/RaftAppliedIndex
/Local/RangeID/X/r/LeaseAppliedIndex
The values are generally small integers plus a checksum, and ~7 bytes in size. But non-versioned values are stored in `MVCCMetadata.raw_bytes`. The various non-nullable fields of `MVCCMetadata`, despite being the zero value, contribute an additional 14 bytes to the encoded size. Getting rid of these 14 bytes of overhead per non-versioned Raft key would save 70 bytes.
The keys themselves are all ~11 bytes in size. #13392 explored merging `RaftAppliedIndex` and `LeaseAppliedIndex` into `RangeStats`, which would save an additional 22 bytes. We could probably also get rid of `RaftLastIndex` and instead do a scan of the `RaftLog` keys to find the last index, which would save an additional 18 bytes (11 bytes of key + 7 bytes of value).
Total potential savings is ~430 bytes (from a base of 829 bytes). Note that in addition to reducing disk traffic, this would also significantly reduce network traffic.
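As a cross-check (my arithmetic, not from the thread), the itemized pieces land near the ~430-byte figure:

```python
# Rough tally (bytes) of the potential savings discussed in this thread.
proposal_savings = 330            # the "~330 bytes" of proposal savings above
mvccmetadata_overhead = 5 * 14    # 14 bytes on each of the 5 non-versioned keys
merged_applied_indexes = 22       # fold applied indexes into RangeStats (#13392)
dropped_raft_last_index = 11 + 7  # RaftLastIndex key + value
total = (proposal_savings + mvccmetadata_overhead +
         merged_applied_indexes + dropped_raft_last_index)
print(total)  # 440 -- in the neighborhood of the ~430 quoted above
```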
Unfortunately MVCCMetadata is one of the few remaining protos that is serialized below raft (MVCCStats is another, but we're already talking about separating its below-raft proto from the one used in ReplicatedEvalResult). Changing MVCCMetadata's encoding without introducing consistency problems is very tricky.
It appears that `ReplicatedEvalResult.State` is zero for normal write operations. We could just make that field nullable.
`ReplicatedEvalResult.{Start,End}Key` consume 25 bytes. And there are 4 non-nullable boolean fields in `ReplicatedEvalResult` that are almost always false. These consume 2 bytes per field (8 bytes total).
@bdarnell Understood that some of these changes would be very tricky; I'm just trying to enumerate where all the wastage is right now. Are the Raft log and Raft state keys included in consistency checks? I think they must be excluded since they are inconsistent by definition. We can't tweak `MVCCMetadata`, but perhaps we can tweak the values for those keys.
Making `ReplicatedEvalResult.State` nullable looks like an easy win. `{Start,End}Key` are 25 bytes here, but they could be larger (depending on schema). There's some discussion about removing those in #16075.
The raft log and HardState keys are not included in consistency checks, so changes to them can be handled with local migrations. (raftpb.HardState is included in below_raft_protos_test, but I think it's safe to change).
`MVCCStats` as stored under `/Local/RangeID/X/r/RangeStats` is a below-raft proto because we need to ensure that the value for that key is identical on all replicas for the consistency checker. This complicates changing `MVCCStats` but does not prohibit it. For example, we could rename the existing `MVCCStats` to `MVCCLegacyStats` and create a new `MVCCStats` proto with parallel fields, perhaps encoded differently, perhaps with additional fields. And then we'd have to enhance the consistency checker so that it translates from `MVCCStats` to `MVCCLegacyStats` before computing the checksum. This would require a small bit of special casing in `Replica.sha512`. For migration, it would probably also be necessary to translate the stats when sending a snapshot. The high-level idea is that the format as seen by the consistency checker and in snapshots is more or less frozen, but that still gives us a small bit of wiggle room.
The `version` cluster setting might help us avoid some of that migration complexity. Once the `version` field is set to 1.2, we know that 1.1 binaries are no longer an issue. So if we support both old and new formats, we can do the migration on the first command for the range that is performed after the cluster version is set to 1.2. (Guaranteeing that this has been done for all ranges before the migration to 1.3, so we can delete the old code, is trickier.)
Wow, so much waste. I assume you're going to run the perf experiments, @petermattis? Definitely looks like we should be tackling the worst offenders here.
Re: the consistency checker, it's not central but there are some related thoughts here in the context of re-vamping our (key) GC. The proposal boils down to making the consistency checker smarter about what it includes in its computations. I think we can avoid most migrations for the consistency checker by introducing a consistency checker version and a) responding with an error to a consistency request from a node which has the old (absent) version or b) ignoring checksums computed by old replicas. (and the period where we see both is just until the version bump).
edit: oh, or what Ben says.
Yeah, I'm going to run some perf experiments first without worrying about backwards compatibility and for the trickier changes we can put our heads together about how to accomplish a migration.
I implemented most of the above in an experimental form and have reduced the "raft batch" size from 569 bytes to 265 bytes and the "commit batch" from 260 bytes to 110 bytes. Surprisingly, the effect on disk write bandwidth as measured by `iostat` is smaller than expected. With the experimental changes, `iostat` reports disk bandwidth for my test of 30 MB/sec. Strangely, RocksDB's internal metrics of writes to the WAL and due to flushes and compactions sum to only 3.5 MB/sec.
Without these changes (i.e. current master), `iostat` reports disk bandwidth of 37 MB/sec while the RocksDB metrics report 8.7 MB/sec. Notice that the change from 8.7 MB/sec to 3.5 MB/sec is very close to what we'd expect given the space reductions.
One possibility is that WAL syncs are being done on changes smaller than a page size. That is, we're writing a few hundred bytes to the WAL and syncing, which gets written out to the actual device as a full page. I added some instrumentation for how many bytes were added to the WAL since the last sync. The average over a short run was 3032 bytes. The disks I'm running on have a 4KB block size. So every sync is actually writing either 1 or 2 4KB blocks. That still doesn't explain the 8.5x difference between the `iostat` numbers and the RocksDB numbers.
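The block-rounding effect described above can be sketched as follows (my arithmetic, using the instrumented 3032-byte average):

```python
import math

BLOCK = 4096               # device block size mentioned above
avg_bytes_per_sync = 3032  # instrumented average since last sync

# Each sync writes at least one whole block; a write straddling a block
# boundary can touch two.
min_blocks = math.ceil(avg_bytes_per_sync / BLOCK)
amplification = min_blocks * BLOCK / avg_bytes_per_sync
print(min_blocks, round(amplification, 2))  # 1 block, ~1.35x
```

Block rounding alone accounts for at most ~1.4-2.7x, which is why it can't explain the 8.5x gap.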
Interestingly, setting `rocksdb.min_wal_sync_interval = '1ms'` reduces the `iostat` numbers to 16 MB/sec. And setting that parameter to `5ms` reduces the `iostat` numbers to 8 MB/sec, but in the latter case it also reduces throughput. So the `iostat` numbers are connected to the amount of data being synced, but I'm not fully understanding how.
The complete set of experimental changes (which includes #18689, an experimental version of #16075 and a few other changes) shows a 5-10% improvement in throughput with lower latencies:
With the changes so far, the size of a Raft proposal for simple `kv` operations has dropped from 428 bytes to 160 bytes. The size of the "raft batch" has dropped from 569 bytes to 268 bytes. And the size of the "apply batch" hasn't changed at all.
The further optimizations in the pipeline (#16075 and #13392) will cut another 50 bytes from both the Raft proposal and "raft batch" and ~115 bytes from the "apply batch".
All proposed improvements here have been addressed and we now have some testing in place to prevent any regressions. We'll want to go back and re-evaluate places where we can trim various proto overhead in the future, but I'm going to close this issue.
Forked from #18657:
The schema in question is:
And the inserts have the form:
The load generator script is generating random 3 letter keys using upper-case characters giving a total of 17576 rows.
This issue is about understanding where the space and write amplification is coming from. Some places to investigate are the overhead of `RaftCommand`, `raftpb.HardState`, and `MVCCStats`, which are all written on every write.