petermattis closed this issue 6 years ago.
At a recent talk, a user asked about space amplification due to having to sync to disk on every write. If we're doing only sequential small writes, is there a detrimental effect where each WAL write rewrites a whole block? This shouldn't be happening in the above example due to batching, but I'm curious.
> If we're doing only sequential small writes, is there a detrimental effect where each WAL write rewrites a whole block?

Yes. Any database that provides durability needs to do this. And like other databases, we batch such syncs together, so usually you're writing more than a single block.
The following looks at where the space/write amplification is coming from, using `kv`. The schema used by `kv` is:
CREATE TABLE IF NOT EXISTS test.kv (
k BIGINT NOT NULL PRIMARY KEY,
v BYTES NOT NULL
)
I tested using the default value size, which is 1 byte. `kv` write operations are performed using `UPSERT`:
UPSERT INTO test.kv (k, v) VALUES ($1, $2)
The `UPSERT` operation translates into a write to a single key, for example `/Table/51/1/2320603018012`. Note this is the pretty form of the key; in hex it looks like `bb89fd2034702659595caa88`, which is 12 bytes long. `kv` hits the 1PC fast-path, which means only this key is being written (i.e. there is no transaction record). The raw key gets appended with 9 bytes of timestamp to give a 21-byte key. The 1-byte value gets encoded as 8 bytes due to the column ID and checksum. The encoding into the RocksDB write batch format adds an additional 5 bytes for the tag and the lengths of the key and value. The write batch also has a 12-byte header, which translates into a final write batch size of 45 bytes. So we've gone from 8 bytes of key and 1 byte of data to 45 bytes of write batch.
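As a sanity check on the arithmetic above, here's a small sketch (plain Python, not CockroachDB code) tallying the growth from logical data to write batch:

```python
# Byte counts quoted above for a single-key UPSERT.
raw_key_len = 12      # hex key bb89... is 24 hex chars = 12 bytes
timestamp_len = 9     # MVCC timestamp suffix
encoded_key = raw_key_len + timestamp_len  # 21-byte key
encoded_value = 8     # 1-byte value + column ID + checksum
logical_bytes = 8 + 1 # 8-byte BIGINT key + 1-byte value as the user sees them
batch_bytes = 45      # measured final write batch size

print(encoded_key, batch_bytes / logical_bytes)  # 21 bytes, 5.0x amplification
```

Even before Raft or the LSM get involved, the batch is already 5x the logical data.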
The command write batch then gets placed inside a `storagebase.RaftCommand` proto, which is what we encode as the Raft proposal. The encoded size of the `RaftCommand` is 428 bytes:
RaftCommand.ProposerReplica: 6
RaftCommand.ProposerLease: 56
RaftCommand.ReplicatedEvalResult: 308
RaftCommand.WriteBatch: 47
The above sums to 417 bytes. The remaining 11 bytes are consumed by `RaftCommand.MaxLeaseIndex` and the encoding overhead of the above fields.
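A quick check (my arithmetic, not repo code) that the field breakdown accounts for the 428-byte total:

```python
# Field sizes measured above for the encoded RaftCommand (bytes).
fields = {
    "ProposerReplica": 6,
    "ProposerLease": 56,
    "ReplicatedEvalResult": 308,
    "WriteBatch": 47,
}
subtotal = sum(fields.values())
overhead = 428 - subtotal  # MaxLeaseIndex plus proto tags/lengths
print(subtotal, overhead)  # 417 11
```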
The encoded `RaftCommand` is written to the Raft log along with the Raft hard state (`raftpb.HardState`). The Raft `HardState` structure and key consume 37 bytes. The total size of the write batch containing the `HardState` and the Raft log entry is 569 bytes. This gets written to the WAL. Most of this data is also written to the RocksDB memtable and eventually flushed to disk, though key compression occurs at that level, making the precise size hard to determine.
When the Raft command is subsequently committed, we apply the write batch contained in the command. But this write batch (the 45 bytes from evaluating the command) is extended with the MVCC stats and updates to the `RaftAppliedIndex` and `LeaseAppliedIndex`. The MVCC stats weigh in at 151 bytes and the applied index entries weigh in at 64 bytes. The total size of the write batch for committing the Raft command is 260 bytes. There is again another amplification for writing this data to the memtable and then out to an sstable.
All told, we're writing 829 bytes to the RocksDB WAL on every `UPSERT` operation, and approximately the same amount of data to sstables (ignoring LSM write amplification).
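Putting the two batches together gives the per-operation WAL total; a trivial sketch:

```python
# Per-UPSERT bytes written to the RocksDB WAL, per the measurements above.
raft_batch = 569    # HardState + encoded Raft log entry
apply_batch = 260   # evaluated write batch + MVCC stats + applied indexes
total = raft_batch + apply_batch
print(total)  # 829
```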
We haven't been paying much attention to the write amplification at this level recently. There might be some easy wins here. For example, the overhead of `ReplicatedEvalResult` breaks down as:
ReplicatedEvalResult.State: 133
ReplicatedEvalResult.State.Stats: 119
ReplicatedEvalResult.Delta: 119
Note that `Delta` and `State.Stats` are both `MVCCStats` protos. So we're sending `MVCCStats` twice in every proposal! The fields of `MVCCStats` are mostly `sfixed64` because that simplifies some calculations when the stats are written to disk, but for storage in the proposal we should be using varint encodings. And gogoprotobuf has the habit of outputting zero values for non-nullable fields, and I imagine most of these fields are zero.
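To see why varints would help, here's a sketch of protobuf's base-128 varint length for small non-negative values versus the fixed 8 bytes of `sfixed64` (`varint_len` is a hypothetical helper, not gogoproto code):

```python
def varint_len(n: int) -> int:
    """Bytes needed to encode a non-negative int as a protobuf varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

# Typical range-stats magnitudes: small values cost 1-3 bytes as varints,
# but always 8 bytes as sfixed64.
for v in (0, 300, 1 << 20):
    print(v, varint_len(v), "vs fixed 8")
```

The trade-off: varints make the encoded size value-dependent, which is exactly why the on-disk stats (whose own size is counted in the stats) use fixed-width encoding.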
Cc @bdarnell, @tschottdorf
`ReplicatedEvalResult.State.Stats` is always zero. @tschottdorf is there a reason that field isn't nullable?
Various opportunities:

- `ReplicaState.stats` can be nullable. There seems to be no reason for it to be non-nullable. This shrinks the proposal by 119 bytes.
- `ReplicaState.{gc,txn_span_gc}_threshold` can be nullable. There is a `TODO(tschottdorf)` pointing out some complexity which I'm missing. This shrinks the proposal by 12 bytes.
- Use a version of `MVCCStats` in the proposal where the fields have type `int64` instead of `sfixed64`. This shrinks `ReplicatedEvalResult.Delta` by 83 bytes.
- We use `sfixed64` for the fields of `MVCCStats` because the size of the `MVCCStats` key and value are included in the stats. We could change `ComputeStats` to exclude the `MVCCStats` key from inclusion, allowing for variable-size encoding. There is a bit of a migration headache here, but I don't think it is insurmountable.
- `MVCCStats` has 13 integer fields and 1 boolean field. There isn't much point in using `sfixed64` for most of these fields given that they are range-size values. Switching to `sfixed32` would cut the encoded size in half. Again, there is a migration headache. We'd probably have to create a parallel structure because `MVCCStats` are sometimes aggregated across ranges.
- `RaftCommand.ProposerLease` is almost always an epoch-based lease. We could optimize that case by making `ProposerLease` nullable and adding a `ProposerLeaseEpoch` field. This would save ~50 bytes.

Taken together, the above could save ~330 bytes. Before worrying about the migration headaches, we could see if the savings translate into performance benefits.
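A rough tally of the itemized savings (the `sfixed32` line is my assumption: halving the ~119-byte stats encoding; the other figures are quoted above):

```python
# Per-proposal savings (bytes) from the opportunities listed above.
savings = {
    "nullable ReplicaState.stats": 119,
    "nullable gc/txn_span_gc thresholds": 12,
    "int64 (varint) MVCCStats delta": 83,
    "sfixed32 MVCCStats (assumed: ~half of 119)": 119 // 2,
    "nullable ProposerLease for epoch leases": 50,
}
total_savings = sum(savings.values())
print(total_savings)  # 323 -- consistent with the ~330 estimate
```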
I think the only reason State.Stats is non-nullable is to avoid extra pointers in the code. But it definitely ought to be nullable, at least the way it's used here. In fact, I think only a few fields of ReplicaState get transferred via ReplicatedEvalResult (and that's the only place where ReplicaState is serialized as a proto), so I think we could split ReplicaState up and only put the parts in ReplicatedEvalResult that need to be there.
More:
I think we can make all of the fields in `roachpb.Lease` nullable. Then for epoch-based leases, we can clear all of the fields in `ProposerLease` except `epoch`. Downstream of Raft, `ProposerLease.epoch` is the only field that is accessed, so there are no migration concerns.
In addition to using varints in MVCCStats, most of them are zero most of the time, so making them nullable in the proposal would reduce the size a bit further.
We should change (enhance) gogoproto so that it doesn't serialize primitive fields that are zero.
Or move selected protos to proto3, which no longer has the unset-vs-zero distinction and therefore never serializes fields with zero values. (So the move has to be done carefully; note that the version of MVCCStats that uses sfixed64 must write its zero fields to disk.)
> note that the version of MVCCStats that uses sfixed64 must write its zero fields to disk
Good point. So many migration headaches.
For any write operation, we write a number of non-versioned keys:
/Local/RangeID/X/u/RaftLog/logIndex:Y
/Local/RangeID/X/u/RaftLastIndex
/Local/RangeID/X/u/RaftHardState
/Local/RangeID/X/r/RaftAppliedIndex
/Local/RangeID/X/r/LeaseAppliedIndex
The values are generally small integers plus a checksum, and ~7 bytes in size. But non-versioned values are stored in `MVCCMetadata.raw_bytes`. The various non-nullable fields of `MVCCMetadata`, despite being the zero value, contribute an additional 14 bytes to the encoded size. Getting rid of these 14 bytes of overhead per non-versioned Raft key would save 70 bytes.
The keys themselves are all ~11 bytes in size. #13392 explored merging `RaftAppliedIndex` and `LeaseAppliedIndex` into `RangeStats`, which would save an additional 22 bytes. We could probably also get rid of `RaftLastIndex` and instead do a scan of the `RaftLog` keys to find the last index, which would save an additional 18 bytes (11 bytes of key + 7 bytes of value).
Total potential savings is ~430 bytes (from a base of 829 bytes). Note that in addition to reducing disk traffic, this would also significantly reduce network traffic.
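As a cross-check (my arithmetic, not from the thread), the itemized pieces land near the ~430-byte figure:

```python
# Rough tally (bytes) of the potential savings discussed in this thread.
proposal_savings = 330            # the "~330 bytes" of proposal savings above
mvccmetadata_overhead = 5 * 14    # 14 bytes on each of the 5 non-versioned keys
merged_applied_indexes = 22       # fold applied indexes into RangeStats (#13392)
dropped_raft_last_index = 11 + 7  # RaftLastIndex key + value
total = (proposal_savings + mvccmetadata_overhead +
         merged_applied_indexes + dropped_raft_last_index)
print(total)  # 440 -- in the neighborhood of the ~430 quoted above
```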
Unfortunately MVCCMetadata is one of the few remaining protos that is serialized below raft (MVCCStats is another, but we're already talking about separating its below-raft proto from the one used in ReplicatedEvalResult). Changing MVCCMetadata's encoding without introducing consistency problems is very tricky.
It appears that `ReplicatedEvalResult.State` is zero for normal write operations. We could just make that field nullable.
`ReplicatedEvalResult.{Start,End}Key` consume 25 bytes. And there are 4 non-nullable boolean fields in `ReplicatedEvalResult` that are almost always false. These consume 2 bytes per field (8 bytes total).
@bdarnell Understood that some of these changes would be very tricky; I'm just trying to enumerate where all the wastage is right now. Are the Raft log and Raft state keys included in consistency checks? I think they must be excluded since they are inconsistent by definition. We can't tweak `MVCCMetadata`, but perhaps we can tweak the values for those keys.
Making `ReplicatedEvalResult.State` nullable looks like an easy win. `{Start,End}Key` are 25 bytes here, but they could be larger (depending on schema). There's some discussion about removing those in #16075.
The raft log and HardState keys are not included in consistency checks, so changes to them can be handled with local migrations. (raftpb.HardState is included in below_raft_protos_test, but I think it's safe to change).
`MVCCStats` as stored under `/Local/RangeID/X/r/RangeStats` is a below-raft proto because we need to ensure that the value for that key is identical on all replicas for the consistency checker. This complicates changing `MVCCStats` but does not prohibit it. For example, we could rename the existing `MVCCStats` to `MVCCLegacyStats` and create a new `MVCCStats` proto with parallel fields, perhaps encoded differently, perhaps with additional fields. And then we'd have to enhance the consistency checker so that it translates from `MVCCStats` to `MVCCLegacyStats` before computing the checksum. This would require a small bit of special casing in `Replica.sha512`. For migration, it would probably also be necessary to translate the stats when sending a snapshot. The high-level idea is that the format as seen by the consistency checker and in snapshots is more or less frozen, but that still gives us a small bit of wiggle room.
The `version` cluster setting might help us avoid some of that migration complexity. Once the `version` field is set to 1.2, we know that 1.1 binaries are no longer an issue. So if we support both old and new formats, we can do the migration on the first command for the range that is performed after the cluster version is set to 1.2. (Guaranteeing that this has been done for all ranges before the migration to 1.3, so we can delete the old code, is trickier.)
Wow, so much waste. I assume you're going to run the perf experiments, @petermattis? Definitely looks like we should be tackling the worst offenders here.
Re: the consistency checker, it's not central but there are some related thoughts here in the context of re-vamping our (key) GC. The proposal boils down to making the consistency checker smarter about what it includes in its computations. I think we can avoid most migrations for the consistency checker by introducing a consistency checker version and a) responding with an error to a consistency request from a node which has the old (absent) version or b) ignoring checksums computed by old replicas. (and the period where we see both is just until the version bump).
edit: oh, or what Ben says.
Yeah, I'm going to run some perf experiments first without worrying about backwards compatibility and for the trickier changes we can put our heads together about how to accomplish a migration.
I implemented most of the above in an experimental form and have reduced the "raft batch" size from 569 bytes to 265 bytes and the "commit batch" from 260 bytes to 110 bytes. Surprisingly, the effect on disk write bandwidth as measured by `iostat` is smaller than expected. With the experimental changes, `iostat` reports disk bandwidth for my test of 30 MB/sec. Strangely, RocksDB's internal metrics of writes to the WAL and due to flushes and compactions sum to only 3.5 MB/sec.
Without these changes (i.e. current master), `iostat` reports disk bandwidth of 37 MB/sec while the RocksDB metrics report 8.7 MB/sec. Notice that the change from 8.7 MB/sec to 3.5 MB/sec is very close to what we'd expect given the space reductions.
One possibility is that WAL syncs are being done on changes smaller than a page size. That is, we're writing a few hundred bytes to the WAL and syncing, which gets written out to the actual device as a full page. I added some instrumentation for how many bytes were added to the WAL since the last sync. The average over a short run was 3032 bytes. The disks I'm running on have a 4KB block size. So every sync is actually writing either 1 or 2 4KB blocks. That still doesn't explain the 8.5x difference between the `iostat` numbers and the RocksDB numbers.
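The block-rounding effect described above can be sketched as follows (my arithmetic, using the instrumented 3032-byte average):

```python
import math

BLOCK = 4096               # device block size mentioned above
avg_bytes_per_sync = 3032  # instrumented average since last sync

# Each sync writes at least one whole block; a write straddling a block
# boundary can touch two.
min_blocks = math.ceil(avg_bytes_per_sync / BLOCK)
amplification = min_blocks * BLOCK / avg_bytes_per_sync
print(min_blocks, round(amplification, 2))  # 1 block, ~1.35x
```

Block rounding alone accounts for at most ~1.4-2.7x, which is why it can't explain the 8.5x gap.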
Interestingly, setting `rocksdb.min_wal_sync_interval = '1ms'` reduces the `iostat` numbers to 16 MB/sec. And setting that parameter to `5ms` reduces the `iostat` numbers to 8 MB/sec, but in the latter case it also reduces throughput. So the `iostat` numbers are connected to the amount of data being synced, but I'm not fully understanding how.
The complete set of experimental changes (which includes #18689, an experimental version of #16075 and a few other changes) shows a 5-10% improvement in throughput with lower latencies:
With the changes so far, the size of a Raft proposal for simple `kv` operations has dropped from 428 bytes to 160 bytes. The size of the "raft batch" has dropped from 569 bytes to 268 bytes. And the size of the "apply batch" hasn't changed at all.
The further optimizations in the pipeline (#16075 and #13392) will cut another 50 bytes from both the Raft proposal and "raft batch" and ~115 bytes from the "apply batch".
All proposed improvements here have been addressed and we now have some testing in place to prevent any regressions. We'll want to go back and re-evaluate places where we can trim various proto overhead in the future, but I'm going to close this issue.
Forked from #18657:
The schema in question is:
And the inserts have the form:
The load generator script is generating random 3 letter keys using upper-case characters giving a total of 17576 rows.
This issue is about understanding where the space and write amplification is coming from. Some places to investigate are the overhead of `RaftCommand`, `raftpb.HardState`, and `MVCCStats`, which are all written on every write.