jbowens opened this issue 1 year ago
One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.
Another observation: the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?
the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?
Gave this (https://github.com/cockroachdb/pebble/blob/master/ingest.go#L413) and the sstableKeyCompare function a look. I think the behaviour is as expected.
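For reference, here is a minimal, self-contained Go sketch of the comparison semantics in question. The types and helper below are invented for illustration and are not Pebble's actual code; the point is that a largest boundary that is an exclusive range-deletion/range-key sentinel at key k should not be treated as overlapping a span that starts at k.

```go
package main

import (
	"bytes"
	"fmt"
)

// bound models an sstable boundary key. exclusive marks end boundaries that
// come from range deletion / range key sentinels, which do not themselves
// imply a point key at that user key.
type bound struct {
	key       []byte
	exclusive bool
}

// overlaps reports whether an sstable with bounds [smallest, largest] overlaps
// the user-key span [start, end). A largest boundary that is an exclusive
// sentinel exactly at `start` must not count as overlap, which is the behavior
// the sstableKeyCompare-based check is expected to have.
func overlaps(smallest, largest bound, start, end []byte) bool {
	// No overlap if the table ends before the span starts (or ends exactly at
	// the span start with an exclusive sentinel).
	if c := bytes.Compare(largest.key, start); c < 0 || (c == 0 && largest.exclusive) {
		return false
	}
	// No overlap if the table starts at or after the span's exclusive end.
	if bytes.Compare(smallest.key, end) >= 0 {
		return false
	}
	return true
}

func main() {
	// A table whose largest key is an exclusive RANGEDEL sentinel at "d"
	// does not overlap the span ["d", "f").
	fmt.Println(overlaps(
		bound{key: []byte("a")},
		bound{key: []byte("d"), exclusive: true},
		[]byte("d"), []byte("f"),
	)) // false
}
```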
we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.
What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.
During splits, we don't really expect snapshots, at least in the common case. If a follower is slow to apply the split trigger, this can cause a spurious Raft message from the split-off leader to instantiate a new Raft replica, which will then request a snapshot, but we reject these since we'd rather wait for the split trigger. It's possible for that replica to apply the split trigger so late that it'll immediately need a snapshot to catch up though. I suppose there can be further race conditions here too -- I'll ask around for details.
One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.
This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.
What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.
Yeah, snapshots for fallen-behind followers make sense as an explanation. But we're seeing many ingests, with ~70% of them ingested as flushable, which seems high if these were all catch-up snapshots.
Gave this and the sstableKeyCompare function a look. I think the behaviour is as expected.
Agreed, it looks correct to me.
This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.
Got it, nice. This should become moot soon enough with the virtual sstables work, which can eliminate RANGEDELs and RANGEKEYDELs with a new ExciseAndIngest operation that virtualizes any overlapping sstables at ingest time to remove existing data.
What does "overlap" mean in that context? Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested? That should in fact be rare, at least on a "seasoned" cluster that sees regular memtable rotations.
As Erik pointed out, certain pathological patterns could have snapshots hit the memtable. For example, if a replica got removed and immediately re-added, we probably still have some of its bits in the memtable (from log application, or even just the raft log, which we also clear in the SSTs). Also, any snapshot that is sent as a result of log truncation will hit an existing replica, so very likely overlaps the memtable as well. These patterns may not be as rare as we'd like. We'd really have to know what snapshots were going on as these flushable ingestions happened.
We do have stats returned on each snapshot ingestion^1, so possibly we could print which snapshots are ingested with memtable overlap. Then we can pick a few and spot-check the logs for their history. This might give us a better intuition on when we see these flushable ingests.
Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested?
Exactly.
We do have stats returned on each snapshot ingestion
Good idea, put up cockroachdb/pebble#2422.
I think we can also pull the WALs and manifest from a node and dump them to find the conflicting writes.
Actions for follow up here: look at the test cluster for the proportion of flushable ingests to total ingests.
Seems like we don't actually have a timeseries for count of total ingestions. (filed #103744)
Looking at the flushable ingestions, there is a periodicity to them that suggests a correlation with workload restarts.
Now that the telemetry cluster has v23.1.x, we can also look there for more evidence.
The ingest-as-flushable graph appears to mirror the graph of snapshot receptions, suggesting applying a snapshot almost always results in memtable overlap. I don't understand why. I don't think we can chalk this up to an artifact of the telemetry cluster's artificial workload and split+scatters. If we're able to prevent this memtable overlap, we can likely significantly reduce w-amp (and r-amp) during snapshot reception.
Although arguably this will all be moot with the planned ingest+excise operation. Edit: This isn't true. Ingest+excise won't be able to cheaply "excise" the memtable, at least as designed. This will force all of these ingests into the non-flushable code path, forcing a "write stall" to wait for the flush to complete. I think ingest+excise actually increases the importance of avoiding these writes before ingesting the snapshot.
I grabbed the WALs and MANIFESTs from a test cluster node.
Here's an example flushable ingest in a WAL:
4879581.log
0(48) seq=28463980876 count=6
INGESTSST(4879575)
INGESTSST(4879580)
INGESTSST(4879576)
INGESTSST(4879577)
INGESTSST(4879578)
INGESTSST(4879579)
EOF
and the corresponding version edit:
126468172/322036
log-num: 4879582
next-file-num: 4879588
last-seq-num: 28463981903
added: L4 4879575:1526<#28463980876-#28463980876>[/Local/RangeID/318347/r""/0,0#28463980876,RANGEKEYDEL-/Local/RangeID/318347/s""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
added: L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
added: L5 4879576:2311<#28463980878-#28463980878>[/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980878,RANGEKEYDEL-/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
added: L5 4879577:1248<#28463980879-#28463980879>[/Local/Lock/Intent/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980879,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
added: L5 4879578:1224<#28463980880-#28463980880>[/Local/Lock/Intent/Table/16060/1/-248782262732346608/0,0#28463980880,RANGEKEYDEL-/Local/Lock/Intent/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
added: L3 4879579:263676<#28463980881-#28463980881>[/Table/16060/1/-248782262732346608/0,0#28463980881,RANGEKEYSET-/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
126468919/322037
Judging by the fact that 4879580 is the only sstable to ingest into L0, it seems that's where the memtable overlap was:
added: L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
In the final 1% of the previous WAL 4879549.log, there are two batches containing overlapping keys:
12083692(48) seq=28463980624 count=1
SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY
)
12084163(52) seq=28463980626 count=1
SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY
)
Are these keys expected to be written before the replica's snapshot is ingested? Is there some way we could avoid writing them until the snapshot has been ingested?
@pavelkalinnikov Can you have a look at the above?
cc @cockroachdb/replication
12083692(48) seq=28463980624 count=1 SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY )
This key is always written by getOrCreateReplica when a replica is first "encountered" on a store (either by means of a split, or snapshot application). This includes the case when a replica is initialized from a snapshot below the HandleSnapshot call. Only then do we handle raft ready (https://github.com/cockroachdb/cockroach/blob/d6ca1394d1fbbda2bf3ce3f5bfe1a78d096719bf/pkg/kv/kvserver/store_raft.go#L465), which eventually ingests the snapshot.
So, one possible way to optimize is avoiding this write, because the snapshot ingestion writes this key anyway. This doesn't seem trivial at a quick glance though, because of the way getOrCreateReplica is coupled with other paths. Looking into it, though. Perhaps we could avoid writing the range deletes in applySnapshot if we know the keys are not present.
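A hedged sketch of that idea follows; the engine and writer types are stand-ins rather than the real storage interfaces. Before emitting a range deletion for one of the replica's keyspans, check whether the engine actually contains anything in that span:

```go
package main

import "fmt"

// fakeEngine is a stand-in for the real storage engine; it just records which
// spans are known to contain data.
type fakeEngine struct{ nonEmpty map[string]bool }

func (e fakeEngine) hasDataInSpan(start, end string) bool {
	return e.nonEmpty[start+end]
}

// sstWriter is a stand-in for the snapshot SST writer.
type sstWriter struct{ ops []string }

func (w *sstWriter) clearRawRange(start, end string) {
	w.ops = append(w.ops, fmt.Sprintf("RANGEDEL [%s, %s)", start, end))
}

// maybeClearSpan writes the range tombstone only if the span is non-empty,
// which is the optimization floated above for keyspans of a brand-new
// (uninitialized) replica that are known to contain almost nothing yet.
func maybeClearSpan(e fakeEngine, w *sstWriter, start, end string) {
	if e.hasDataInSpan(start, end) {
		w.clearRawRange(start, end)
	}
}

func main() {
	e := fakeEngine{nonEmpty: map[string]bool{}}
	w := &sstWriter{}
	maybeClearSpan(e, w, "/Local/RangeID/318347/u", "/Local/RangeID/318347/v")
	fmt.Println(w.ops) // []: no tombstone needed for an empty span
}
```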
12084163(52) seq=28463980626 count=1 SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY )
As for this key, I'm not sure why it's already written. The getOrCreateReplica path doesn't write it, so it must have been something else.
RaftHardStateKey and RaftReplicaIDKey are both written for an uninitialized replica (see CreateUninitializedReplica).
@sumeerbhola I can only see the RaftReplicaIDKey write there. I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere (hence we see them in separate batches in @jbowens's comment?).
Before any of these flows, it's possible that only the RaftReplicaIDKey is written (e.g. we've got some requests for this replica before it got initialized)?
I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere
The HardState must be getting written somewhere for an uninitialized replica too, since I think an uninitialized replica can vote (so needs to remember it). This is also the reason for the dance in https://github.com/cockroachdb/cockroach/blob/f397cf964c8d41f4b5eb90fe330298ce65fe818b/pkg/kv/kvserver/store_split.go#L69-L81
Another possible scenario: getOrCreateReplica creates an uninitialized replica and writes the RaftReplicaIDKey, and the HardState gets written separately afterwards, either when the replica votes, or even without voting if it discovers a non-zero Term or Commit from other nodes (I vaguely remember seeing this happen).
@sumeerbhola Yeah, I think you're right. In the scenario above both keys can be written, and in separate batches.
We've discussed getting rid of uninitialized replicas (or at least making them not do any IO) previously. Could be good to revisit this. No uninit replicas / IO => no RaftReplicaIDKey / HardState keys written before ingestion => more efficient ingestion.
I am wondering how much of the work in writeUnreplicatedSST is optional for the case of a new uninitialized range followed by initialization: https://github.com/cockroachdb/cockroach/blob/f397cf964c8d41f4b5eb90fe330298ce65fe818b/pkg/kv/kvserver/replica_raftstorage.go#L750-L761
There will be no raft log, RaftTruncatedStateKey, or RangeLastReplicaGCTimestampKey. We can read the RaftReplicaIDKey to check that the replica id hasn't changed. And if the HardState has not changed we could avoid doing any of that work (fast-path). But perhaps HardState.Commit will change since, IIRC, that value must be 0 for an uninitialized replica.
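Here is a rough, hypothetical sketch of that fast-path. None of the types below are the real writeUnreplicatedSST code; it just assumes, per the discussion above, that for a freshly created uninitialized replica the only legitimate difference may be HardState.Commit (bumped to the snapshot index), which could then be handled outside the SST:

```go
package main

import "fmt"

// hardState is a stand-in for raftpb.HardState.
type hardState struct {
	Term, Vote, Commit uint64
}

// unreplicatedState is what we assume is already on disk for the replica.
type unreplicatedState struct {
	replicaID uint64
	hs        hardState
}

// needsUnreplicatedSST sketches the proposed check: if the replica ID matches
// and the only HardState field that changes is Commit, the clear-and-rewrite
// of the unreplicated keyspan could be skipped and Commit updated separately.
func needsUnreplicatedSST(existing unreplicatedState, newReplicaID uint64, newHS hardState) bool {
	if existing.replicaID != newReplicaID {
		return true
	}
	if existing.hs.Term != newHS.Term || existing.hs.Vote != newHS.Vote {
		return true
	}
	// Only Commit differs (it is 0 for an uninitialized replica and gets
	// bumped to the snapshot index): fast-path, no unreplicated SST needed.
	return false
}

func main() {
	existing := unreplicatedState{replicaID: 2, hs: hardState{Term: 6}}
	fmt.Println(needsUnreplicatedSST(existing, 2, hardState{Term: 6, Commit: 153})) // false
}
```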
btw, are we accidentally clearing the RangeTombstoneKey, which would be a bug?
To verify the theories from above, I added the following instrumentation to snapshot ingestion and spun up a multi-node local cluster:
--- a/pkg/kv/kvserver/replica_raftstorage.go
+++ b/pkg/kv/kvserver/replica_raftstorage.go
@@ -514,6 +514,33 @@ func (r *Replica) applySnapshot(
log.Infof(ctx, "applied %s (%s)", inSnap, logDetails)
}(timeutil.Now())
+ // Look at what keys are already written into the unreplicated key space.
+ {
+ unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(r.ID().RangeID)
+ unreplicatedStart := unreplicatedPrefixKey
+ unreplicatedEnd := unreplicatedPrefixKey.PrefixEnd()
+ it := r.store.TODOEngine().NewEngineIterator(storage.IterOptions{UpperBound: unreplicatedEnd})
+ defer it.Close()
+ var ok bool
+ var err error
+ for ok, err = it.SeekEngineKeyGE(storage.EngineKey{Key: unreplicatedStart}); ok; ok, err = it.NextEngineKey() {
+ key, err := it.UnsafeEngineKey()
+ if err != nil {
+ panic(err)
+ }
+ log.Infof(ctx, "found unreplicated key: %s", key.Key)
+ }
+ if err != nil {
+ panic(err)
+ }
+
+ prevHS, err := r.raftMu.stateLoader.StateLoader.LoadHardState(ctx, r.store.TODOEngine())
+ if err != nil {
+ panic(err)
+ }
+ log.Infof(ctx, "found prev HardState: %+v", prevHS)
+ }
+
unreplicatedSSTFile, nonempty, err := writeUnreplicatedSST(
Each INITIAL snapshot (during upreplication) looked the same. Each had a RaftHardStateKey and a RaftReplicaIDKey. The HardState's vote was 0 (the common case, but not always true), the commit index was 0 (I believe this is always the case), and the term was 6.
I230608 04:22:04.625626 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 128 found unreplicated key: /Local/RangeID‹/13›/‹u›/‹RaftHardState›
I230608 04:22:04.625641 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 129 found unreplicated key: /Local/RangeID‹/13›/‹u›‹"rftr"›
I230608 04:22:04.625674 1331 kv/kvserver/replica_raftstorage.go:541 ⋮ [T1,n2,s2,r13/2:{-}] 130 found prev HardState: {Term:6 Vote:0 Commit:0}
I230608 04:22:04.684671 1331 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n2,s2,r13/2:‹/Table/1{1-2}›] 131 applied INITIAL snapshot 35998b43 from (n1,s1):1 at applied index 153 (total=59ms data=1.8 KiB ingestion=6@49ms)
The RaftReplicaIDKey is written during replica creation. It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.
It is less clear what we should do with the RaftHardStateKey. Snapshots are always accompanied by an updated HardState whose commit index is set to the index of the snapshot. We currently write this updated HardState into the snapshot's ssts, so that the HardState is updated atomically with the snapshot ingestion. Is this necessary for correctness? Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion, in order to avoid the key collision with the memtable during ingestion?
It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.
We're hoping to remove the range deletion, replacing it with an atomic ingest+excise operation that virtualizes overlapping sstables to hide existing data. This will allow ingested snapshot sstables to unconditionally ingest into L6. While we could separately ingest+excise on either side of this key, it'd be preferable to avoid the tiny files.
Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion in order to avoid the key collision with the memtable during ingestion?
The ReplicasStorage (separated raft log) design included an init step at crash recovery time that would see if HardState.Commit < RangeAppliedState.RaftAppliedIndex and if so update it to make it equal to RangeAppliedState.RaftAppliedIndex. I don't know if we already do something like this. If we do (or can), we could check that the only thing that needs updating is HardState.Commit (as you outlined above) and then write it after the ingest (the separated raft log design calls for the invariant that HardState.Commit is 0 for an uninitialized replica, so I don't think we should write it before the ingest).
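A minimal sketch of that recovery step, assuming the ReplicasStorage-style invariant described above; the types below are stand-ins for raftpb.HardState and the range applied state, not existing CockroachDB code:

```go
package main

import "fmt"

// hardState and appliedState are stand-ins for raftpb.HardState and the
// range's applied state; the recovery step sketched here follows the
// ReplicasStorage design mentioned above, not existing code.
type hardState struct{ Term, Vote, Commit uint64 }
type appliedState struct{ RaftAppliedIndex uint64 }

// reconcileCommitOnRecovery bumps HardState.Commit up to the applied index,
// covering a crash that lands between ingesting the snapshot SSTs and writing
// the new HardState in a separate batch after the ingest.
func reconcileCommitOnRecovery(hs hardState, as appliedState) hardState {
	if hs.Commit < as.RaftAppliedIndex {
		hs.Commit = as.RaftAppliedIndex
	}
	return hs
}

func main() {
	// A snapshot at index 153 was ingested, but the follow-up HardState write
	// was lost in a crash; recovery repairs Commit.
	hs := reconcileCommitOnRecovery(hardState{Term: 6, Commit: 0}, appliedState{RaftAppliedIndex: 153})
	fmt.Printf("%+v\n", hs) // {Term:6 Vote:0 Commit:153}
}
```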
@tbg
Typically the commit index in the HardState does not require durability and can be reconstructed after a crash.
I would tread carefully here due to apply-time conf changes. There is a secret invariant that you can't have more than one "committed but not known committed" confchange in the log. I don't know how much this can be relaxed, but either way I am not terribly motivated to relax anything in that area, since it is generally poorly understood and problems would not be discovered until it is potentially too late.
I was actually thinking the other day that we should just get out of the business of apply-time conf changes. I filed an issue^1 to that effect. With that, the Committed index would be entirely discretionary, I think, mod, perhaps, additional meaning we've attached to it with replica quiescence, which one day might be on its way out^2, too.
In the 23.1 test cluster, we're observing higher than expected exercising of the new flushable ingests code path. This code path is triggered when one or more of the sstables in an ingest overlaps with one or more of the memtables. In 22.2 and earlier, an ingest would flush the memtable, wait for the flush to complete, and then proceed. In 23.1 the ingest writes a record committing to the ingest to the WAL and layers the ingested sstables onto the 'flushable' queue containing the memtables and large batches.
The flushable ingest numbers we're observing indicate that many snapshot applications overlap a memtable, forcing memtable rotation and flushes. This is unexpected, because we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.
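For context, here is a simplified sketch of the decision described above. It is not Pebble's actual ingest code: the real check looks for keys in the memtable that fall within the ingested sstables' bounds, whereas this sketch approximates both sides as user-key spans.

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a simplified [start, end) user-key span.
type span struct{ start, end []byte }

func (a span) overlaps(b span) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// ingestPath sketches the 23.1 behavior described above: with memtable
// overlap, the ingest is committed to the WAL and queued behind the memtable
// as a "flushable"; without overlap it goes straight into the LSM.
func ingestPath(memtableSpans, sstSpans []span) string {
	for _, sst := range sstSpans {
		for _, m := range memtableSpans {
			if sst.overlaps(m) {
				return "ingest-as-flushable (WAL record + flushable queue)"
			}
		}
	}
	return "regular ingest (no memtable overlap)"
}

func main() {
	mem := []span{{[]byte("/Local/RangeID/318347/u"), []byte("/Local/RangeID/318347/v")}}
	ssts := []span{{[]byte("/Local/RangeID/318347/u"), []byte("/Local/RangeID/318347/v")}}
	fmt.Println(ingestPath(mem, ssts)) // ingest-as-flushable (WAL record + flushable queue)
}
```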
Here's a few example version edits pulled from the test cluster.
There's no specific recording of which version edits correspond to flushable ingests that overlap with the memtable, but I believe all the ones that ratchet log-num are, since only flushes should be ratcheting log-num.
Jira issue: CRDB-25798
Epic CRDB-27235