
storage,kv: investigate overlap memtable during snapshot application #99273

Open jbowens opened 1 year ago

jbowens commented 1 year ago

In the 23.1 test cluster, we're observing higher than expected exercising of the new flushable ingest code path. This code path is triggered when one or more of the sstables in an ingest overlaps one or more of the memtables. In 22.2 and earlier, an ingest would flush the memtable, wait for the flush to complete, and then proceed. In 23.1 the ingest writes a record to the WAL committing to the ingest and layers the ingested sstables onto the 'flushable' queue containing the memtables and large batches.
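To make the two paths concrete, here's a minimal, self-contained sketch of the decision as I understand it (the types and helpers are simplified stand-ins for illustration, not Pebble's actual code):

```go
package main

import "fmt"

// keyRange is a simplified half-open [start, end) user-key span.
type keyRange struct{ start, end string }

// overlaps reports whether two half-open spans intersect.
func (a keyRange) overlaps(b keyRange) bool {
	return a.start < b.end && b.start < a.end
}

// ingestPath models the decision described above: if any ingested sstable's
// bounds overlap the memtable, 22.2 would flush the memtable and wait before
// ingesting, while 23.1 writes a WAL record committing to the ingest and
// queues the sstables as a flushable. Paraphrase only, not Pebble's code.
func ingestPath(ssts []keyRange, memtable keyRange) string {
	for _, s := range ssts {
		if s.overlaps(memtable) {
			return "flushable ingest (23.1) / flush-and-wait (22.2)"
		}
	}
	return "regular ingest: no memtable interaction"
}

func main() {
	// A memtable holding recent writes spanning local and table keys.
	memtable := keyRange{start: "/Local/RangeID/191853/u", end: "/Table/13/zzz"}
	ssts := []keyRange{
		{start: "/Local/RangeID/191853/r", end: "/Local/RangeID/191853/s"},
		{start: "/Table/1199/1/2073185522769554992", end: "/Table/1199/1/2110042154285458192"},
	}
	fmt.Println(ingestPath(ssts, memtable))
}
```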

The flushable ingest numbers we're observing indicate that many snapshot applications overlap a memtable, forcing memtable rotation and flushes. This is unexpected, because we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.

Here are a few example version edits pulled from the test cluster.

61234027/158636
  log-num:       2073703
  next-file-num: 2073724
  last-seq-num:  15888760204
  added:         L0 2073696:1535<#15888757362-#15888757362>[/Local/RangeID/191853/r""/0,0#15888757362,RANGEKEYDEL-/Local/RangeID/191853/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073701:1180<#15888757363-#15888757363>[/Local/RangeID/191853/u""/0,0#15888757363,RANGEDEL-/Local/RangeID/191853/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073697:2387<#15888757364-#15888757364>[/Local/Range/Table/1199/1/2073185522769554992/0,0#15888757364,RANGEKEYDEL-/Local/Range/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073698:1248<#15888757365-#15888757365>[/Local/Lock/Intent/Local/Range/Table/1199/1/2073185522769554992/0,0#15888757365,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073699:1224<#15888757366-#15888757366>[/Local/Lock/Intent/Table/1199/1/2073185522769554992/0,0#15888757366,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073700:1200<#15888757367-#15888757367>[/Table/1199/1/2073185522769554992/0,0#15888757367,RANGEKEYDEL-/Table/1199/1/2110042154285458192/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)

61234765/158637
  next-file-num: 2073734
  last-seq-num:  15888760945
  added:         L0 2073717:1540<#15888760939-#15888760939>[/Local/RangeID/191841/r""/0,0#15888760939,RANGEKEYDEL-/Local/RangeID/191841/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073722:1179<#15888760940-#15888760940>[/Local/RangeID/191841/u""/0,0#15888760940,RANGEDEL-/Local/RangeID/191841/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073718:2938<#15888760941-#15888760941>[/Local/Range/Table/1199/1/2202183733075216192/0,0#15888760941,RANGEKEYDEL-/Local/Range/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073719:1248<#15888760942-#15888760942>[/Local/Lock/Intent/Local/Range/Table/1199/1/2202183733075216192/0,0#15888760942,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073720:1224<#15888760943-#15888760943>[/Local/Lock/Intent/Table/1199/1/2202183733075216192/0,0#15888760943,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073721:1200<#15888760944-#15888760944>[/Table/1199/1/2202183733075216192/0,0#15888760944,RANGEKEYDEL-/Table/1199/1/2220612048833167792/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)

61235499/158638
  log-num:       2073723
  next-file-num: 2073734
  last-seq-num:  15888761334
  added:         L0 2073725:159667<#15888757368-#15888760198>[/Local/RangeID/2/u/RaftHardState/0,0#15888760084,SET-/Table/13/1/2023-03-22T17:47:10.587605Z/850122606474362886/6/1/1679507230.587605315,0#15888759341,SET] (2023-03-22T17:47:10Z)

61235607/158639
  log-num:       2073724
  next-file-num: 2073734
  last-seq-num:  15888761346
  added:         L0 2073711:1539<#15888760199-#15888760199>[/Local/RangeID/191857/r""/0,0#15888760199,RANGEKEYDEL-/Local/RangeID/191857/s""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073716:1179<#15888760200-#15888760200>[/Local/RangeID/191857/u""/0,0#15888760200,RANGEDEL-/Local/RangeID/191857/v""/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073712:1828<#15888760201-#15888760201>[/Local/Range/Table/1199/1/-3381595941584118608/0,0#15888760201,RANGEKEYDEL-/Local/Range/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073713:1248<#15888760202-#15888760202>[/Local/Lock/Intent/Local/Range/Table/1199/1/-3381595941584118608/0,0#15888760202,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L0 2073714:1224<#15888760203-#15888760203>[/Local/Lock/Intent/Table/1199/1/-3381595941584118608/0,0#15888760203,RANGEKEYDEL-/Local/Lock/Intent/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)
  added:         L6 2073715:1200<#15888760204-#15888760204>[/Table/1199/1/-3381595941584118608/0,0#15888760204,RANGEKEYDEL-/Table/1199/1/-3363167625826167008/0,0#inf,RANGEDEL] (2023-03-22T17:47:10Z)

There's no specific recording of which version edits correspond to flushable ingests that overlap with the memtable, but I believe all the ones that ratchet log-num are, since only flushes should be ratcheting log-num.

Jira issue: CRDB-25798

Epic CRDB-27235

jbowens commented 1 year ago

One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.

jbowens commented 1 year ago

Another observation: the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?

bananabrick commented 1 year ago

the end boundaries are all exclusive range deletion or range key sentinels. I wonder if we're handling them appropriately when calculating overlap?

https://github.com/cockroachdb/pebble/blob/master/ingest.go#L413. Gave this and the sstableKeyCompare function a look. I think the behaviour is as expected.
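For anyone else following along, here's a self-contained paraphrase of how the exclusive-sentinel end boundaries (the "#inf" keys in the dumps above) participate in the comparison; the types below are simplified stand-ins, not Pebble's internal keys:

```go
package main

import (
	"fmt"
	"strings"
)

// boundary is a stand-in for an internal key used as an sstable bound: a user
// key plus whether it is an exclusive sentinel (the end key of a
// RANGEDEL/RANGEKEYDEL, shown as "#inf" in the version-edit dumps above).
type boundary struct {
	userKey   string
	exclusive bool
}

// compareBoundaries mirrors the behaviour discussed above: user keys compare
// first; at equal user keys an exclusive sentinel sorts before a regular key,
// so an sstable whose largest bound is an exclusive sentinel at k does not
// overlap a point write at k. Paraphrased, not Pebble's sstableKeyCompare.
func compareBoundaries(a, b boundary) int {
	if c := strings.Compare(a.userKey, b.userKey); c != 0 {
		return c
	}
	switch {
	case a.exclusive && !b.exclusive:
		return -1
	case !a.exclusive && b.exclusive:
		return +1
	default:
		return 0
	}
}

func main() {
	sstLargest := boundary{userKey: "/Local/RangeID/191853/s", exclusive: true}
	memtableKey := boundary{userKey: "/Local/RangeID/191853/s"}
	// Prints -1: the sstable's exclusive end does not cover the memtable key.
	fmt.Println(compareBoundaries(sstLargest, memtableKey))
}
```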

erikgrinaker commented 1 year ago

we thought the ingested sstables should have narrow keyspans local to the associated KV range. We expected that the memtable would not contain any keys that fall within the new replica's six keyspans.

What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.

During splits, we don't really expect snapshots, at least in the common case. If a follower is slow to apply the split trigger, this can cause a spurious Raft message from the split-off leader to instantiate a new Raft replica, which will then request a snapshot, but we reject these since we'd rather wait for the split trigger. It's possible for that replica to apply the split trigger so late that it'll immediately need a snapshot to catch up though. I suppose there can be further race conditions here too -- I'll ask around for details.

One immediate observation: we write RANGEKEYDELs for every keyspace other than the local unreplicated range ID keyspace: /Local/RangeID/191857/u""/. Is that necessary? It would be mildly more efficient to omit RANGEKEYDELs for any keyspaces that can never contain RANGEKEYSETs.

This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.

https://github.com/cockroachdb/cockroach/blob/c2460f1b0fc94c8fcbd98273cec35061494fc2dd/pkg/storage/mvcc.go#L3382-L3387

https://github.com/cockroachdb/cockroach/blob/c2460f1b0fc94c8fcbd98273cec35061494fc2dd/pkg/kv/kvserver/store_snapshot.go#L203-L206
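Not a real patch, just a sketch of the shape of that change; the span classification below is an assumption of the sketch, and the actual change would live in the snapshot SST construction code:

```go
package main

import (
	"fmt"
	"strings"
)

// span is a simplified [start, end) key span used by the sketch.
type span struct{ start, end string }

// mayContainRangeKeys reports whether a span can legally contain MVCC range
// keys today. Per the discussion above, range keys are rejected across the
// local keyspan, so only global (non-/Local) spans qualify; the prefix check
// is a simplification for the sketch.
func mayContainRangeKeys(s span) bool {
	return !strings.HasPrefix(s.start, "/Local/")
}

// writeClearsForSpan emits the clears a snapshot sstable would carry for one
// of the replica's spans, omitting the RANGEKEYDEL where range keys cannot
// exist. A comment in the real code should note that this relies on range
// keys being rejected in the local keyspan.
func writeClearsForSpan(s span) {
	fmt.Printf("RANGEDEL    [%s, %s)\n", s.start, s.end)
	if mayContainRangeKeys(s) {
		fmt.Printf("RANGEKEYDEL [%s, %s)\n", s.start, s.end)
	}
}

func main() {
	writeClearsForSpan(span{start: "/Local/RangeID/191857/u", end: "/Local/RangeID/191857/v"})
	writeClearsForSpan(span{start: "/Table/1199/1/100", end: "/Table/1199/1/200"})
}
```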

jbowens commented 1 year ago

What about snapshots that are sent to catch up followers that have fallen behind on the log? These follower replicas could certainly have applied writes that are still present in the memtable when the catchup snapshot arrives.

Yeah, snapshots for fallen-behind followers make sense as an explanation. But we're seeing many ingests, with ~70% of them ingested as flushable, which seems high if they were all catch-up snapshots.

Gave this and the sstableKeyCompare function a look. I think the behaviour is as expected.

Agreed, it looks correct to me.

This was only done to guard against anyone writing range keys here in the future and forgetting to handle it here. But we now explicitly reject writing them across the local keyspan, because we don't handle them correctly elsewhere either (e.g. in SysBytes MVCC stats). It's thus safe to omit these RANGEKEYDELs for the local keyspans as long as that restriction is in place, and we should add a comment saying as much. I'll submit a PR when I'm back from PTO on Monday, unless anyone beats me to it.

Got it, nice. This should become moot soon enough with the virtual sstables work, which can eliminate RANGEDELs and RANGEKEYDELs with a new ExciseAndIngest operation that virtualizes any overlapping sstables at ingest time to remove existing data.

tbg commented 1 year ago

What does "overlap" mean in that context? Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested? That should in fact be rare, at least on a "seasoned" cluster that sees regular memtable rotations.

As Erik pointed out, certain pathological patterns could have snapshots hit the memtable. For example, if a replica got removed and immediately re-added, we probably still have some of its bits in the memtable (from log application, or even just the raft log, which we also clear in the SSTs). Also, any snapshot that is sent as a result of log truncation will hit an existing replica, so very likely overlaps the memtable as well. These patterns may not be as rare as we'd like. We'd really have to know what snapshots were going on as these flushable ingestions happened.

We do have stats returned on each snapshot ingestion^1, so possibly we could print which snapshots are ingested with memtable overlap. Then we can pick a few and spot-check the logs for their history. This might give us a better intuition on when we see these flushable ingests.
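Something along these lines at the ingestion call site, assuming the ingestion stats expose a memtable-overlap counter (the struct and field names below are assumptions for the sketch, not the current Pebble API):

```go
package main

import "fmt"

// ingestStats is a stand-in for the stats a snapshot ingestion could return;
// MemtableOverlappingFiles is a hypothetical field, not the released Pebble API.
type ingestStats struct {
	Bytes                     uint64
	ApproxIngestedIntoL0Bytes uint64
	MemtableOverlappingFiles  int
}

// maybeLogOverlap shows the spot-check logging suggested above: flag snapshot
// ingestions whose sstables overlapped the memtable, so those snapshots'
// histories can be cross-referenced in the logs.
func maybeLogOverlap(rangeID int64, numSSTs int, stats ingestStats) {
	if stats.MemtableOverlappingFiles == 0 {
		return
	}
	fmt.Printf("r%d: snapshot ingest overlapped memtable (%d/%d sstables, %d bytes into L0)\n",
		rangeID, stats.MemtableOverlappingFiles, numSSTs, stats.ApproxIngestedIntoL0Bytes)
}

func main() {
	maybeLogOverlap(191853, 6, ingestStats{
		Bytes:                     8774,
		ApproxIngestedIntoL0Bytes: 8774 - 1200, // five of six sstables landed in L0 in the first dump above
		MemtableOverlappingFiles:  1,
	})
}
```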

jbowens commented 1 year ago

Do you mean that there are keys in the memtable within the bounds of the SSTs to be ingested?

Exactly.

We do have stats returned on each snapshot ingestion

Good idea, put up cockroachdb/pebble#2422.

I think we can also pull the WALs and manifest from a node and dump them to find the conflicting writes.

nicktrav commented 1 year ago

Actions for follow-up here: look at the test cluster for the proportion of flushable ingests to total ingests.

jbowens commented 1 year ago

Seems like we don't actually have a timeseries for count of total ingestions. (filed #103744)

Looking at the flushable ingestions, there is a periodicity to it that suggests there's a correlation with workload restarts.

[Screenshots: flushable ingestion timeseries from the test cluster (2023-05-22), showing periodic spikes.]

Now that the telemetry cluster has v23.1.x, we can also look there for more evidence.

[Screenshots: ingest-as-flushable and snapshot reception timeseries from the telemetry cluster (2023-05-22).]

The ingest-as-flushable graph appears to mirror the graph of snapshot receptions, suggesting that applying a snapshot almost always results in memtable overlap. I don't understand why. I don't think we can chalk this up to an artifact of the telemetry cluster's artificial workload and split+scatters. If we're able to prevent this memtable overlap, we can likely significantly reduce w-amp (and r-amp) during snapshot reception.

Although arguably this will all be moot with the planned ingest+excise operation. Edit: This isn't true. Ingest+excise won't be able to cheaply "excise" the memtable, at least as designed. This will push all of these ingests onto the non-flushable code path, forcing a "write stall" while waiting for the flush to complete. I think ingest+excise actually increases the importance of avoiding these writes before ingesting the snapshot.

jbowens commented 1 year ago

I grabbed the WALs and MANIFESTs from a test cluster node.

Here's an example flushable ingest in a WAL:

4879581.log
0(48) seq=28463980876 count=6
    INGESTSST(4879575)
    INGESTSST(4879580)
    INGESTSST(4879576)
    INGESTSST(4879577)
    INGESTSST(4879578)
    INGESTSST(4879579)
EOF

and the corresponding version edit:

126468172/322036
  log-num:       4879582
  next-file-num: 4879588
  last-seq-num:  28463981903
  added:         L4 4879575:1526<#28463980876-#28463980876>[/Local/RangeID/318347/r""/0,0#28463980876,RANGEKEYDEL-/Local/RangeID/318347/s""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879576:2311<#28463980878-#28463980878>[/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980878,RANGEKEYDEL-/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879577:1248<#28463980879-#28463980879>[/Local/Lock/Intent/Local/Range/Table/16060/1/-248782262732346608/0,0#28463980879,RANGEKEYDEL-/Local/Lock/Intent/Local/Range/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L5 4879578:1224<#28463980880-#28463980880>[/Local/Lock/Intent/Table/16060/1/-248782262732346608/0,0#28463980880,RANGEKEYDEL-/Local/Lock/Intent/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
  added:         L3 4879579:263676<#28463980881-#28463980881>[/Table/16060/1/-248782262732346608/0,0#28463980881,RANGEKEYSET-/Table/16060/1/-230353946974395008/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)
126468919/322037

Judging by the fact that 4879580 is the only sstable to ingest into L0, it seems that's where the memtable overlap was:

  added:         L0 4879580:1181<#28463980877-#28463980877>[/Local/RangeID/318347/u""/0,0#28463980877,RANGEDEL-/Local/RangeID/318347/v""/0,0#inf,RANGEDEL] (2023-05-22T21:34:55Z)

In the final 1% of the previous WAL 4879549.log, there are two batches containing overlapping keys:

12083692(48) seq=28463980624 count=1
    SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY
)
12084163(52) seq=28463980626 count=1
    SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY
)

Are these keys expected to be written before the replica's snapshot is ingested? Is there some way we could avoid writing them until the snapshot has been ingested?

erikgrinaker commented 1 year ago

@pavelkalinnikov Can you have a look at the above?

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

pav-kv commented 1 year ago
12083692(48) seq=28463980624 count=1
    SET(/Local/RangeID/318347/u"rftr"/0,0,Term:0 Index:0 Type:6 : EMPTY
)

This key is always written by getOrCreateReplica when a replica is first "encountered" on a store (either by means of a split or a snapshot application). This includes the case when a replica is initialized from a snapshot below the HandleSnapshot call. Only then do we handle raft ready (https://github.com/cockroachdb/cockroach/blob/d6ca1394d1fbbda2bf3ce3f5bfe1a78d096719bf/pkg/kv/kvserver/store_raft.go#L465), which eventually ingests the snapshot.

So, one possible optimization is to avoid this write, because the snapshot ingestion writes this key anyway. This doesn't seem trivial at a quick glance though, because of the way getOrCreateReplica is coupled with other paths. Still looking into it. Alternatively, perhaps we could avoid writing the range deletes in applySnapshot if we know the keys are not present.

12084163(52) seq=28463980626 count=1
    SET(/Local/RangeID/318347/u/RaftHardState/0,0,Term:0 Index:0 Type:7 : EMPTY
)

I'm not sure why this key is already written. The getOrCreateReplica path doesn't write it, so it must have been something else.

sumeerbhola commented 1 year ago

RaftHardStateKey and RaftReplicaIDKey are both written for an uninitialized replica (see CreateUninitializedReplica)

pav-kv commented 1 year ago

@sumeerbhola I can only see the RaftReplicaIDKey write there. I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere (hence we see them in separate batches in @jbowens's comment?).

Before any of these flows, it's possible that only RaftReplicaIDKey is written (e.g. we've got some requests for this replica before it got initialized)?

sumeerbhola commented 1 year ago

I think the RaftHardStateKey is written by each of the two initialization flows, but elsewhere

The HardState must be getting written somewhere for an uninitialized replica too, since I think an uninitialized replica can vote (so needs to remember it). This is also the reason for the dance in https://github.com/cockroachdb/cockroach/blob/f397cf964c8d41f4b5eb90fe330298ce65fe818b/pkg/kv/kvserver/store_split.go#L69-L81

pav-kv commented 1 year ago

Another possible scenario:

  1. getOrCreateReplica creates an uninitialized replica and writes the RaftReplicaIDKey
  2. Then this replica votes (it's almost the only thing it can do) and stores the HardState (see the sketch after this list)
  3. (alternatively) I believe it can store a HardState even without voting if it discovers a non-zero Term or Commit from other nodes (I vaguely remember seeing this happen)
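To illustrate scenarios 2 and 3, here's a minimal sketch of the raft Ready-loop shape that produces the HardState write; the types are stand-ins, not CockroachDB or raft library code:

```go
package main

import "fmt"

// hardState is a stand-in for raft's durable HardState (Term, Vote, Commit).
type hardState struct {
	Term, Vote, Commit uint64
}

// persistIfNonEmpty sketches scenarios 2 and 3 above: when raft hands back a
// non-empty HardState (because the uninitialized replica voted, or merely
// learned a higher term), the replica must persist it before acking, and that
// write is what lands RaftHardStateKey in the memtable ahead of the snapshot
// ingest. This only mirrors the shape of a raft Ready loop.
func persistIfNonEmpty(hs hardState, persist func(hardState)) {
	if hs != (hardState{}) {
		persist(hs) // e.g. a put of /Local/RangeID/<id>/u/RaftHardState
	}
}

func main() {
	// Term 6 with no vote matches the HardState observed later in this thread
	// for uninitialized replicas awaiting an INITIAL snapshot.
	persistIfNonEmpty(hardState{Term: 6}, func(hs hardState) {
		fmt.Printf("persisting HardState %+v for an uninitialized replica\n", hs)
	})
}
```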

pav-kv commented 1 year ago

@sumeerbhola Yeah, I think you're right. In the scenario above both keys can be written, and in separate batches.

pav-kv commented 1 year ago

We've discussed getting rid of uninitialized replicas (or at least making them not do any IO) previously. Could be good to revisit this. No uninit replicas / IO => no RaftReplicaIDKey / HardState keys written before ingestion => more efficient ingestion.

sumeerbhola commented 1 year ago

I am wondering how much of the work in writeUnreplicatedSST is optional for the case of a new uninitialized range followed by initialization https://github.com/cockroachdb/cockroach/blob/f397cf964c8d41f4b5eb90fe330298ce65fe818b/pkg/kv/kvserver/replica_raftstorage.go#L750-L761

There will be no raft log, RaftTruncatedStateKey, RangeLastReplicaGCTimestampKey. We can read RaftReplicaIDKey to check that the replica id hasn't changed. And if the HardState has not changed we could avoid doing any of that work (fast-path). But perhaps HardState.Commit will change since IIRC, that value must be 0 for an uninitialized replica.
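To make the shape of that fast-path concrete, a rough sketch under the stated assumptions (the types and the decision itself are stand-ins, not the real stateloader API; whether HardState.Commit can be written separately is exactly the open question):

```go
package main

import "fmt"

// hardState and unreplicatedState are stand-ins; the helper below only
// illustrates the decision being discussed.
type hardState struct{ Term, Vote, Commit uint64 }

type unreplicatedState struct {
	replicaID uint64
	hardState hardState
	// For a new uninitialized replica there is no raft log, no
	// RaftTruncatedStateKey, and no RangeLastReplicaGCTimestampKey.
}

// needsFullUnreplicatedSST reports whether the snapshot must clear and rewrite
// the unreplicated keyspan, or whether (fast-path) only HardState.Commit needs
// to move, since Commit is 0 for an uninitialized replica and becomes the
// snapshot's applied index.
func needsFullUnreplicatedSST(existing unreplicatedState, snapReplicaID uint64, snapHS hardState) bool {
	if existing.replicaID != snapReplicaID {
		return true // replica ID changed: clear and rewrite everything
	}
	if existing.hardState.Term != snapHS.Term || existing.hardState.Vote != snapHS.Vote {
		return true // Term/Vote changed: HardState must be fully rewritten
	}
	return false // only Commit differs: candidate for the fast-path
}

func main() {
	existing := unreplicatedState{replicaID: 2, hardState: hardState{Term: 6}}
	fmt.Println(needsFullUnreplicatedSST(existing, 2, hardState{Term: 6, Commit: 153}))
}
```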

btw, are we accidentally clearing the RangeTombstoneKey, which would be a bug?

nvanbenschoten commented 1 year ago

To verify the theories from above, I added the following instrumentation to snapshot ingestion and spun up a multi-node local cluster:

--- a/pkg/kv/kvserver/replica_raftstorage.go
+++ b/pkg/kv/kvserver/replica_raftstorage.go
@@ -514,6 +514,33 @@ func (r *Replica) applySnapshot(
                log.Infof(ctx, "applied %s (%s)", inSnap, logDetails)
        }(timeutil.Now())

+       // Look at what keys are already written into the unreplicated key space.
+       {
+               unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(r.ID().RangeID)
+               unreplicatedStart := unreplicatedPrefixKey
+               unreplicatedEnd := unreplicatedPrefixKey.PrefixEnd()
+               it := r.store.TODOEngine().NewEngineIterator(storage.IterOptions{UpperBound: unreplicatedEnd})
+               defer it.Close()
+               var ok bool
+               var err error
+               for ok, err = it.SeekEngineKeyGE(storage.EngineKey{Key: unreplicatedStart}); ok; ok, err = it.NextEngineKey() {
+                       key, err := it.UnsafeEngineKey()
+                       if err != nil {
+                               panic(err)
+                       }
+                       log.Infof(ctx, "found unreplicated key: %s", key.Key)
+               }
+               if err != nil {
+                       panic(err)
+               }
+
+               prevHS, err := r.raftMu.stateLoader.StateLoader.LoadHardState(ctx, r.store.TODOEngine())
+               if err != nil {
+                       panic(err)
+               }
+               log.Infof(ctx, "found prev HardState: %+v", prevHS)
+       }
+
        unreplicatedSSTFile, nonempty, err := writeUnreplicatedSST(

Each INITIAL snapshot (during upreplication) looked the same. Each had a RaftHardStateKey and a RaftReplicaIDKey. The HardState's vote was 0 (the common case, but not always), its commit index was 0 (I believe this is always the case), and its term was 6.

I230608 04:22:04.625626 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 128  found unreplicated key: /Local/RangeID‹/13›/‹u›/‹RaftHardState›
I230608 04:22:04.625641 1331 kv/kvserver/replica_raftstorage.go:531 ⋮ [T1,n2,s2,r13/2:{-}] 129  found unreplicated key: /Local/RangeID‹/13›/‹u›‹"rftr"›
I230608 04:22:04.625674 1331 kv/kvserver/replica_raftstorage.go:541 ⋮ [T1,n2,s2,r13/2:{-}] 130  found prev HardState: {Term:6 Vote:0 Commit:0}
I230608 04:22:04.684671 1331 kv/kvserver/replica_raftstorage.go:514 ⋮ [T1,n2,s2,r13/2:‹/Table/1{1-2}›] 131  applied INITIAL snapshot 35998b43 from (n1,s1):1 at applied index 153 (total=59ms data=1.8 KiB ingestion=6@49ms)

The RaftReplicaIDKey is written during replica creation. It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.
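A sketch of the span arithmetic that "splitting the range deletion" implies, carving the RaftReplicaIDKey (the "rftr" key in the WAL dump above) out of the cleared span; keys are plain strings here, and the real code would use the pkg/keys helpers:

```go
package main

import "fmt"

// span is a simplified [start, end) key span; keys are plain strings here.
type span struct{ start, end string }

// splitAround returns the two spans to clear when carving a single key out of
// a larger clear, so that key (here the RaftReplicaIDKey) does not need to be
// deleted and rewritten by the snapshot. The "\x00" append stands in for
// roachpb.Key.Next().
func splitAround(full span, skipKey string) []span {
	return []span{
		{start: full.start, end: skipKey},
		{start: skipKey + "\x00", end: full.end},
	}
}

func main() {
	full := span{start: "/Local/RangeID/318347/u", end: "/Local/RangeID/318347/v"}
	for _, s := range splitAround(full, `/Local/RangeID/318347/u"rftr"`) {
		fmt.Printf("RANGEDEL [%s, %s)\n", s.start, s.end)
	}
}
```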

It is less clear what we should do with the RaftHardStateKey. Snapshots are always accompanied by an updated HardState whose commit index is set to the index of the snapshot. We currently write this updated HardState into the snapshot's ssts, so that the HardState is updated atomically with the snapshot ingestion. Is this necessary for correctness? Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion in order to avoid the key collision with the memtable during ingestion?

jbowens commented 1 year ago

It shouldn't be changed by the snapshot, so in theory, if we avoided deleting it by splitting the range deletion, we could also avoid re-writing it.

We're hoping to remove the range deletion, replacing it with an atomic ingest+excise operation that virtualizes overlapping sstables to hide existing data. This will allow ingested snapshot sstables to unconditionally ingest into L6. While we could separately ingest+excise on either side of this key, it'd be preferable to avoid the tiny files.

sumeerbhola commented 1 year ago

Typically the commit index in the HardState does not require durability and can be reconstructed after a crash. Does that provide us any flexibility to write the new HardState in a separate batch, either before or after the SST ingestion in order to avoid the key collision with the memtable during ingestion?

The ReplicasStorage (separated raft log) design included an init step at crash recovery time that would see if HardState.Commit < RangeAppliedState.RaftAppliedIndex and if so update it to make it equal to RangeAppliedState.RaftAppliedIndex. I don't know if we already do something like this. If we do (or can), we could check that the only thing that needs updating is HardState.Commit (as you outlined above) and then write it after the ingest (the separated raft log design calls for the invariant that HardState.Commit is 0 for an uninitialized replica, so I don't think we should write it before the ingest). @tbg
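A minimal sketch of that init step, with placeholder names for loading and persisting the two values:

```go
package main

import "fmt"

// maybeRatchetCommit sketches the init step described above: at crash
// recovery, if the durable HardState.Commit lags RangeAppliedState's
// RaftAppliedIndex, ratchet Commit forward to match. The setter is a
// placeholder for however the real code would rewrite the HardState.
func maybeRatchetCommit(hardStateCommit, raftAppliedIndex uint64, setCommit func(uint64)) {
	if hardStateCommit < raftAppliedIndex {
		setCommit(raftAppliedIndex)
	}
}

func main() {
	maybeRatchetCommit(0, 153, func(c uint64) {
		fmt.Printf("recovered HardState.Commit -> %d\n", c)
	})
}
```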

tbg commented 1 year ago

Typically the commit index in the HardState does not require durability and can be reconstructed after a crash.

I would tread carefully here due to apply-time conf changes. There is a secret invariant that you can't have more than one "committed but not known committed" confchange in the log. I don't know how much this can be relaxed, but either way I am not terribly motivated to relax anything in that area, since it is generally poorly understood and problems would not be discovered until it is potentially too late.

I was actually thinking the other day that we should just get out of the business of apply-time conf changes. I filed an issue^1 to that effect. With that, the Committed index would be entirely discretionary, I think, modulo, perhaps, additional meaning we've attached to it with replica quiescence, which one day might be on its way out^2, too.