cockroachdb / cockroach


roachtest: failover/non-system/blackhole-recv failed #100101

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 1 year ago

roachtest.failover/non-system/blackhole-recv failed with artifacts on release-23.1 @ aec78f33d45a8376a0ecec885688bae60dbfb85c:

test artifacts and logs in: /artifacts/failover/non-system/blackhole-recv/run_1
(cluster.go:1977).Run: output in run_125335.248437892_n5_sudo-iptables-F: sudo iptables -F returned: context canceled
(cluster.go:1977).Run: output in run_124335.247893726_n7_cockroach-workload-r: ./cockroach workload run kv --read-percent 50 --duration 20m --concurrency 256 --max-rate 2048 --timeout 1m --tolerate-errors --histograms=perf/stats.json {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 7)
(cluster.go:1977).Run: cluster.RunE: context canceled

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-26249

Epic CRDB-27234

pav-kv commented 1 year ago

@renatolabs Why is this closed? Is this a duplicate?

pav-kv commented 1 year ago

Closed timestamp regression:

W230330 12:53:35.132881 277928 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n6] 2175  closed timestamps side-transport connection dropped from node: 4
F230330 12:53:35.116584 314094 kv/kvserver/replica_closedts.go:263 ⋮ [n6] 2164  side-transport update saw closed timestamp regression on r716: (lai=362, ts=1680180806.029715582,0) -> (lai=360, ts=1680180811.914306113,0)

Looks like an instance of #98698 (and dup of #99931).
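
For context, here is a minimal sketch of the invariant that this fatal assertion guards (hypothetical ctEntry/forward names, not the actual code in pkg/kv/kvserver/replica_closedts.go): a side-transport entry promises that a replica which has applied up to the given LAI may serve reads below the closed timestamp, so a later update must not advance the closed timestamp while carrying a lower LAI.

```go
// Minimal sketch of the invariant only; not the CockroachDB implementation.
package main

import (
	"fmt"
	"time"
)

type ctEntry struct {
	lai      uint64    // lease applied index the closed timestamp is tied to
	closedTS time.Time // reads below this are safe once lai has been applied
}

// forward merges an incoming side-transport update into the current state and
// reports the kind of regression that crashed n6: a later update closing a
// higher timestamp at a lower LAI, which is only possible if two nodes both
// believe they hold the lease.
func forward(cur, next ctEntry) (ctEntry, error) {
	if next.closedTS.After(cur.closedTS) && next.lai < cur.lai {
		return cur, fmt.Errorf("closed timestamp regression: (lai=%d, ts=%v) -> (lai=%d, ts=%v)",
			cur.lai, cur.closedTS, next.lai, next.closedTS)
	}
	if next.closedTS.After(cur.closedTS) {
		cur = next
	}
	return cur, nil
}

func main() {
	cur := ctEntry{lai: 362, closedTS: time.Unix(0, 1680180806029715582)}
	next := ctEntry{lai: 360, closedTS: time.Unix(0, 1680180811914306113)}
	if _, err := forward(cur, next); err != nil {
		fmt.Println(err) // mirrors the fatal message above
	}
}
```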

renatolabs commented 1 year ago

@renatolabs Why is this closed? Is this a duplicate?

Hrmm, I closed this one by accident when cleaning up the failures related to the recent workload changes. I did realize the mistake and immediately reopened it, but it seems Exalate went ahead and closed it again. 🤦

Reopening.

nvanbenschoten commented 1 year ago

Removing the release-blocker label now that we've added it back on https://github.com/cockroachdb/cockroach/issues/98698.

pav-kv commented 1 year ago

The artifacts here are particularly useful: there is a debug.zip.

Here is the r716 lease history (while I'm looking at it):

{
  "lease_history": [
    {
      "start": {
        "wall_time": 1680180189854194400
      },
      "replica": {
        "node_id": 4,
        "store_id": 4,
        "replica_id": 1,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1680180189858458600
      },
      "epoch": 1,
      "sequence": 11,
      "acquisition_type": 2
    },
    {
      "start": {
        "wall_time": 1680180195234137900
      },
      "expiration": {
        "wall_time": 1680180201234054000
      },
      "replica": {
        "node_id": 5,
        "store_id": 5,
        "replica_id": 2,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1680180195234054000
      },
      "sequence": 12,
      "acquisition_type": 1
    },
    {
      "start": {
        "wall_time": 1680180195234137900
      },
      "replica": {
        "node_id": 5,
        "store_id": 5,
        "replica_id": 2,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1680180195243592700
      },
      "epoch": 1,
      "sequence": 13,
      "acquisition_type": 2
    },
    {
      "start": {
        "wall_time": 1680180762032166000
      },
      "expiration": {
        "wall_time": 1680180768031890400
      },
      "replica": {
        "node_id": 6,
        "store_id": 6,
        "replica_id": 3,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1680180762031890400
      },
      "sequence": 14,
      "acquisition_type": 1
    },
    {
      "start": {
        "wall_time": 1680180768031890400,
        "logical": 1
      },
      "replica": {
        "node_id": 4,
        "store_id": 4,
        "replica_id": 1,
        "type": 0
      },
      "proposed_ts": {
        "wall_time": 1680180769939793400
      },
      "epoch": 1,
      "sequence": 15,
      "acquisition_type": 2
    }
  ]
}

pav-kv commented 1 year ago

Range 716 has replicas on nodes 4 (leader), 5 (follower), 6 (follower, the failed node).

pav-kv commented 1 year ago

Node 4 holds the latest lease from 1680180769 (12:52:49), which is 37 seconds earlier than the closed timestamp 1680180806 (12:53:26) that n6 knew about. The problematic timestamp (with a lower LAI) is 12:53:31, which is 4s before the panic.
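
(As an aside, the wall_time values in the lease history and the closed timestamps are Unix nanoseconds; a throwaway snippet like the following, with hand-picked values, reproduces the clock times used in this thread.)

```go
// Throwaway helper (not part of the artifacts) to convert nanosecond
// wall_time / closed-timestamp values into the HH:MM:SS times used above.
package main

import (
	"fmt"
	"time"
)

func main() {
	for _, ns := range []int64{
		1680180769939793400, // proposed_ts of the last lease on n4 -> 12:52:49
		1680180806029715582, // first conflicting closed ts (LAI=362) -> 12:53:26
		1680180811914306113, // regressing closed ts (LAI=360) -> 12:53:31
	} {
		fmt.Println(time.Unix(0, ns).UTC().Format("15:04:05.000"))
	}
}
```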

Notably, the panic happens right after the side-transport stream from n4 is closed on n6 [see the log above], so the problematic update must be coming from a side-transport from another node.

The last couple of things that happened before the panic were the test blackholing traffic to n5 and then re-enabling it:

12:52:37 failover.go:285: test status: failing n5 (blackhole-recv)
12:53:35 failover.go:294: test status: recovering n5 (blackhole-recv)

Possibly the problematic update came from n5.

pav-kv commented 1 year ago

r716 is quiescent on n4 and n5 at 1680180878 (12:54:38), when the debug.zip was taken; this is ~1 min after the panic. But possibly the range was quiescent during the panic too.

pav-kv commented 1 year ago

From 12:50:07, nodes 4 and 5 keep transferring leases to node 6 (and node 6 stays the only node to which leases are being transferred, up until the panic at 12:53:35):

replicate_queue.go:2037 ⋮ [T1,n4,replicate,s4,r779/3:‹/Table/106/1/-24{4175…-2332…}›] 3965  transferring lease to s6

Probably because n6 was recovered from the blackhole shortly before 12:50:

12:44:37 failover.go:285: test status: failing n4 (blackhole-recv)
12:45:35 failover.go:294: test status: recovering n4 (blackhole-recv)
12:46:35 failover.go:285: test status: failing n5 (blackhole-recv)
12:47:35 failover.go:294: test status: recovering n5 (blackhole-recv)
12:48:35 failover.go:285: test status: failing n6 (blackhole-recv)
12:49:35 failover.go:294: test status: recovering n6 (blackhole-recv)
12:50:37 failover.go:285: test status: failing n4 (blackhole-recv)
12:51:35 failover.go:294: test status: recovering n4 (blackhole-recv)
12:52:37 failover.go:285: test status: failing n5 (blackhole-recv)
12:53:35 failover.go:294: test status: recovering n5 (blackhole-recv)

The last such events relating to r716 were:

I230330 12:52:42.031844 replicate_queue.go:2037 ⋮ [T1,n5,replicate,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›] 3555  transferring lease to s6
...
I230330 12:53:30.981639 replicate_queue.go:842 ⋮ [T1,n5,replicate,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›] 3556  error processing replica: [n5,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›]: unable to transfer lease to s6: refusing to transfer lease to (n6,s6):3 because target may need a Raft snapshot: ‹replica in StateProbe›‹›
E230330 12:53:30.981772 queue.go:1128 ⋮ [T1,n5,replicate,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›] 1729  [n5,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›]: unable to transfer lease to s6: refusing to transfer lease to (n6,s6):3 because target may need a Raft snapshot: ‹replica in StateProbe›

This is exactly when n5 is in the blackhole (but only for receive traffic).

pav-kv commented 1 year ago

Some more logs all merged:

12:43:15.234 r716 lease at n5 started

12:44:37 failover.go:285: test status: failing n4 (blackhole-recv)
12:45:35 failover.go:294: test status: recovering n4 (blackhole-recv)
12:46:35 failover.go:285: test status: failing n5 (blackhole-recv)
12:47:35 failover.go:294: test status: recovering n5 (blackhole-recv)
12:48:35 failover.go:285: test status: failing n6 (blackhole-recv)
12:48:46.194861 [n5,r118] incremented n6 liveness epoch to 2
12:49:35 failover.go:294: test status: recovering n6 (blackhole-recv)
12:50:37 failover.go:285: test status: failing n4 (blackhole-recv)
12:51:35 failover.go:294: test status: recovering n4 (blackhole-recv)
12:52:37 failover.go:285: test status: failing n5 (blackhole-recv)

// a bunch of errors like 
12:52:41.027754 receiver.go:139 ⋮ [T1,n5] 1515  closed timestamps side-transport connection dropped from node: 6
// or
12:52:45.284134 receiver.go:139 ⋮ [T1,n4] 1885  closed timestamps side-transport connection dropped from node: 5
// or
12:52:53.487552 sender.go:795 ⋮ [T1,n6,ctstream=5] 2090  side-transport failed to connect to n5: failed to connect to n5 at ‹10.142.0.39:26257›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.39:26257: i/o timeout"› [code 14/Unavailable]
// or
12:52:55.733427 sender.go:795 ⋮ [T1,n5,ctstream=4] 1588  side-transport failed to connect to n4: unable to dial n4: ‹breaker open›
// i.e., all related to the fact that n5 is blackholed

I230330 12:52:42.031844 replicate_queue.go:2037 ⋮ [T1,n5,replicate,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›] 3555  transferring lease to s6

# interesting block

12:52:42.032 r716 lease at n6 started (the cooperative transfer above succeeded)
12:52:48     r716 lease at n4 started (the last one)

12:53:26.029 the first of the two conflicting updates (LAI = 362)
+3s (target closed ts duration) is when it was generated/sent

# note that n5 is still disconnected and thinks it was unable to transfer lease

I230330 12:53:30.981639 replicate_queue.go:842 ⋮ [T1,n5,replicate,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›] 3556  error processing replica: [n5,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›]: unable to transfer lease to s6: refusing to transfer lease to (n6,s6):3 because target may need a Raft snapshot: ‹replica in StateProbe›‹›
E230330 12:53:30.981772 queue.go:1128 ⋮ [T1,n5,replicate,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›] 1729  [n5,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›]: unable to transfer lease to s6: refusing to transfer lease to (n6,s6):3 because target may need a Raft snapshot: ‹replica in StateProbe›

12:53:31.914 the second conflicting/regressing update (LAI = 360)
+3s (target closed ts duration) is when it was generated/sent

12:53:35 failover.go:294: test status: recovering n5 (blackhole-recv)

W230330 12:53:35.132881 277928 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n6] 2175  closed timestamps side-transport connection dropped from node: 4
F230330 12:53:35.116584 314094 kv/kvserver/replica_closedts.go:263 ⋮ [n6] 2164  side-transport update saw closed timestamp regression on r716: (lai=362, ts=1680180806.029715582,0) -> (lai=360, ts=1680180811.914306113,0)

12:53:40.276768 [n5,s5,r113] incremented n6 liveness epoch to 3

It's unclear why the side-transport from n4 to n6 was dropped at 12:53:35: node 4 had not been under a blackhole for the preceding 2 minutes.
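
(The "+3s (target closed ts duration)" annotations above assume the default kv.closed_timestamp.target_duration of 3s: the side-transport publishes closed timestamps that lag the sender's clock by roughly that duration, so the send time can be estimated as the closed timestamp plus 3s. A quick sanity check, as a hypothetical snippet:)

```go
// Quick check of the "+3s (target closed ts duration)" estimates above,
// assuming the default kv.closed_timestamp.target_duration of 3s.
package main

import (
	"fmt"
	"time"
)

func main() {
	const targetDuration = 3 * time.Second
	// The sender closes now() - targetDuration, so send time ≈ closed ts + 3s.
	for _, ns := range []int64{
		1680180806029715582, // closed ts of the LAI=362 update -> sent ~12:53:29.029
		1680180811914306113, // closed ts of the LAI=360 update -> sent ~12:53:34.914
	} {
		sent := time.Unix(0, ns).UTC().Add(targetDuration)
		fmt.Println(sent.Format("15:04:05.000"))
	}
}
```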

pav-kv commented 1 year ago

Nodes 4 and 5 show the same closed timestamp info:

"closed_timestamp_sidetransport_info": {
  "replica_closed": {
    "wall_time": 1680180875083530829  // 12:54:35.083 (1 minute after the crash)
  },
  "replica_lai": 363,
  "central_closed": {}
}

Notably, the LAI has advanced by 1 since then (compared to the LAI that killed n6).

pav-kv commented 1 year ago

Better formatted log with interesting events around leaseholder shuffling:

12:43:15.234 r716 lease at n5 started

12:44:37 failover.go:285: test status: failing n4 (blackhole-recv)
12:45:35 failover.go:294: test status: recovering n4 (blackhole-recv)
12:46:35 failover.go:285: test status: failing n5 (blackhole-recv)
12:47:35 failover.go:294: test status: recovering n5 (blackhole-recv)
12:48:35 failover.go:285: test status: failing n6 (blackhole-recv)
12:48:46.194861 [n5,r118] incremented n6 liveness epoch to 2
12:49:35 failover.go:294: test status: recovering n6 (blackhole-recv)
12:50:37 failover.go:285: test status: failing n4 (blackhole-recv)
12:51:35 failover.go:294: test status: recovering n4 (blackhole-recv)
12:52:37 failover.go:285: test status: failing n5 (blackhole-recv)

12:52:42.031844 [n5,r716] transferring lease to s6
12:52:42.032    r716 proposed lease at n6 start time
                [lease expiration time = 12:52:48.031 = start of lease on n4]
12:52:48.031    r716 lease at n4 start time

12:53:26.029 the timestamp of CT update with LAI=362
12:53:29.029 [+3s] is probably when it was generated/sent

# looks like the lease transfer request at 12:52:42 only got evaluated now?
# How did it end up in the lease history if it was rejected here?
12:53:30.981639 [n5,r716] error processing replica: <>: unable to transfer lease to s6: refusing to transfer lease to (n6,s6):3 because target may need a Raft snapshot: ‹replica in StateProbe›‹›

12:53:31.741976 [n6] closed timestamps side-transport connection dropped from node: 5

12:53:31.914 the timestamp of CT update with LAI=360 (regression)
             [about 5.8846s apart from the previous timestamp, suspiciously close to 6s]
12:53:32.232139 [T1,n5,rnode=6] connection is now ready
12:53:34.914 [+3s] is probably when it was generated/sent

# want to increment expiration to 12:53:34.981 (in the past), but already have 12:53:40.983
12:53:34.986817 [n4,r129] failed to increment leaseholder's epoch: mismatch incrementing epoch for ‹liveness(nid:5 epo:1 exp:1680180814.981710470,0)›; actual is ‹liveness(nid:5 epo:1 exp:1680180820.983477229,0)›
12:53:34.987456 [n4,r590] failed to increment leaseholder's epoch: mismatch incrementing epoch for ‹liveness(nid:5 epo:1 exp:1680180814.981710470,0)›; actual is ‹liveness(nid:5 epo:1 exp:1680180820.983477229,0)›

12:53:35 failover.go:294: test status: recovering n5 (blackhole-recv)

12:53:35.132881 [n6] closed timestamps side-transport connection dropped from node: 4
12:53:35.116546 [n6] the server is terminating due to a fatal error (see the DEV channel for details)
12:53:35.116584 [n6] side-transport update saw closed timestamp regression on r716: (lai=362, ts=1680180806.029715582,0) -> (lai=360, ts=1680180811.914306113,0)

12:53:40.276768 [n5,s5,r113] incremented n6 liveness epoch to 3

I found a stack trace of the "unable to transfer lease to s6" error in the #98698 logs, which show a similar situation. The proposal goes all the way through the proposal buffer before being rejected.

[T1,n5,s5,store-rebalancer] 5168  unable to transfer lease to s4: [n5,s5,r1049/3:‹/Table/106/1/-43{3986…-2144…}›]: unable to transfer lease to s4: refusing to transfer lease to (n4,s4):2 because target may need a Raft snapshot: ‹replica in StateProbe›
[T1,n5,s5,store-rebalancer] 1936  unable to transfer lease to s4: [n5,s5,r1049/3:‹/Table/106/1/-43{3986…-2144…}›]: unable to transfer lease to s4: refusing to transfer lease to (n4,s4):2 because target may need a Raft snapshot: ‹replica in StateProbe›
[T1,n5,s5,store-rebalancer]
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicateQueue).TransferLease
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replicate_queue.go:1998
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).applyLeaseRebalance.func1
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:480
  | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
  |     github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).applyLeaseRebalance
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:479
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:374
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).Start.func1
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:273
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  |     github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
  | runtime.goexit
  |     GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (2) [n5,s5,r1049/3:‹/Table/106/1/-43{3986…-2144…}›]: unable to transfer lease to s4
Wraps: (3) forced error mark
  | ‹"lease transfer rejected because the target may need a snapshot"›
  | github.com/cockroachdb/errors/withstack/*withstack.withStack::
Wraps: (4)
  | (opaque error wrapper)
  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
  | reportable 0:
  |
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.NewLeaseTransferRejectedBecauseTargetMayNeedSnapshotError
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_range_lease.go:1019
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaProposer).rejectProposalWithLeaseTransferRejectedLocked
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:1291
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*propBuf).maybeRejectUnsafeProposalLocked
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:710
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*propBuf).FlushLockedWithRaftGroup
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:429
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked.func2
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:764
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).withRaftGroupLocked.func1
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1937
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).withRaftGroupLocked
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1938
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:743
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:699
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:644
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker
  |     github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:333
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  |     github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
  | runtime.goexit
  |     GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (5) refusing to transfer lease to (n4,s4):2 because target may need a Raft snapshot: ‹replica in StateProbe›
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *markers.withMark (4) *errbase.opaqueWrapper (5) *errutil.leafError

pav-kv commented 1 year ago

Notably, both in this failure and in the other one, the problematic range is almost the only one that shows up in the logs with the "unable to transfer lease to" message.

In both failures, node 5 (while under a recv blackhole) attempts to transfer the lease for this range and thinks that it failed. But here we see that it actually succeeds. Shortly after that, the CT regression fires.

In this issue's logs, there are only 2 instances of such a message. It's possible that we catch a panic only after the second one because it happens to have a more unlucky distribution of roles: e.g. the leader is on a different node, or made a bad move.


In the other failure's logs (#98698), there are multiple lease transfer errors, but only the problematic range thinks it failed to move the lease during a blackhole, and only that one had "because target may need a Raft snapshot" as the reason. Other ranges attempted it shortly after the blackhole went away, and failed with a slightly different error:

02:39:24.499096 [T1,n5,replicate,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›] 5074  error processing replica: [n5,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›]: unable to transfer lease to s4: cannot replace lease repl=(n4,s4):3 seq=17 start=1678502327.939837965,0 epo=2 pro=1678502327.939837965,0 with repl=(n4,s4):3 seq=0 start=1678502364.498227325,0 exp=1678502312.943378934,0 pro=1678502306.943378934,0: ‹illegal lease: epoch=0, interval=[1678502364.498227325,0, 1678502312.943378934,0)›‹›

The proposed expiration-based lease contains out-of-order timestamps (expiration < start). Not sure why this can happen.

pav-kv commented 1 year ago

The corresponding metric:

nodes/1/crdb_internal.node_metrics.txt:1        leases.transfers.error  0
nodes/2/crdb_internal.node_metrics.txt:2        leases.transfers.error  0
nodes/3/crdb_internal.node_metrics.txt:3        leases.transfers.error  0
nodes/4/crdb_internal.node_metrics.txt:4        leases.transfers.error  1
nodes/5/crdb_internal.node_metrics.txt:5        leases.transfers.error  1  <--

erikgrinaker commented 1 year ago

In both failures, node 5 (while under a recv blackhole) attempts to transfer the lease for this range and thinks that it failed. But here we see that it actually succeeds.

It can't have succeeded though, because that error prevents it from issuing the lease transfer request -- it errors out first.

https://github.com/cockroachdb/cockroach/blob/186a85ce5d6a5485f4f9585d6412df11efefa52d/pkg/kv/kvserver/replica_range_lease.go#L913-L923

So who transferred the lease?

erikgrinaker commented 1 year ago

The proposed expiration-based lease contains out-of-order timestamps (expiration < start). Not sure why this can happen.

This is a bug. We should find out why this happened.

pav-kv commented 1 year ago

It can't have succeeded though, because that error prevents it from issuing the lease transfer request -- it errors out first. So who transferred the lease?

The lease transfer happened 48s before the message saying the transfer failed: https://github.com/cockroachdb/cockroach/issues/100101#issuecomment-1516526268. Maybe the first transfer somehow went through.

Also you can see that the lease transfer is cooperative, and initiated by n5's replication queue:

I230330 12:52:42.031844 replicate_queue.go:2037 ⋮ [T1,n5,replicate,s5,r716/2:‹/Table/106/1/8{39409…-41252…}›] 3555  transferring lease to s6

The fact that n5 is still trying to transfer the lease 48s later indicates that it wasn't aware that the transfer had succeeded.

erikgrinaker commented 1 year ago

I think you're right that the problem is that we're not bumping n5's epoch, but the details are still a bit fuzzy to me.

I think what happens is that n4 and n6 are able to establish a new lease on n4, and that n5 doesn't hear about this new lease. We also don't bump n5's epoch, because n5 can still heartbeat, so there's no reason for n4/n6 to bump its epoch. We're then in a situation where both n4 and n5 think they have the lease. This is really, really bad.

I also think that both n4 and n5 think they're the Raft leader. n4 is the real leader, but n5 still thinks it's the leader too because no one has told it otherwise. That's sort of okay though, and expected without CheckQuorum.

erikgrinaker commented 1 year ago

The thing I can't square here is that when n5 transfers the lease to n6, it must have applied that lease locally.

12:52:42.031844 [n5,r716] transferring lease to s6
12:52:42.032    r716 lease at n6 started (the cooperative transfer above succeeded)
                [lease expiration time = 12:52:48.031 = start of lease on n4]

As long as n5 is the leader, n6 will be able to apply this lease, but it can't upgrade it to an epoch lease. That's probably why n4 is able to acquire it later.

Furthermore, this is 5 seconds after we blackholed n5 -- how can n5 even commit this lease when it can't get acks from the other replicas? Maybe we return the ack as a response on the outbound connection; I'll have to check.

Anyway, when n5 applies the lease, it should also transfer Raft leadership to n6:

https://github.com/cockroachdb/cockroach/blob/d3afb8b26f9d09adda30b5c9ef1a3ae799139cf9/pkg/kv/kvserver/replica_proposal.go#L410-L413

So at this point, n5 should know about n6's new lease, and it should have transferred leadership to n6. How does it continue to believe that it's both a leader and leaseholder?

erikgrinaker commented 1 year ago

The thing I can't square here is that when n5 transfers the lease to n6, it must have applied that lease locally.

Pavel points out that this only holds if n5 was the Raft leader at the time. It probably was, but if it wasn't, then it's possible for it to make a lease transfer but never apply the new lease.

erikgrinaker commented 1 year ago

Pavel also points out that if n5 was the leader, it would be able to replicate the lease to n4/n6, but it wouldn't be able to commit/apply it because it didn't receive acks. However, if there was a later leader change then the new leader could commit and apply that lease, in which case we have overlapping leases.

This sounds plausible, we should try to confirm this.

pav-kv commented 1 year ago

The proposed expiration-based lease contains out-of-order timestamps (expiration < start). Not sure why this can happen.

This is a bug. We should find out why this happened.

I230311 02:39:24.499096 522154 13@kv/kvserver/replicate_queue.go:819 ⋮ [T1,n5,replicate,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›] 5074  error processing replica: [n5,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›]: unable to transfer lease to s4: cannot replace lease repl=(n4,s4):3 seq=17 start=1678502327.939837965,0 epo=2 pro=1678502327.939837965,0 with repl=(n4,s4):3 seq=0 start=1678502364.498227325,0 exp=1678502312.943378934,0 pro=1678502306.943378934,0: ‹illegal lease: epoch=0, interval=[1678502364.498227325,0, 1678502312.943378934,0)›‹›

The start timestamp is 02:39:24.498, which is 1ms before the timestamp of this log message. The expiration time is 02:38:32.943, which is ~52s before that.

I think the start timestamp was generated during the TransferLease eval, while the Expiration is set further up the stack.

I can only see two reasonable places where Expiration is set:

If I understand correctly, the InitOrJoinRequest flow is before eval, so the Now used to set the Expiration timestamp can end up in the past by the time we evaluate.

Maybe something is blocking this request for ~1min before it gets evaluated. Like some latches.

pav-kv commented 1 year ago

This kinda checks out:

I230311 02:38:26.943329 522154 13@kv/kvserver/replicate_queue.go:1996 ⋮ [T1,n5,replicate,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›] 5071  transferring lease to s4
W230311 02:38:41.944254 522155 kv/kvserver/spanlatch/manager.go:559 ⋮ [T1,n5,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›] 1395  have been waiting 15s to acquire ‹write› latch ‹/{Local-Meta1/}@0,0›, held by ‹write› latch ‹/Local/RangeID/828/u{/RaftLog-"rftm"}@0,0›
I230311 02:39:24.499096 522154 13@kv/kvserver/replicate_queue.go:819 ⋮ [T1,n5,replicate,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›] 5074  error processing replica: [n5,s5,r828/1:‹/Table/106/1/6{18269…-20112…}›]: unable to transfer lease to s4: cannot replace lease repl=(n4,s4):3 seq=17 start=1678502327.939837965,0 epo=2 pro=1678502327.939837965,0 with repl=(n4,s4):3 seq=0 start=1678502364.498227325,0 exp=1678502312.943378934,0 pro=1678502306.943378934,0: ‹illegal lease: epoch=0, interval=[1678502364.498227325,0, 1678502312.943378934,0)›‹›

We see that we started transferring the lease 58s earlier. In the meantime there has been some latch contention, so only after almost a minute did we actually get to evaluate the request (because it has to latch the entire range). By then, the Expiration was already (58s - 6s = 52s) in the past.
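
To make the mechanism concrete, here is a small sketch using the timestamps from the log above (hypothetical code, not the actual TransferLease path; the 6s lease duration is inferred from the pro/exp difference in the error): the Expiration is stamped from the clock before evaluation, the Start at evaluation time, so a ~58s wait on latches yields an inverted lease interval that is rejected as illegal.

```go
// Hypothetical illustration of the failure mode, not CockroachDB code.
package main

import (
	"fmt"
	"time"
)

type lease struct{ start, expiration time.Time }

func main() {
	const leaseDuration = 6 * time.Second // assumed expiration-based lease duration

	// 02:38:26.943: transfer initiated; Expiration chosen up the stack.
	proposed := time.Date(2023, 3, 11, 2, 38, 26, 943_000_000, time.UTC)
	expiration := proposed.Add(leaseDuration) // 02:38:32.943

	// 02:39:24.498: evaluation finally runs after ~58s of latch contention.
	evalTime := time.Date(2023, 3, 11, 2, 39, 24, 498_000_000, time.UTC)
	l := lease{start: evalTime, expiration: expiration}

	if l.expiration.Before(l.start) {
		fmt.Printf("illegal lease: interval=[%s, %s)\n",
			l.start.Format("15:04:05.000"), l.expiration.Format("15:04:05.000"))
	}
}
```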

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

pav-kv commented 1 year ago

I ran roachstress on this test with a bunch of added logging, and caught a repro. Here are some key events with comments:

...
12:03:02 failover.go:288: test status: failing n4 (blackhole-recv)
12:03:59 failover.go:297: test status: recovering n4 (blackhole-recv)
12:05:02 failover.go:288: test status: failing n5 (blackhole-recv)
12:05:59 failover.go:297: test status: recovering n5 (blackhole-recv)
12:06:59 failover.go:288: test status: failing n6 (blackhole-recv)
// from here until the panic, n6 can send traffic, but can not receive

// r104 is the range that will have a closed timestamp regression

// n6 initiates a lease transfer for r104 from n6 to n5
// new lease starts at 12:07:03.058, expires 12:07:09.058
// RevokeLease bumps minLeaseProposedTS on n6 to 12:07:03.058
12:07:03.058379 [n6,replicate,r104/2] transferring lease to s5
12:07:03.058584 [n6,r104/2] TransferLease:
    prev repl=(n6,s6):2 seq=10 start=1682510616.148561207,0 epo=3 pro=1682510616.148561207,0
    pending repl=(n5,s5):1 seq=0 start=1682510823.058404636,0 exp=1682510829.058404636,0 pro=1682510823.058404636,0
    args.PrevLease repl=(n6,s6):2 seq=10 start=1682510616.148561207,0 epo=3 pro=1682510616.148561207,0
    args.Lease repl=(n5,s5):1 seq=0 start=1682510823.058404636,0 exp=1682510829.058404636,0 pro=1682510823.058404636,0
12:07:03.058674 [n6,r104/2] TransferLease: transferring lease while there is a pending one
12:07:03.058692 [n6,r104/2] RevokeLease lease at seq 10, current lease: repl=(n6,s6):2 seq=10 start=1682510616.148561207,0 epo=3 pro=1682510616.148561207,0
12:07:03.058720 [n6,r104/2] RevokeLease bumping minLeaseProposedTS to 1682510823.058720514,0

// shortly after, n6 removes range 104 from tracking
// probably it realised the lease was revoked
12:07:03.121974 [n4] received side-transport update: ‹Seq num: 5493, sending node: n6, snapshot: false, size: 40 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510820.120716680,0 (3.001s ago)
+‹Removed (1 ranges): r104›
currently tracked: 104:{lai:‹923› policy:‹0›}

// n5 side-transport started tracking range 104, so it probably took the lease
// It took a while though, the closed timestamp 12:07:17.332 is past the initial
// n5 lease expiration time. Has it been refreshed/upgraded to epoch-based?
12:07:20.332592 [n4] received side-transport update: ‹Seq num: 5594, sending node: n5, snapshot: false, size: 44 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510837.332201534,0 (3s ago)
+‹Added or updated (1 ranges): (<range>:<LAI>) 104:924›

// 19s after lease transfer to n5,
// the first offending update comes from n5 (closedTS = 12:07:19.340)
12:07:22.340471 [n4] received side-transport update: ‹Seq num: 5604, sending node: n5, snapshot: false, size: 37 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510839.340090138,0 (3s ago)
currently tracked: 104:{lai:‹924› policy:‹0›}

// 23s after the lease transfer, n6 thinks it was unable to transfer the lease
12:07:26.252215 [n6,replicate,r104/2] unable to transfer lease to s5: refusing to transfer lease to (n5,s5):1 because target may need a Raft snapshot: ‹replica in StateProbe›

// but we know that n5 still holds it
12:07:26.754418 [n4] received side-transport update: ‹Seq num: 5626, sending node: n5, snapshot: false, size: 37 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510843.754049138,0 (3s ago), LEAD_FOR_GLOBAL_READS:1682510847.554049138,0? (799ms in the future)›
currently tracked: 104:{lai:‹924› policy:‹0›}

// in the meantime, n6 starts sending updates for r104 again, for some reason
//   it's in the "Added or updated" section, so this is the first update after the
//   Removed that we've seen above
// the timestamp is 12:07:23.899, which is already beyond the lease transfer timestamp
// and minLeaseProposedTS
12:07:26.900751 [n4] received side-transport update: ‹Seq num: 5611, sending node: n6, snapshot: false, size: 44 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510843.899879407,0 (3s ago)
+‹Added or updated (1 ranges): (<range>:<LAI>) 104:923›

// another update from n6, this is the second offending update
// it will trigger closed timestamp regression because the next update removes
// r104 from the tracker, and triggers ForwardSideTransportClosedTimestampForRange
12:07:27.102091 [n4] received side-transport update: ‹Seq num: 5612, sending node: n6, snapshot: false, size: 37 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510844.101182111,0 (3s ago)
+ added/removed 0
Currently tracked: {… 104:{lai:‹923› policy:‹0›} …}

// another update from n5
12:07:27.156720 [n4] received side-transport update: ‹Seq num: 5628, sending node: n5, snapshot: false, size: 37 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510844.156339675,0 (3s ago)
+ added/removed 0
Currently tracked: {… 104:{lai:‹924› policy:‹0›} …}

// n6 finally realised that it no longer holds the lease
// and removes r104 from tracking
// this triggers ForwardSideTransportClosedTimestampForRange
12:07:27.303593 [n4] received side-transport update: ‹Seq num: 5613, sending node: n6, snapshot: false, size: 1254 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510844.302404526,0 (3.001s ago)
+‹Added or updated (0 ranges): (<range>:<LAI>) ›
+‹Removed (630 ranges): …, r104, …
Currently tracked: {… 104:{lai:‹923› policy:‹0›} …}

12:07:27.305037 [n4] the server is terminating due to a fatal error (see the DEV channel for details)
12:07:27.321043 [n4] side-transport update saw closed timestamp regression on r104: (lai=924, ts=1682510839.340090138,0) -> (lai=923, ts=1682510844.101182111,0)

So we clearly have a double-leaseholder situation (at least the side-transport is confused enough to get into such a state).

The question is: why does n6 think it's the leaseholder again at 12:07:26.900751? This comes very shortly after the 12:07:26.252215 message in which it (wrongly) realised it couldn't transfer the lease.

This comes down to answering why BumpSideTransportClosed returned OK: https://github.com/cockroachdb/cockroach/blob/793b4c8bfc6e9dbf7a05289f3e61035796476cc5/pkg/kv/kvserver/replica_closedts.go#L37

Could a liveness update for n6 bring the lease back to a "valid" state even though n6 has already revoked its lease?

pav-kv commented 1 year ago

Could a liveness update for n6 bring the lease back to a "valid" state even though n6 has already revoked its lease?

We have a hypothesis.

The closed timestamp side-transport sender bumps the closed timestamp if the lease status is in the stasis period. However, in the code returning the lease status, the stasis check takes precedence over the revocation check. So, when now is [briefly] in stasis, the side-transport will treat the lease as valid/active and will update the closed timestamp.

The remaining question is: how did n6 end up in stasis? Only the incoming traffic is blackholed, so it should still be able to send liveness heartbeats, and its local liveness cache should think that it's live.

The answer to that is in the logs. Shortly before the failure, we see a liveness heartbeat timeout:

12:07:27.253517 [n6,liveness-hb] slow heartbeat took 3.001229644s; err=result is ambiguous: context done during DistSender.Send: dial: context deadline exceeded
12:07:27.253605 [n6,liveness-hb] failed node liveness heartbeat: ‹operation "node liveness heartbeat" timed out after 3.001s (given timeout 3s)›: result is ambiguous: context done during DistSender.Send: dial: context deadline exceeded

So here is the sequence:

pav-kv commented 1 year ago

The easy fix would be to swap the order of the checks in the lease status function.
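
For illustration, a much-simplified sketch of the suspected ordering problem and the proposed fix (hypothetical code and state names; the real logic lives in the lease status computation, and the stasis window here is modeled off a plain expiration for brevity):

```go
// Hypothetical, much-simplified model of the check ordering; not the actual
// CockroachDB lease status code. The point: if stasis is reported before the
// revocation (minLeaseProposedTS) check, a lease that was already revoked for
// a transfer can still look usable to the side-transport for a brief window.
package main

import (
	"fmt"
	"time"
)

type leaseState string

type leaseInfo struct {
	proposedTS         time.Time
	minLeaseProposedTS time.Time     // bumped by RevokeLease during a transfer
	expiration         time.Time     // stand-in for the liveness/lease expiration
	stasis             time.Duration // window just before expiration
}

// buggyStatus mirrors the problematic order: stasis wins over revocation.
func buggyStatus(l leaseInfo, now time.Time) leaseState {
	switch {
	case now.After(l.expiration):
		return "EXPIRED"
	case now.After(l.expiration.Add(-l.stasis)):
		return "STASIS" // side-transport keeps treating the lease as ours
	case l.proposedTS.Before(l.minLeaseProposedTS):
		return "PROSCRIBED" // revoked
	default:
		return "VALID"
	}
}

// fixedStatus applies the suggested fix: consider revocation first.
func fixedStatus(l leaseInfo, now time.Time) leaseState {
	if l.proposedTS.Before(l.minLeaseProposedTS) {
		return "PROSCRIBED"
	}
	return buggyStatus(l, now)
}

func main() {
	now := time.Now()
	l := leaseInfo{
		proposedTS:         now.Add(-10 * time.Second),
		minLeaseProposedTS: now.Add(-5 * time.Second), // the lease was revoked 5s ago
		expiration:         now.Add(100 * time.Millisecond),
		stasis:             250 * time.Millisecond,
	}
	fmt.Println(buggyStatus(l, now), fixedStatus(l, now)) // STASIS PROSCRIBED
}
```

With the checks swapped, a revoked lease is reported as proscribed even while the clock sits in the stasis window, so the side-transport would stop closing timestamps on behalf of the old leaseholder.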

erikgrinaker commented 1 year ago

Excellent sleuthing! Still one mystery here.

// n5 side-transport started tracking range 104, so it probably took the lease
// It took a while though, the closed timestamp 12:07:17.332 is past the initial
// n5 lease expiration time. Has it been refreshed/upgraded to epoch-based?
12:07:20.332592 [n4] received side-transport update: ‹Seq num: 5594, sending node: n5, snapshot: false, size: 44 bytes, closed timestamps: LAG_BY_CLUSTER_SETTING:1682510837.332201534,0 (3s ago)
+‹Added or updated (1 ranges): (<range>:<LAI>) 104:924›

The hypothesis is that n6 replicated the lease transfer to n5+n4, but could not commit and apply it because it did not receive acks. We know that n6 must have been the Raft leader at the time of the transfer (or at least thought it was), because we check that the target is up-to-date, and only the leader can check this.

However, in order for n5 to actually apply the lease, n4 or n5 would have to take over Raft leadership and commit the lease. How did this happen? Did n6 just by chance happen to miss a Raft heartbeat? Did we incorrectly/improperly quiesce because of the now-reverted logic in https://github.com/cockroachdb/cockroach/pull/101899, such that the leader stopped heartbeating but the followers hadn't quiesced yet? In any case, the delay in applying the lease is probably because of this -- we had to lose our current Raft leader and elect a new one first.

I suppose it's plausible that we missed Raft heartbeats considering we also missed a liveness heartbeat. Do we know why these are failing?

erikgrinaker commented 1 year ago

(to be clear, I don't think we should spend a lot of time digging into the Raft leadership issue, because that's necessarily what must have happened -- I'm just curious)