This one is pretty scary and reproduces very easily in GCE with:
make stress STRESSFLAGS='-p 100 -stderr -maxfails 1' PKG=./storage TESTS=TestReplicateRestartAfterTruncationWithRemoveAndReAdd
611 runs so far, 0 failures, over 5s
1167 runs so far, 0 failures, over 10s
1699 runs so far, 0 failures, over 15s
I161004 19:52:15.537426 1 rand.go:76 Random seed: -5025832341681872018
I161004 19:52:15.609866 25 storage/engine/rocksdb.go:360 opening in memory rocksdb instance
I161004 19:52:15.611986 25 gossip/gossip.go:234 initial resolvers: []
W161004 19:52:15.612017 25 gossip/gossip.go:1045 no resolvers found; use --join to specify a connected node
I161004 19:52:15.612102 25 gossip/gossip.go:268 NodeID set to 1
I161004 19:52:15.620763 25 gossip/gossip.go:268 NodeID set to 1
I161004 19:52:15.622814 25 storage/store.go:1040 [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161004 19:52:15.622997 25 gossip/gossip.go:282 NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:45150" > attrs:<>
I161004 19:52:15.625251 101 storage/replica_trigger.go:252 [s1,r1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=1970-01-01 00:00:00 +0000 UTC]
I161004 19:52:15.625903 25 storage/engine/rocksdb.go:360 opening in memory rocksdb instance
I161004 19:52:15.626877 25 gossip/gossip.go:234 initial resolvers: [127.0.0.1:45150]
W161004 19:52:15.626922 25 gossip/gossip.go:1047 no incoming or outgoing connections
I161004 19:52:15.626989 25 gossip/gossip.go:268 NodeID set to 2
I161004 19:52:15.627852 25 gossip/gossip.go:268 NodeID set to 2
I161004 19:52:15.628093 25 storage/store.go:1040 [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I161004 19:52:15.628145 25 gossip/gossip.go:282 NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:34704" > attrs:<>
I161004 19:52:15.773880 25 storage/engine/rocksdb.go:360 opening in memory rocksdb instance
I161004 19:52:15.774910 25 gossip/gossip.go:234 initial resolvers: [127.0.0.1:45150]
W161004 19:52:15.797089 25 gossip/gossip.go:1047 no incoming or outgoing connections
I161004 19:52:15.797149 25 gossip/gossip.go:268 NodeID set to 3
I161004 19:52:15.798368 25 gossip/gossip.go:268 NodeID set to 3
I161004 19:52:15.798687 25 storage/store.go:1040 [n3,s3]: failed initial metrics computation: [n3,s3]: system config not yet available
I161004 19:52:15.798739 25 gossip/gossip.go:282 NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:45056" > attrs:<>
I161004 19:52:15.798945 56 gossip/client.go:125 node 2: started gossip client to 127.0.0.1:45150
I161004 19:52:15.799759 267 gossip/client.go:125 node 3: started gossip client to 127.0.0.1:45150
I161004 19:52:15.802885 25 storage/replica_raftstorage.go:443 [s1,r1:/M{in-ax}] generated snapshot a424e808 for range 1 at index 12 in 63.962µs.
I161004 19:52:15.810006 25 storage/store.go:2747 streamed snapshot: kv pairs: 31, log entries: 2
I161004 19:52:15.814200 86 storage/replica_raftstorage.go:573 [s2] [n2,s2,r1:{-}]: with replicaID [?], applying preemptive snapshot at index 12 (id=a424e808, encoded size=16, 1 rocksdb batches, 2 log entries)
I161004 19:52:15.814911 86 storage/replica_raftstorage.go:576 [s2] [n2,s2,r1:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.001s
I161004 19:52:15.821376 25 storage/replica_command.go:3236 change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2
I161004 19:52:15.823488 434 storage/replica.go:1712 [s1,r1:/M{in-ax}] proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2}]
I161004 19:52:15.864750 87 storage/raft_transport.go:418 raft transport stream to node 1 established
I161004 19:52:15.865283 25 storage/replica_raftstorage.go:443 [s1,r1:/M{in-ax}] generated snapshot 73e739d1 for range 1 at index 14 in 164.209µs.
I161004 19:52:15.866326 25 storage/store.go:2747 streamed snapshot: kv pairs: 34, log entries: 4
I161004 19:52:15.866846 451 storage/replica_raftstorage.go:573 [s3] [n3,s3,r1:{-}]: with replicaID [?], applying preemptive snapshot at index 14 (id=73e739d1, encoded size=16, 1 rocksdb batches, 4 log entries)
I161004 19:52:15.868005 451 storage/replica_raftstorage.go:576 [s3] [n3,s3,r1:/M{in-ax}]: with replicaID [?], applied preemptive snapshot in 0.001s
I161004 19:52:15.874809 25 storage/replica_command.go:3236 change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3
I161004 19:52:15.903020 453 storage/replica.go:1712 [s1,r1:/M{in-ax}] proposing ADD_REPLICA {NodeID:3 StoreID:3 ReplicaID:3} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161004 19:52:15.924549 482 storage/raft_transport.go:418 raft transport stream to node 1 established
E161004 19:52:15.925325 25 storage/client_test.go:950 engine 1: missing key "a"
E161004 19:52:15.925368 25 storage/client_test.go:950 engine 2: missing key "a"
I161004 19:52:15.928355 25 util/stop/stopper.go:353 stop has been called, stopping or quiescing all running tasks
I161004 19:52:15.929785 25 storage/replica_command.go:3236 change replicas: read existing descriptor range_id:1 start_key:"" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:3 store_id:3 replica_id:3 > next_replica_id:4
W161004 19:52:15.931219 410 storage/store.go:2636 [s1] got error from range 1, replica {2 2 2}: storage/raft_transport.go:247: unable to accept Raft message from {NodeID:1 StoreID:1 ReplicaID:1}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}
W161004 19:52:15.931534 410 storage/store.go:2636 [s1] got error from range 1, replica {2 2 2}: storage/raft_transport.go:247: unable to accept Raft message from {NodeID:1 StoreID:1 ReplicaID:1}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}
I161004 19:52:15.936671 488 storage/replica.go:1712 [s1,r1:/M{in-ax}] proposing REMOVE_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:3 StoreID:3 ReplicaID:3}]
W161004 19:52:15.939282 410 storage/store.go:2636 [s1] got error from range 1, replica {2 2 2}: storage/raft_transport.go:247: unable to accept Raft message from {NodeID:1 StoreID:1 ReplicaID:1}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}
W161004 19:52:15.939335 410 storage/store.go:2636 [s1] got error from range 1, replica {2 2 2}: storage/raft_transport.go:247: unable to accept Raft message from {NodeID:1 StoreID:1 ReplicaID:1}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}
I161004 19:52:15.944480 25 gossip/gossip.go:268 NodeID set to 2
I161004 19:52:15.954336 25 storage/replica_raftstorage.go:443 [s1,r1:/M{in-ax}] generated snapshot a1061e88 for range 1 at index 22 in 54.207µs.
I161004 19:52:15.958691 25 storage/store.go:2747 streamed snapshot: kv pairs: 42, log entries: 3
I161004 19:52:15.958991 595 storage/replica_raftstorage.go:573 [s2] [n2,s2,r1:/M{in-ax}]: with replicaID 2, applying Raft snapshot at index 22 (id=a1061e88, encoded size=16, 1 rocksdb batches, 3 log entries)
E161004 19:52:15.971105 610 raft/raft.go:1091 [s2,r1:/M{in-ax}] 2 state.commit 22 is out of range [10, 17]
panic: 2 state.commit 22 is out of range [10, 17] [recovered]
panic: 2 state.commit 22 is out of range [10, 17]
goroutine 610 [running]:
panic(0x1615840, 0xc420d0cb70)
/home/tamird/go/src/runtime/panic.go:500 +0x1a1
github.com/cockroachdb/cockroach/util/stop.(*Stopper).Recover(0xc4205c2630)
/home/tamird/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:180 +0x6e
panic(0x1615840, 0xc420d0cb70)
/home/tamird/go/src/runtime/panic.go:458 +0x243
github.com/cockroachdb/cockroach/storage.(*raftLogger).Panicf(0xc420d860e0, 0x180283d, 0x2b, 0xc420cfab00, 0x4, 0x4)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/raft.go:111 +0x107
github.com/coreos/etcd/raft.(*raft).loadState(0xc420346b40, 0x6, 0x0, 0x16, 0x0, 0x0, 0x0)
/home/tamird/src/github.com/coreos/etcd/raft/raft.go:1091 +0x1db
github.com/coreos/etcd/raft.newRaft(0xc420de1498, 0x40)
/home/tamird/src/github.com/coreos/etcd/raft/raft.go:289 +0xc9b
github.com/coreos/etcd/raft.NewRawNode(0xc420de1498, 0x0, 0x0, 0x0, 0xc420d90000, 0x16, 0x0)
/home/tamird/src/github.com/coreos/etcd/raft/rawnode.go:79 +0x71
github.com/cockroachdb/cockroach/storage.(*Replica).withRaftGroupLocked(0xc42009e700, 0x0, 0xc420de17e0, 0x0, 0x0)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/replica.go:414 +0x28a
github.com/cockroachdb/cockroach/storage.defaultProposeRaftCommandLocked(0xc42009e700, 0xc420dae000, 0x0, 0x0)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/replica.go:1747 +0x1ed
github.com/cockroachdb/cockroach/storage.(*Replica).proposePendingCmdLocked(0xc42009e700, 0xc420dae000, 0xc420d920c0, 0xc420d860c0)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/replica.go:1686 +0x77
github.com/cockroachdb/cockroach/storage.(*Replica).proposeRaftCommand(0xc42009e700, 0x7f961d71c000, 0xc420d920c0, 0x311cdad160001, 0xe, 0x0, 0x0, 0x1, 0x0, 0x0, ...)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/replica.go:1665 +0x1f6
github.com/cockroachdb/cockroach/storage.(*Replica).addWriteCmd(0xc42009e700, 0x7f961d71c000, 0xc420d920c0, 0x311cdad160001, 0xe, 0x0, 0x0, 0x1, 0x0, 0x0, ...)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/replica.go:1551 +0x17a
github.com/cockroachdb/cockroach/storage.(*Replica).Send(0xc42009e700, 0x7f961d71c000, 0xc420d920c0, 0x311cdad160001, 0xe, 0x0, 0x0, 0x1, 0x0, 0x0, ...)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/replica.go:1075 +0x277
github.com/cockroachdb/cockroach/storage.(*pendingLeaseRequest).InitOrJoinRequest.func1(0x7f961d71c000, 0xc420b95f50)
/home/tamird/src/github.com/cockroachdb/cockroach/storage/replica_range_lease.go:127 +0x174
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask.func1(0xc4205c2630, 0x1f100b8, 0x1e, 0x91, 0x19866d8, 0xc420d92000, 0x7f961d71c000, 0xc420b95f50)
/home/tamird/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:239 +0xc6
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask
/home/tamird/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:240 +0x227
ERROR: exit status 2
2146 runs completed, 1 failures, over 19s
FAIL
Makefile:146: recipe for target 'stress' failed
make: *** [stress] Error 1
Assigning @petermattis as the newest raft whisperer.
Can you try to reproduce on top of my RocksDB 4.11.2 PR? This is exactly the error we were seeing on RocksDB 4.9.
Reproduces just as easily on 43e2a73764c82285850efcbe30f2b5793288a889.
Thanks. Might be a locking issue then. Fun.
I reproduced this on my laptop, though it took a few minutes. Exact same error message. Time to turn on some additional logging.
Turning on some additional logging I see:
I161004 19:19:39.996002 8 storage/replica_raftstorage.go:443 [s1,r1:/M{in-ax}] generated snapshot 1735fb5a for range 1 at index 22 in 40.266µs.
I161004 19:19:39.996509 8 storage/store.go:2735 streamed snapshot: kv pairs: 42, log entries: 3
I161004 19:19:39.996638 278 raft/raft.go:542 [s2,r1:/M{in-ax}] ffffffffffffffff became follower at term 6
I161004 19:19:39.996737 278 raft/raft.go:302 [s2,r1:/M{in-ax}] newRaft ffffffffffffffff [peers: [1,2,3], term: 6, commit: 17, applied: 17, lastindex: 17, lastterm: 6]
I161004 19:19:39.996764 278 raft/raft.go:1026 [s2,r1:/M{in-ax}] ffffffffffffffff [commit: 17, lastindex: 17, lastterm: 6] starts to restore snapshot [index: 22, term: 6]
I161004 19:19:39.996779 278 raft/log.go:288 [s2,r1:/M{in-ax}] log [committed=17, applied=17, unstable.offset=18, len(unstable.Entries)=0] starts to restore snapshot [index: 22, term: 6]
I161004 19:19:39.996798 278 raft/raft.go:1036 [s2,r1:/M{in-ax}] ffffffffffffffff restored progress of 1 [next = 23, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I161004 19:19:39.996821 278 raft/raft.go:1036 [s2,r1:/M{in-ax}] ffffffffffffffff restored progress of 3 [next = 23, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I161004 19:19:39.996836 278 raft/raft.go:1003 [s2,r1:/M{in-ax}] ffffffffffffffff [commit: 22] restored snapshot [index: 22, term: 6]
I161004 19:19:39.996855 278 storage/replica_raftstorage.go:573 [s2] [n2,s2,r1:/M{in-ax}]: with replicaID 2, applying Raft snapshot at index 22 (id=1735fb5a, encoded size=16, 1 rocksdb batches, 3 log entries)
E161004 19:19:39.997819 312 raft/raft.go:1091 [s2,r1:/M{in-ax}] 2 state.commit 22 is out of range [10, 17]
This is occurring when the range is being re-added to store 2. Before being removed, the range had committed up to log index 17. I'm mildly surprised that when the range is being re-added it sees the previous last index of 17. Could the range not have been cleaned up properly? We then apply the snapshot generated at commit index 22 and the merger of that snapshot with the previous raft state triggers an assertion.
Hmm, that exact same set of log messages happens on successful runs too. So my theory is whack.
Are we missing necessary locking of Replica.raftMu? Seems to me that Replica.withRaftGroupLocked can read raft state when calling raft.NewRawNode, and there are various callers of withRaftGroupLocked that don't hold Replica.raftMu. In particular, Replica.Send -> Replica.addWriteCmd -> Replica.proposeRaftCommand.
diff --git a/storage/replica.go b/storage/replica.go
index ae4fdb9..2baae8c 100644
--- a/storage/replica.go
+++ b/storage/replica.go
@@ -1650,6 +1650,8 @@ func makeIDKey() storagebase.CmdIDKey {
func (r *Replica) proposeRaftCommand(
ctx context.Context, ba roachpb.BatchRequest,
) (chan roachpb.ResponseWithError, func() bool, error) {
+ r.raftMu.Lock()
+ defer r.raftMu.Unlock()
r.mu.Lock()
defer r.mu.Unlock()
if r.mu.destroyed != nil {
Maybe the above isn't necessary. It's curious that we're trying to acquire the range lease on a replica that we're concurrently applying a snapshot to. What is trying to acquire a range lease when the replica is uninitialized?
Ok, so the range isn't uninitialized. The test is unreplicating a range from a node, truncating the range log to force a snapshot, re-adding the range, and expecting it to catch up. While the removed range is being re-added via a snapshot, I sometimes see a concurrent range lease request.
The range lease request results in the following code path: Replica.Send -> Replica.addWriteCmd -> Replica.proposeRaftCommand -> Replica.proposePendingCmdLocked -> defaultProposeRaftCommandLocked -> Replica.withRaftGroupLocked -> raft.NewRawNode.
The snapshot being applied is a "Raft" snapshot which means it is being applied via Replica.handleRaftReady. This goroutine holds Replica.raftMu. In Replica.applySnapshot we see the code:
if err := batch.Commit(); err != nil {
    return err
}
log.Infof(r.ctx, "committed snapshot batch")
r.mu.Lock()
log.Infof(r.ctx, "updating state after snapshot")
// We set the persisted last index to the last applied index. This is
// not a correctness issue, but means that we may have just transferred
// some entries we're about to re-request from the leader and overwrite.
// However, raft.MultiNode currently expects this behaviour, and the
// performance implications are not likely to be drastic. If our
// feelings about this ever change, we can add a LastIndex field to
// raftpb.SnapshotMetadata.
r.mu.lastIndex = s.RaftAppliedIndex
r.mu.raftLogSize = raftLogSize
// Update the range and store stats.
r.store.metrics.subtractMVCCStats(r.mu.state.Stats)
r.store.metrics.addMVCCStats(s.Stats)
r.mu.state = s
r.assertStateLocked(r.store.Engine())
r.mu.Unlock()
Notice that we commit the RocksDB batch and then update our in-memory state. Inside Replica.withRaftGroupLocked we create the new raft.RawNode:
raftGroup, err := raft.NewRawNode(newRaftConfig(
    raft.Storage(r),
    uint64(r.mu.replicaID),
    r.mu.state.RaftAppliedIndex,
    r.store.cfg,
    &raftLogger{ctx: r.ctx},
), nil)
I added instrumentation showing that when the Raft assertion fires, we've committed the RocksDB batch but not yet updated the in-memory state. So the on-disk "hard state" indicates we've committed up to index 22, but the in-memory state says our applied index is 17, which is invalid according to Raft. If we try to create a raft.RawNode at this point, we go boom.
Grabbing Replica.raftMu in Replica.proposeRaftCommand fixes the issue. I think so would using the on-disk applied index when creating the raft.RawNode.
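For reference, the assertion that fires is essentially a bounds check on the persisted commit index against the log view the new RawNode is constructed from. A minimal sketch of that check (a paraphrase of the upstream behaviour, not the actual etcd/raft code; the numbers in main are the ones from the panic above):

package main

import "fmt"

// checkHardStateCommit paraphrases the invariant etcd/raft enforces when it
// loads a persisted HardState into a new raft instance: the persisted commit
// index must lie within [committed, lastIndex] of the log view the instance
// was constructed with.
func checkHardStateCommit(commit, committed, lastIndex uint64) error {
    if commit < committed || commit > lastIndex {
        return fmt.Errorf("state.commit %d is out of range [%d, %d]",
            commit, committed, lastIndex)
    }
    return nil
}

func main() {
    // The numbers from the panic above: the on-disk HardState says commit=22,
    // while the log view the new RawNode was built from reports bounds [10, 17].
    fmt.Println(checkHardStateCommit(22, 10, 17))
}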
I need to track down why the range lease is being requested on the node with the removed range. That would cause spurious elections since the Raft group is dormant on that node.
We're trying to acquire the lease on the removed range due to the periodic gossiping of the first range descriptor.
Using the on-disk applied index when creating a raft.RawNode doesn't fix the test. Haven't yet figured out why.
@tschottdorf I'd appreciate your eyeballs on this. The change I'm proposing is:
diff --git a/storage/replica.go b/storage/replica.go
index ae4fdb9..172b685 100644
--- a/storage/replica.go
+++ b/storage/replica.go
@@ -1650,6 +1650,10 @@ func makeIDKey() storagebase.CmdIDKey {
func (r *Replica) proposeRaftCommand(
ctx context.Context, ba roachpb.BatchRequest,
) (chan roachpb.ResponseWithError, func() bool, error) {
+ // We need to grab the raft mutex here in case the Raft group needs to
+ // be initialized.
+ r.raftMu.Lock()
+ defer r.raftMu.Unlock()
r.mu.Lock()
defer r.mu.Unlock()
if r.mu.destroyed != nil {
As far as I can tell, this has always been a gap in our raft locking story.
That change is pretty bad concurrency-wise, right? Isn't it enough to acquire raftMu in withRaftGroupLocked in the branch in which we actually create the Raft group (or wherever, so that we always hold it when calling NewRawNode)? Agreed that this was always broken (at least since we're doing lazy init).
Using the on-disk applied index when creating a raft.RawNode doesn't fix the test. Haven't yet figured out why.
Sounds like that would still be racy if the Raft group is created before the Commit, as then its in-mem state doesn't match the on-disk state.
That change is pretty bad concurrency-wise, right?
Well, it's not good, but I'm not seeing a significant perf impact to doing so.
Isn't it enough to acquire raftMu in withRaftGroupLocked in the branch in which we actually create the Raft group (or wherever, so that we always hold it when calling NewRawNode)?
Unfortunately we can't without violating our lock ordering: Replica.raftMu > Replica.mu and we already hold Replica.mu in withRaftGroupLocked.
Sounds like that would still be racy if the Raft group is created before the Commit, as then its in-mem state doesn't match the on-disk state.
Perhaps we could commit the snapshot batch while holding Replica.mu, though that changes Replica.mu to be held across a potentially long running operation.
Perhaps we could commit the snapshot batch while holding Replica.mu, though that changes Replica.mu to be held across a potentially long running operation.
This seems to work:
diff --git a/storage/replica_raftstorage.go b/storage/replica_raftstorage.go
index 6e2435a..a624e30 100644
--- a/storage/replica_raftstorage.go
+++ b/storage/replica_raftstorage.go
@@ -652,11 +652,13 @@ func (r *Replica) applySnapshot(
r, s.RaftAppliedIndex, snap.Metadata.Index)
}
+ r.mu.Lock()
+
if err := batch.Commit(); err != nil {
+ r.mu.Unlock()
return err
}
- r.mu.Lock()
// We set the persisted last index to the last applied index. This is
// not a correctness issue, but means that we may have just transferred
// some entries we're about to re-request from the leader and overwrite.
Unfortunately we can't without violating our lock ordering: Replica.raftMu > Replica.mu and we already hold Replica.mu in withRaftGroupLocked.
Ah, that's awkward. I'm a bit skeptical that holding r.mu across the commit is a good idea, seeing that we have various locations in which the store mutex is held while serially acquiring each Replica's mutex, but I have no quarrel with it otherwise, though it's hopefully less disastrous (#9479) now that #9668 is in.
Seems nontrivial to get a good solution to this. In an ideal world creating the Raft group would always be done holding raftMu.
I have to do an audit, but Replica.proposeRaftCommand might be the only call path where we can create a raft.RawNode without holding Replica.raftMu.
Audit done. The only caller of Replica.withRaftGroupLocked that does not hold Replica.raftMu is Replica.proposeRaftCommand. Somewhat surprisingly, locking Replica.raftMu in Replica.proposeRaftCommand has almost no effect on performance. A 5min block_writer run showed 1% lower queries/sec.
We could get tricky here and only lock Replica.raftMu in proposeRaftCommand if Replica.mu.internalRaftGroup == nil. Given the very tiny effect on performance, I don't think this is worthwhile.
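For what it's worth, that trick might look roughly like the sketch below. This is hypothetical (replica, proposeSketch, and the stand-in fields are illustrative, not code from the repository), and it assumes a created Raft group is not torn down concurrently:

package main

import "sync"

// replica is a simplified stand-in for storage.Replica, reduced to the two
// mutexes and the lazily created Raft group discussed in this thread.
type replica struct {
    raftMu sync.Mutex // also held while applying snapshots
    mu     struct {
        sync.Mutex
        internalRaftGroup *struct{} // stand-in for *raft.RawNode
    }
}

// proposeSketch only grabs raftMu when the Raft group may need to be created,
// since creating it (raft.NewRawNode) is what reads the on-disk Raft state.
func (r *replica) proposeSketch(propose func()) {
    r.mu.Lock()
    needsInit := r.mu.internalRaftGroup == nil
    r.mu.Unlock()

    if needsInit {
        // Serialize lazy group creation with snapshot application, which
        // holds raftMu while it commits the batch and updates in-memory state.
        r.raftMu.Lock()
        defer r.raftMu.Unlock()
    }

    r.mu.Lock()
    defer r.mu.Unlock()
    // NB: another goroutine may have created the group between the two
    // critical sections; this sketch assumes that case is benign.
    propose()
}

func main() {
    var r replica
    r.proposeSketch(func() {})
}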
Sounds good; the only question I have is whether we can rearrange things so that we don't reintroduce that same problem. In other words, we would have to rename to withRaftGroupRaftMuLockedReplicaMuLocked, and we must make sure that it is clear that calling withRaftGroup isn't allowed when holding either raftMu or mu; callers in that situation would have to use yet another wrapper that acquires the mutexes they don't already hold.
Do you see a better way (or are the existing callers amenable to the above)?
Replica.withRaftGroup is often called with Replica.raftMu held. Replica.withRaftGroupRaftMuLockedReplicaMuLocked is a bit of a mouthful for me.
I've added commentary, but it does feel insufficient. I wonder if we should add a debug mutex variant that allows us to assert that the current goroutine holds the lock (using github.com/petermattis/goid).
You are thinking something like repl.mu.AssertHeld()? Yeah, that works for me. Could put that into the deadlock build tag. You were thinking something like the below, right?
// !deadlock
type Mutex sync.Mutex

func (mu *Mutex) AssertHeld() {}

// deadlock
type Mutex struct {
    deadlock.Mutex
    heldBy atomic.Value // goroutine ID as int
}

func (mu *Mutex) AssertHeld() {
    if myID, heldID := goid.ID(), mu.heldBy.Load().(int); heldID != myID {
        // boom
    }
}

// ...
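As a usage sketch (hypothetical stand-in types; loosely following the non-deadlock variant above, but embedding sync.Mutex so Lock/Unlock carry over), the "Locked" naming convention then becomes an executable precondition:

package main

import "sync"

// Mutex is a no-op-assertion variant: AssertHeld does nothing in normal
// builds and would perform the real check under the deadlock build tag.
type Mutex struct{ sync.Mutex }

func (mu *Mutex) AssertHeld() {}

// replica is a stand-in with the two mutexes discussed in this thread.
type replica struct {
    raftMu Mutex
    mu     Mutex
}

// withRaftGroupLocked asserts its locking preconditions up front: both
// raftMu and mu must already be held by the caller.
func (r *replica) withRaftGroupLocked(f func() error) error {
    r.raftMu.AssertHeld()
    r.mu.AssertHeld()
    return f()
}

func main() {
    var r replica
    r.raftMu.Lock()
    r.mu.Lock()
    _ = r.withRaftGroupLocked(func() error { return nil })
    r.mu.Unlock()
    r.raftMu.Unlock()
}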
That SGTM. I still wish we could make the conventions clear in our code, so that we don't rely on that build tag as the primary way developers find out they weren't holding what they were supposed to hold when they made the call.
Yes, that was what I was thinking of.
Google's C++ Mutex.AssertHeld just checked whether the mutex was held by anyone, not necessarily the current thread. This was cheaper and allowed for some esoteric situations when you'd lock a mutex in one thread and rely on it in another thread. I don't think we'd need that flexibility now but before the current raft scheduler went in we had a similar situation with processRaftMu, so we might want to allow for this flexibility instead of using goid (and without goid, it's cheap enough that we could do it on the race build tag instead of deadlock).
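A minimal sketch of that cheaper check, assuming an atomic flag maintained by instrumented Lock/Unlock rather than goroutine IDs (hypothetical; just to illustrate why it could live under the race build tag):

package main

import (
    "sync"
    "sync/atomic"
)

// heldMutex sketches the cheaper assertion: AssertHeld only verifies that the
// mutex is held by someone, not that the current goroutine holds it, so no
// goroutine-ID lookup is needed.
type heldMutex struct {
    mu   sync.Mutex
    held int32 // 1 while locked; only maintained in the instrumented build
}

func (m *heldMutex) Lock() {
    m.mu.Lock()
    atomic.StoreInt32(&m.held, 1)
}

func (m *heldMutex) Unlock() {
    atomic.StoreInt32(&m.held, 0)
    m.mu.Unlock()
}

func (m *heldMutex) AssertHeld() {
    if atomic.LoadInt32(&m.held) == 0 {
        panic("mutex is not held")
    }
}

func main() {
    var m heldMutex
    m.Lock()
    m.AssertHeld()
    m.Unlock()
}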
Filed #9913 to capture the AssertHeld idea. Closing this issue as TestReplicateRestartAfterTruncationWithRemoveAndReAdd has been deflaked.
Binary: cockroach/static-tests.tar.gz sha: https://github.com/cockroachdb/cockroach/commits/087bda1d10e970964d6885ce7b505766d113cabe
Stress build found a failed test:
Run Details:
Please assign, take a look and update the issue accordingly.