
kv/kvserver: TestStoreRangeUpReplicate failed #117200

Open cockroach-teamcity opened 6 months ago

cockroach-teamcity commented 6 months ago

kv/kvserver.TestStoreRangeUpReplicate failed on master @ c316d6a615fa02a05357a20bf03b8a197ab27810:

```
=== RUN   TestStoreRangeUpReplicate
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestStoreRangeUpReplicate481718501
    test_log_scope.go:81: use -show-logs to present logs inline
    client_raft_test.go:2111: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_raft_test.go:2111
            Error:          Not equal: 
                            expected: 134
                            actual  : 135
            Test:           TestStoreRangeUpReplicate
    panic.go:541: -- test log scope end --
test logs left over in: outputs.zip/logTestStoreRangeUpReplicate481718501
--- FAIL: TestStoreRangeUpReplicate (250.64s)
```

Parameters:

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/replication


Jira issue: CRDB-35053

miraradeva commented 5 months ago

This reproduces easily on a GCE worker under race and stress. It fails in one of two ways:

  1. More replicas than expected (like the failure in this issue): expected: 134, actual: 135. In the logs, it's usually one system range (r3 or r4) that ends up with one more replica than expected. We expect 2 new replicas per range, but for some reason one system range got 3.
  2. Fewer replicas than expected: expected: 136, actual: 134. Each range (r1 through r67, inclusive) has 2 additional replicas, as expected, but by the time the test visits the nodes there are 68 ranges, so it expects 136; r68 appears to be the result of a later load-based split. Disabling load-based splitting didn't help. (See the arithmetic sketch after this list.)
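For concreteness, here is a minimal sketch of the arithmetic behind these expected/actual values (illustrative Go, not the actual test code; the helper name is made up):

```go
package main

import "fmt"

// expectedInitialSnapshots mirrors the test's expectation: every range that
// exists before up-replication starts with one replica and should gain
// exactly two more, each seeded by one initial (learner) snapshot.
func expectedInitialSnapshots(rangeCount int) int {
	return 2 * rangeCount
}

func main() {
	fmt.Println(expectedInitialSnapshots(67))     // 134: the expectation with 67 ranges
	fmt.Println(expectedInitialSnapshots(67) + 1) // 135: failure mode 1, one extra snapshot for a system range
	fmt.Println(expectedInitialSnapshots(68))     // 136: failure mode 2, counting r68 from a later split
}
```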

In either case, it seems like a test issue.

kvoli commented 5 months ago

> More replicas than expected (like the failure in this issue): expected: 134, actual: 135. In the logs, it's usually one system range (r3 or r4) that ends up with one more replica than expected. We expect 2 new replicas per range, but for some reason one system range got 3.

I also reproduced this; it appears to be due to a change replicas error caused by the descriptor changing:

Details

```
# n1 tries adding n2 as a voter (currently n1,n3)
7 13@kv/kvserver/allocator/plan/replicate.go:452 ⋮ [T1,Vsystem,n1,replicate,s1,r30/1:‹/Table/2{7-8}›] 1095 adding voter n2,s2: [1*:121, 2:121]
7 13@kv/kvserver/replica_command.go:2379 ⋮ [T1,Vsystem,n1,replicate,s1,r30/1:‹/Table/2{7-8}›] 1096 change replicas (add [(n2,s2):3LEARNER] remove []): existing descriptor r30:‹/Table/2{7-8}› [(n1,s1):1, (n3,s3):2, next=3, gen=2]
7 13@kv/kvserver/replica_raft.go:388 ⋮ [T1,Vsystem,n1,s1,r30/1:‹/Table/2{7-8}›] 1097 proposing SIMPLE(l3) [(n2,s2):3LEARNER]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):3LEARNER] next=4
50777 13@kv/kvserver/store_snapshot.go:1872 ⋮ [T1,Vsystem,n1,s1,r30/1:‹/Table/2{7-8}›] 1098 streamed snapshot f323dad0 at applied index 123 to (n2,s2):3LEARNER with 961 B in 0.30s @ 3.2 KiB/s: kvs=18 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 0.10s
50752 13@kv/kvserver/replica_raftstorage.go:583 ⋮ [T1,Vsystem,n2,s2,r30/3:{-}] 1099 applying snapshot f323dad0 from (n1,s1):1 at applied index 123
# n3 removes n2, it must have been transferred the lease (?!) and immediately
# removes any learners it sees.
51345 13@kv/kvserver/replica_command.go:2379 ⋮ [T1,Vsystem,n3,replicate,s3,r30/2:‹/Table/2{7-8}›] 1100 change replicas (add [] remove [(n2,s2):3LEARNER]): existing descriptor r30:‹/Table/2{7-8}› [(n1,s1):1, (n3,s3):2, (n2,s2):3LEARNER, next=4, gen=3]
51345 13@kv/kvserver/replica_raft.go:388 ⋮ [T1,Vsystem,n3,s3,r30/2:‹/Table/2{7-8}›] 1101 proposing SIMPLE(r3) [(n2,s2):3LEARNER]: after=[(n1,s1):1 (n3,s3):2] next=4
# n1's change replicas fails as a result due to a descriptor mismatch.
7 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n1,replicate,s1,r30/1:‹/Table/2{7-8}›] 1102 error processing replica: change replicas of r30 failed: descriptor changed: [expected] r30:‹/Table/2{7-8}› [(n1,s1):1, (n3,s3):2, (n2,s2):3LEARNER, next=4, gen=3] != [actual] r30:‹/Table/2{7-8>
# n3 up-replicates the range, adding n2 as a voter. initial snapshots is 1
# greater than expected from the descriptor mismatch.
51546 13@kv/kvserver/allocator/plan/replicate.go:452 ⋮ [T1,Vsystem,n3,replicate,s3,r30/2:‹/Table/2{7-8}›] 1103 adding voter n2,s2: [1:131, 2*:131]
51546 13@kv/kvserver/replica_command.go:2379 ⋮ [T1,Vsystem,n3,replicate,s3,r30/2:‹/Table/2{7-8}›] 1105 change replicas (add [(n2,s2):4LEARNER] remove []): existing descriptor r30:‹/Table/2{7-8}› [(n1,s1):1, (n3,s3):2, next=4, gen=4]
51546 13@kv/kvserver/replica_raft.go:388 ⋮ [T1,Vsystem,n3,s3,r30/2:‹/Table/2{7-8}›] 1108 proposing SIMPLE(l4) [(n2,s2):4LEARNER]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):4LEARNER] next=5
52079 13@kv/kvserver/store_snapshot.go:1872 ⋮ [T1,Vsystem,n3,s3,r30/2:‹/Table/2{7-8}›] 1112 streamed snapshot 929bb130 at applied index 133 to (n2,s2):4LEARNER with 1.2 KiB in 0.29s @ 4.3 KiB/s: kvs=21 rangeKVs=0 sharedSSTs=0, rate-limit: 32 MiB/s, queued: 3.50s
51997 13@kv/kvserver/replica_raftstorage.go:583 ⋮ [T1,Vsystem,n2,s2,r30/4:{-}] 1113 applying snapshot 929bb130 from (n3,s3):2 at applied index 133
7 13@kv/kvserver/replica_raft.go:388 ⋮ [T1,Vsystem,n1,s1,r7/1:‹/Table/{3-4}›] 1114 proposing SIMPLE(v3) [(n2,s2):3]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):3] next=4
51546 13@kv/kvserver/replica_command.go:2379 ⋮ [T1,Vsystem,n3,replicate,s3,r30/2:‹/Table/2{7-8}›] 1115 change replicas (add [(n2,s2):4] remove []): existing descriptor r30:‹/Table/2{7-8}› [(n1,s1):1, (n3,s3):2, (n2,s2):4LEARNER, next=5, gen=5]
51546 13@kv/kvserver/replica_raft.go:388 ⋮ [T1,Vsystem,n3,s3,r30/2:‹/Table/2{7-8}›] 1118 proposing SIMPLE(v4) [(n2,s2):4]: after=[(n1,s1):1 (n3,s3):2 (n2,s2):4] next=5
```

I don't see a lease transfer anywhere in the logs for this range, which is odd -- but again just a testing issue.
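For anyone else poking at this, a minimal sketch of the race above (illustrative Go, not the real kvserver types or API): a replication change is effectively a compare-and-swap against the range descriptor the proposer last read, so once n3 removes the learner and bumps the descriptor, n1's queued change is rejected, and n3's retry sends the extra initial snapshot that throws off the count.

```go
package main

import "fmt"

// rangeDesc stands in for the real range descriptor; only the generation
// matters for this sketch.
type rangeDesc struct {
	generation int
	replicas   []string
}

// changeReplicas rejects the change if the descriptor has moved on since the
// proposer read it, loosely mimicking the conditional write that produces the
// "descriptor changed" error in the logs above.
func changeReplicas(expected, current rangeDesc, apply func(rangeDesc) rangeDesc) (rangeDesc, error) {
	if expected.generation != current.generation {
		return current, fmt.Errorf(
			"descriptor changed: [expected] gen=%d != [actual] gen=%d",
			expected.generation, current.generation)
	}
	return apply(current), nil
}

func main() {
	// n1 last saw the descriptor at gen=3, with the learner it just added...
	seenByN1 := rangeDesc{generation: 3, replicas: []string{"n1", "n3", "n2(learner)"}}
	// ...but n3 has since removed the learner, bumping the generation.
	current := rangeDesc{generation: 4, replicas: []string{"n1", "n3"}}

	_, err := changeReplicas(seenByN1, current, func(d rangeDesc) rangeDesc {
		d.replicas = append(d.replicas, "n2")
		d.generation++
		return d
	})
	fmt.Println(err) // n1's change fails; n3 then re-adds n2, issuing a second initial snapshot.
}
```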

> Fewer replicas than expected: expected: 136, actual: 134. Each range (r1 through r67, inclusive) has 2 additional replicas, as expected, but by the time the test visits the nodes there are 68 ranges, so it expects 136; r68 appears to be the result of a later load-based split. Disabling load-based splitting didn't help.

Could this be due to span config splits? I see a few splits + merges when this failed with a similar count.

Marking as a testing issue and assigning P3.