cockroachdb / cockroach


kv/kvserver: TestReplicateQueueRebalanceMultiStore failed #117796

Closed by cockroach-teamcity 8 months ago

cockroach-teamcity commented 9 months ago

kv/kvserver.TestReplicateQueueRebalanceMultiStore failed on master @ 75f19bec66a542d58680e9ccb47d57598f48fe20:

    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 32 30 33 31 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 31 30 33 32 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 31 30 33 33 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 30 30 33 33 29]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [17 21 7 9 3 7 11 7]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [31 34 35 29 29 33 33 30]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [17 21 7 9 3 7 11 8]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [32 34 35 29 28 33 33 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [17 21 7 9 3 7 11 8]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [34 33 34 29 28 33 33 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [17 20 7 8 5 7 10 9]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [34 33 33 29 31 33 32 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 20 6 8 6 7 9 9]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [34 33 32 29 32 33 32 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 19 6 8 8 7 8 9]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [34 32 31 29 33 33 32 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 19 6 8 8 7 8 9]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [34 31 31 29 35 33 31 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 19 6 8 9 6 8 9]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [34 31 31 29 35 33 31 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 19 6 8 9 6 8 9]
    replicate_queue_test.go:329: current replica state (want at least 28 replicas on all stores): [34 30 31 29 35 30 31 31]
    replicate_queue_test.go:340: current lease state (want at least 7 leases on all stores): [18 18 7 8 9 7 8 8]
    replicate_queue_test.go:355: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:355
            Error:          Received unexpected error:
                            pq: inbox communication error: rpc error: code = Canceled desc = context canceled
            Test:           TestReplicateQueueRebalanceMultiStore/multi-store
    --- FAIL: TestReplicateQueueRebalanceMultiStore/multi-store (450.47s)
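
For context on the repeated assertion lines above: the test re-evaluates a per-store threshold (at least 28 replicas and at least 7 leases on every store) until it holds or the test deadline expires; here one store's lease count stays stuck at 3, so the check never passes. The snippet below is only a self-contained sketch of that retry pattern, not the actual code in replicate_queue_test.go; succeedsWithin, the timeout, and the hard-coded counts are illustrative stand-ins.

    package main

    import (
        "fmt"
        "time"
    )

    // succeedsWithin retries check until it returns nil or the timeout elapses,
    // mirroring the retry loop behind the repeated "current replica state" /
    // "current lease state" lines above. Hypothetical helper, not CockroachDB's
    // test utilities.
    func succeedsWithin(timeout time.Duration, check func() error) error {
        deadline := time.Now().Add(timeout)
        for {
            err := check()
            if err == nil || time.Now().After(deadline) {
                return err
            }
            time.Sleep(time.Second)
        }
    }

    func main() {
        // Per-store lease counts taken from one of the log lines above; store 5
        // never climbs past 3 leases while the test wants at least 7 everywhere,
        // so the check keeps failing until the overall test times out.
        leasesPerStore := []int{18, 21, 7, 9, 3, 7, 11, 7}
        const minLeases = 7
        err := succeedsWithin(5*time.Second, func() error {
            for i, n := range leasesPerStore {
                if n < minLeases {
                    return fmt.Errorf("store %d has %d leases, want at least %d", i+1, n, minLeases)
                }
            }
            return nil
        })
        fmt.Println("lease balance check:", err)
    }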

Parameters:

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-35361

cockroach-teamcity commented 9 months ago

kv/kvserver.TestReplicateQueueRebalanceMultiStore failed on master @ 1898adc059e9b3026fd5a7ed88dc41cbb006abfc:

=== RUN   TestReplicateQueueRebalanceMultiStore
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestReplicateQueueRebalanceMultiStore2168183203
    test_log_scope.go:81: use -show-logs to present logs inline
    replicate_queue_test.go:367: -- test log scope end --
test logs left over in: outputs.zip/logTestReplicateQueueRebalanceMultiStore2168183203
--- FAIL: TestReplicateQueueRebalanceMultiStore (649.34s)
=== RUN   TestReplicateQueueRebalanceMultiStore/multi-store
    replicate_queue_test.go:273: split off r68:/{Table/100-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=9, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r69:/{Table/101-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=10, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r70:/{Table/102-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=11, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r71:/{Table/103-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=12, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r72:/{Table/104-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=13, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r73:/{Table/105-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=14, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r74:/{Table/106-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=15, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r75:/{Table/107-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=16, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r76:/{Table/108-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=17, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r77:/{Table/109-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=18, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r78:/{Table/110-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=19, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r79:/{Table/111-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=20, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r80:/{Table/112-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=21, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r81:/{Table/113-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=22, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:273: split off r82:/{Table/114-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=23, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:259: condition failed to evaluate within 45s: from testserver.go:2072: /Table/115: split unexpected error: replica unavailable: (n3,s6):3 unable to serve request to r82:/{Table/114-Max} [(n4,s7):1, (n2,s4):2, (n3,s6):3, next=4, gen=23, sticky=9223372036.854775807,2147483647]: closed timestamp: 1705471598.729411517,0 (2024-01-17 06:06:38); raft status: {"id":"3","term":64,"vote":"3","commit":24,"lead":"3","raftState":"StateLeader","applied":24,"progress":{"1":{"match":0,"next":40,"state":"StateProbe"},"2":{"match":0,"next":40,"state":"StateProbe"},"3":{"match":40,"next":41,"state":"StateReplicate"}},"leadtransferee":"0"}: have been waiting 61.00s for slow proposal RequestLease [/Table/114]
    --- FAIL: TestReplicateQueueRebalanceMultiStore/multi-store (604.46s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

kvoli commented 8 months ago

This is one of the heaviest tests in the kvserver suite, and it appears to be struggling for resources. The test cluster setup fails due to unavailable ranges, which appear to be the result of chronic overload (disk writes stalling):

W240117 06:15:37.883988 10378317 kv/kvserver/liveness/liveness.go:752 ⋮ [T1,Vsystem,n1,s2,r52/4:‹/Table/5{0-1}›] 42143  slow heartbeat took 4m52.898111002s; err=disk write failed while updating node liveness: node unavailable; try another peer

I'm going to mark this as a test issue and P3: insufficient resources for the test size.

We should not run this test under --stress. I'll open a PR.
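
A hedged sketch of how such a PR typically gates a test in this repo: skip it under stress using the helper in pkg/testutils/skip. The placement, reason string, and any extra guards below are assumptions about the eventual change, not its actual contents.

    package kvserver_test

    import (
        "testing"

        "github.com/cockroachdb/cockroach/pkg/testutils/skip"
    )

    func TestReplicateQueueRebalanceMultiStore(t *testing.T) {
        // Assumed fix (sketch): bail out when the test binary runs under
        // stress, since this test cannot get the CPU and disk bandwidth it
        // needs and fails with unavailable ranges (see the liveness
        // heartbeat stall above).
        skip.UnderStress(t, "too resource intensive; see #117796")

        // ... existing test body unchanged ...
    }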