cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: multitenant-multiregion failed #130322

Closed cockroach-teamcity closed 2 weeks ago

cockroach-teamcity commented 2 months ago

roachtest.multitenant-multiregion failed with artifacts on release-24.2 @ 544673f4fc982fbb3abdd2824f09282f668d2959:

(test_runner.go:1284).runTest: test timed out (20m0s)
test artifacts and logs in: /artifacts/multitenant-multiregion/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/sql-foundations


Jira issue: CRDB-41997

fqazi commented 2 months ago

Let's see if this one recurs. From the logs, we can see that the rangefeed on the descriptors table is stuck as well:

I240909 10:48:58.037077 33072 kv/kvserver/liveness/liveness.go:1069 ⋮ [T1,Vsystem,n1,s1,r88/1:‹/Tenant/123/Table/3{4-9/2…}›] 4561  retrying liveness update after ‹liveness.errRetryLiveness›: result is ambiguous: error=replica unavailable: (n4,s4):3 unable to serve request to r2:‹/System/NodeLiveness{-Max}› [(n9,s9):7, (n6,s6):6, (n4,s4):3, (n7,s7):4, (n8,s8):5, next=8, gen=16]: lost quorum (down: (n9,s9):7,(n7,s7):4,(n8,s8):5); closed timestamp: 1725877884.794685527,0 (2024-09-09 10:31:24); raft status: {"id":"3","term":9,"vote":"5","commit":533,"lead":"0","raftState":"StatePreCandidate","applied":533,"progress":{},"leadtransferee":"0"}: have been waiting 63.00s for slow proposal RequestLease [/System/NodeLiveness] [exhausted] (last error: ‹failed to send RPC›: sending to all replicas failed; last error: failed to connect to n9 at ‹10.142.0.220:26257›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.0.220:26257: connect: connection refused"› [code 14/Unavailable])
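For readers skimming the log line above, the "lost quorum" part can be checked with simple majority arithmetic. The sketch below is illustrative only and is not CockroachDB code: `quorum_size` and `has_quorum` are hypothetical helper names, and it assumes all five replicas listed for r2 are voting replicas.

```python
# Illustrative majority check for the r2 descriptor in the log line above.
# Assumption: all five listed replicas are voters; helper names are hypothetical.

def quorum_size(num_voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return num_voters // 2 + 1


def has_quorum(replicas: list[str], down: set[str]) -> bool:
    """True if the replicas that are not down still form a majority."""
    live = [r for r in replicas if r not in down]
    return len(live) >= quorum_size(len(replicas))


# Replica nodes of r2 and the nodes reported down, copied from the log line.
r2_replicas = ["n9", "n6", "n4", "n7", "n8"]
r2_down = {"n9", "n7", "n8"}

print(quorum_size(len(r2_replicas)))     # majority needed out of 5 voters
print(has_quorum(r2_replicas, r2_down))  # only n6 and n4 remain
```

With three of five replicas down, only two remain against a required majority of three, which matches the node liveness range being unable to serve the lease request.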

I240909 10:48:58.905181 326 kv/kvserver/replica_rangefeed.go:869 ⋮ [T1,Vsystem,n1,s1,r7/1:‹/Table/{3-4}›] 4562  RangeFeed closed timestamp 1725877885.052760105,0 is behind by 17m33.852415628s
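As a sanity check on the "behind by" figure, the closed timestamp can be converted to wall-clock time and subtracted from the log line's own timestamp. This is a rough sketch rather than anything from the CockroachDB codebase: `parse_closed_ts` and `parse_log_time` are hypothetical helpers, the closed timestamp is assumed to be `<unix seconds>.<nanoseconds>,<logical>`, and the log prefix is assumed to be in UTC.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helpers; the formats below are assumptions based on the log lines.

def parse_closed_ts(ts: str) -> datetime:
    """Parse '<secs>.<nanos>,<logical>' into a UTC datetime (microsecond precision)."""
    wall, _logical = ts.split(",")
    secs, nanos = wall.split(".")
    base = datetime.fromtimestamp(int(secs), tz=timezone.utc)
    return base + timedelta(microseconds=int(nanos) // 1000)


def parse_log_time(prefix: str) -> datetime:
    """Parse a log prefix such as 'I240909 10:48:58.905181', assumed to be UTC."""
    stamp = prefix[1:]  # drop the leading severity letter
    parsed = datetime.strptime(stamp, "%y%m%d %H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


closed = parse_closed_ts("1725877885.052760105,0")
logged = parse_log_time("I240909 10:48:58.905181")
lag = logged - closed

print(closed.isoformat())
print(lag)  # should be close to the 17m33.85s reported in the log line
```

The computed lag agrees with the logged value to within a few microseconds, and the closed timestamp lands at about 10:31:25, roughly when the liveness range's closed timestamp in the previous log line (10:31:24) stopped advancing.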

github-actions[bot] commented 3 weeks ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.