cockroach-teamcity opened this issue 2 weeks ago
This is, sadly, not another instance of the problem we're trying to diagnose in #134527. Instead, it seems that the gossip checks themselves are fairly slow:
2024/11/14 09:25:09 cluster.go:702: test status: stopping nodes :9
teamcity-17753083-1731566467-55-n9cpu4: stopping
2024/11/14 09:25:11 gossip.go:122: test status: waiting for gossip to exclude dead node 9
2024/11/14 09:25:11 gossip.go:125: checking if gossip excludes dead node 9 (0s)
2024/11/14 09:25:11 gossip.go:87: 1: checking gossip
2024/11/14 09:25:11 gossip.go:91: 1: gossip not ok (dead node 9 present) (1s)
2024/11/14 09:25:11 gossip.go:140: sleeping for 1s (1s)
2024/11/14 09:25:12 gossip.go:125: checking if gossip excludes dead node 9 (2s)
2024/11/14 09:25:12 gossip.go:87: 1: checking gossip
2024/11/14 09:25:13 gossip.go:91: 1: gossip not ok (dead node 9 present) (2s)
2024/11/14 09:25:13 gossip.go:140: sleeping for 1s (2s)
2024/11/14 09:25:14 gossip.go:125: checking if gossip excludes dead node 9 (3s)
2024/11/14 09:25:14 gossip.go:87: 1: checking gossip
2024/11/14 09:25:16 gossip.go:98: 1: found 8 live nodes
2024/11/14 09:25:16 gossip.go:87: 2: checking gossip
2024/11/14 09:25:17 gossip.go:87: 3: checking gossip
2024/11/14 09:25:19 gossip.go:87: 4: checking gossip
2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip
2024/11/14 09:25:28 gossip.go:87: 7: checking gossip
2024/11/14 09:25:31 test_impl.go:460: test failure #1: full stack retained in failure_1.log: (gossip.go:80).2: gossip did not stabilize (dead node 9) in 20.2s
Note for example the 5s delay between those two lines:
2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip
The test wants to assert that gossip converges within 20s, but it runs a number of checks that it assumes take far less than 20s. We're seeing here that the checks themselves can eat up the entire 20s budget.
I wouldn't call this expected, though. Here's the code that runs in the check:
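(The snippet didn't survive into this report. As a stand-in, here is a minimal, self-contained sketch of its shape, inferred from the log lines above; names like `waitForGossip` and `gossipExcludesDeadNode` are illustrative, not the actual identifiers in gossip.go.)

```go
// Sketch of the wait/check structure, for illustration only. The real code in
// gossip.go differs in its details; the point is that the 20s budget also has
// to pay for the per-node checks themselves.
package main

import (
	"fmt"
	"log"
	"time"
)

// gossipExcludesDeadNode stands in for the real per-node check, which asks a
// surviving node (via SQL) for its view of the live membership.
func gossipExcludesDeadNode(node, deadNode int) bool {
	// ... query node `node` and compare against the expected membership ...
	return true
}

func waitForGossip(numNodes, deadNode int) error {
	start := time.Now()
	for {
		log.Printf("checking if gossip excludes dead node %d (%s)",
			deadNode, time.Since(start).Round(time.Second))
		if time.Since(start) > 20*time.Second {
			// This is the failure above: slow checks consume the whole budget.
			return fmt.Errorf("gossip did not stabilize (dead node %d) in %s",
				deadNode, time.Since(start).Round(100*time.Millisecond))
		}
		ok := true
		for i := 1; i <= numNodes; i++ {
			if i == deadNode {
				continue
			}
			log.Printf("%d: checking gossip", i)
			if !gossipExcludesDeadNode(i, deadNode) {
				ok = false
				break
			}
		}
		if ok {
			return nil
		}
		const sleepDur = time.Second
		log.Printf("sleeping for %s (%s)", sleepDur, time.Since(start).Round(time.Second))
		time.Sleep(sleepDur)
	}
}

func main() {
	if err := waitForGossip(9, 9); err != nil {
		log.Fatal(err)
	}
}
```

In the failing run above, the third pass over the nodes alone spans 09:25:14 to 09:25:31, so the 20s deadline fires before that pass even completes.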
Nothing in that code should take an appreciable amount of time (though consider that the problem in https://github.com/cockroachdb/cockroach/pull/134527 might well boil down to slow t.L().Printf calls), and neither should the query on which the check is based.
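The query itself also wasn't captured here; presumably it is something of roughly this shape (an illustrative stand-in, not necessarily the exact SQL the test runs):

```sql
-- Illustrative stand-in: which nodes does this gateway currently consider
-- live according to its local gossip? gossip_liveness.expiration is an HLC
-- timestamp rendered as a string such as '1731575109.000000000,0'.
SELECT node_id
  FROM crdb_internal.gossip_liveness
 WHERE split_part(expiration, ',', 1)::FLOAT8 > extract(epoch FROM now())
 ORDER BY node_id;
```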
But we just restarted a node, so I can understand how logging into a SQL shell might take some time (on the order of 6s, the lease interval, since we're killing nodes here). What's odd is that it also reliably seems to take at least ~1s to run this query, and why would a check that isn't the first one suddenly see a lease failover? Running the query on a three-node roachprod cluster I had lying around (which is at >90% CPU) shows that in general it executes ~instantly, as it should, since it only reads node-local data.
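If anyone wants to reproduce that measurement, a tiny pgwire client is enough; here's a sketch (the node addresses, port, and insecure connection string are placeholders, and the query is the illustrative one from above):

```go
// Time the gossip-liveness query against each node. Illustrative only: the
// hosts, port, and insecure connection string are assumptions about the
// cluster being measured.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

const query = `
SELECT node_id
  FROM crdb_internal.gossip_liveness
 WHERE split_part(expiration, ',', 1)::FLOAT8 > extract(epoch FROM now())
 ORDER BY node_id`

func main() {
	hosts := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} // placeholder node addresses
	for _, host := range hosts {
		db, err := sql.Open("postgres",
			fmt.Sprintf("postgresql://root@%s:26257/defaultdb?sslmode=disable", host))
		if err != nil {
			log.Fatal(err)
		}
		for i := 0; i < 5; i++ {
			start := time.Now()
			rows, err := db.Query(query)
			if err != nil {
				log.Fatal(err)
			}
			n := 0
			for rows.Next() {
				n++
			}
			if err := rows.Err(); err != nil {
				log.Fatal(err)
			}
			_ = rows.Close()
			log.Printf("%s: %d live nodes in %s", host, n, time.Since(start))
		}
		_ = db.Close()
	}
}
```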
Removing the release-blocker label, as I consider it unlikely that there's anything new here. This test has been flaky for a long time.
When I ran this test manually, I didn't see these pathological timings. 3s for everything isn't nothing, but it's a lot less than 20s.
2024/11/15 11:17:33 cluster.go:701: test status: stopping nodes :5
tobias-1731669232-01-n9cpu4: stopping
2024/11/15 11:17:35 gossip.go:134: test status: waiting for gossip to exclude dead node 5
2024/11/15 11:17:35 gossip.go:137: checking if gossip excludes dead node 5 (0s)
2024/11/15 11:17:35 gossip.go:99: 1: checking gossip
2024/11/15 11:17:35 gossip.go:103: 1: gossip not ok (dead node 5 present) (0s)
2024/11/15 11:17:35 gossip.go:152: sleeping for 1s (0s)
2024/11/15 11:17:36 gossip.go:137: checking if gossip excludes dead node 5 (1s)
2024/11/15 11:17:36 gossip.go:99: 1: checking gossip
2024/11/15 11:17:36 gossip.go:103: 1: gossip not ok (dead node 5 present) (1s)
2024/11/15 11:17:36 gossip.go:152: sleeping for 1s (1s)
2024/11/15 11:17:37 gossip.go:137: checking if gossip excludes dead node 5 (2s)
2024/11/15 11:17:37 gossip.go:99: 1: checking gossip
2024/11/15 11:17:37 gossip.go:110: 1: found 8 live nodes
2024/11/15 11:17:37 gossip.go:99: 2: checking gossip
2024/11/15 11:17:37 gossip.go:99: 3: checking gossip
2024/11/15 11:17:37 gossip.go:99: 4: checking gossip
2024/11/15 11:17:37 gossip.go:99: 6: checking gossip
2024/11/15 11:17:37 gossip.go:99: 7: checking gossip
2024/11/15 11:17:37 gossip.go:99: 8: checking gossip
2024/11/15 11:17:37 gossip.go:99: 9: checking gossip
2024/11/15 11:17:38 gossip.go:129: gossip ok (size: 8) (3s)
Reading the above, I'm guessing this is similar to the other failure on a different branch. I'll match the labels rather than mark it as a duplicate, just in case: https://github.com/cockroachdb/cockroach/issues/135059.
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:
Parameters:
- arch=arm64
- cloud=azure
- coverageBuild=false
- cpu=4
- encrypted=false
- fs=ext4
- localSSD=true
- metamorphicLeases=expiration
- runtimeAssertionsBuild=false
- ssd=0
Help:
- See: roachtest README
- See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Same failure on other branches
- #135059 roachtest: gossip/chaos/nodes=9 failed [C-test-failure O-roachtest O-robot T-kv branch-release-24.2 release-blocker]
/cc @cockroachdb/kv-triage
Jira issue: CRDB-44376