
roachtest: gossip/chaos/nodes=9 failed #135154

cockroach-teamcity opened this issue 2 weeks ago

cockroach-teamcity commented 2 weeks ago

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:

(gossip.go:80).2: gossip did not stabilize (dead node 9) in 20.2s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for Azure clusters

Same failure on other branches

- #135059 roachtest: gossip/chaos/nodes=9 failed [C-test-failure O-roachtest O-robot T-kv branch-release-24.2 release-blocker]

/cc @cockroachdb/kv-triage


Jira issue: CRDB-44376

tbg commented 2 weeks ago

This is, sadly, not another instance of the problem we're trying to diagnose with #134527. Instead, it seems that the gossip checks are fairly slow:

2024/11/14 09:25:09 cluster.go:702: test status: stopping nodes :9
teamcity-17753083-1731566467-55-n9cpu4: stopping
2024/11/14 09:25:11 gossip.go:122: test status: waiting for gossip to exclude dead node 9
2024/11/14 09:25:11 gossip.go:125: checking if gossip excludes dead node 9 (0s)
2024/11/14 09:25:11 gossip.go:87: 1: checking gossip
2024/11/14 09:25:11 gossip.go:91: 1: gossip not ok (dead node 9 present) (1s)
2024/11/14 09:25:11 gossip.go:140: sleeping for 1s (1s)
2024/11/14 09:25:12 gossip.go:125: checking if gossip excludes dead node 9 (2s)
2024/11/14 09:25:12 gossip.go:87: 1: checking gossip
2024/11/14 09:25:13 gossip.go:91: 1: gossip not ok (dead node 9 present) (2s)
2024/11/14 09:25:13 gossip.go:140: sleeping for 1s (2s)
2024/11/14 09:25:14 gossip.go:125: checking if gossip excludes dead node 9 (3s)
2024/11/14 09:25:14 gossip.go:87: 1: checking gossip
2024/11/14 09:25:16 gossip.go:98: 1: found 8 live nodes
2024/11/14 09:25:16 gossip.go:87: 2: checking gossip
2024/11/14 09:25:17 gossip.go:87: 3: checking gossip
2024/11/14 09:25:19 gossip.go:87: 4: checking gossip
2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip
2024/11/14 09:25:28 gossip.go:87: 7: checking gossip
2024/11/14 09:25:31 test_impl.go:460: test failure #1: full stack retained in failure_1.log: (gossip.go:80).2: gossip did not stabilize (dead node 9) in 20.2s

Note, for example, the 5s delay between these two lines:

2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip

The test asserts that gossip converges within 20s, but it runs a series of per-node checks that it assumes take much less than 20s. We're seeing here that those checks can consume the entire 20s budget on their own.
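To make that structure concrete, here is a minimal sketch of this kind of deadline-bounded wait loop (the names and the `check` helper are illustrative stand-ins, not the test's actual code):

```go
package sketch

import (
	"fmt"
	"time"
)

// waitForGossipConvergence illustrates the shape of the wait loop: a
// single deadline shared by every round of per-node checks. If one
// round of checks takes several seconds, the checks alone can consume
// the budget before gossip ever gets a chance to converge.
func waitForGossipConvergence(check func() bool, deadNode int, budget time.Duration) error {
	deadline := time.Now().Add(budget)
	for !check() {
		if time.Now().After(deadline) {
			return fmt.Errorf("gossip did not stabilize (dead node %d) in %s", deadNode, budget)
		}
		time.Sleep(1 * time.Second) // matches the "sleeping for 1s" log lines above
	}
	return nil
}
```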

I wouldn't call this expected, though. Here's the code that runs in the check:

https://github.com/cockroachdb/cockroach/blob/39e43b85ec3b02bc760df10fce1c19d09419d6f2/pkg/cmd/roachtest/tests/gossip.go#L87-L115

Nothing here should take an appreciable amount of time (though consider that the problem in https://github.com/cockroachdb/cockroach/pull/134527 might well boil down to a slow t.L().Printf), and neither should the query on which the check is based:

https://github.com/cockroachdb/cockroach/blob/39e43b85ec3b02bc760df10fce1c19d09419d6f2/pkg/cmd/roachtest/tests/gossip.go#L50-L55
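That query reads a `crdb_internal` virtual table, which is served from the node's in-memory gossip state, so no range leases or remote KV reads should be involved. As a hedged sketch of that kind of node-local check (the exact SQL in gossip.go may differ; `crdb_internal.gossip_liveness` is one such gossip-backed table, and the real test applies additional liveness filtering):

```go
package sketch

import (
	"context"
	"database/sql"
	"time"
)

// liveNodeIDs returns the node IDs this node currently has gossip
// liveness records for. crdb_internal.gossip_liveness is a virtual
// table backed by in-memory gossip data, so the query should be served
// locally without acquiring any range leases.
func liveNodeIDs(ctx context.Context, db *sql.DB) ([]int, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	rows, err := db.QueryContext(ctx,
		`SELECT node_id FROM crdb_internal.gossip_liveness`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []int
	for rows.Next() {
		var id int
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```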

But we just restarted a node, so I can understand how opening a SQL connection might take some time (on the order of 6s, the lease interval; we're killing nodes here). But each query also reliably seems to take at least around 1s, and why would a query that isn't the first one suddenly see a lease failover? Running the query on a three-node roachprod cluster I had lying around (which is at >90% CPU) shows that in general it executes ~instantly, as it should, since it only reads node-local data.
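If one wanted to double-check that, a sketch of how to time the query while separating connection setup (which can plausibly be slow right after a node restart) from query execution (which should be near-instant for a gossip-backed virtual table); illustrative Go, not roachtest code:

```go
package sketch

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// timeGossipQuery measures how long the node-local gossip query takes
// on an already-established connection, so that connection-setup cost
// is excluded from the measurement.
func timeGossipQuery(ctx context.Context, db *sql.DB) (time.Duration, error) {
	conn, err := db.Conn(ctx) // pay the connection-setup cost up front
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	start := time.Now()
	var n int
	if err := conn.QueryRowContext(ctx,
		`SELECT count(*) FROM crdb_internal.gossip_liveness`).Scan(&n); err != nil {
		return 0, err
	}
	elapsed := time.Since(start)
	fmt.Printf("gossip_liveness rows: %d, query took %s\n", n, elapsed)
	return elapsed, nil
}
```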

tbg commented 2 weeks ago

Removing the release-blocker label, as I consider it unlikely that there's anything new here. This test has been flaky for a long time.

tbg commented 2 weeks ago

When I ran this test manually, I didn't see these pathological timings. 3s for everything isn't nothing, but it's a lot less than 20s.

2024/11/15 11:17:33 cluster.go:701: test status: stopping nodes :5
tobias-1731669232-01-n9cpu4: stopping
2024/11/15 11:17:35 gossip.go:134: test status: waiting for gossip to exclude dead node 5
2024/11/15 11:17:35 gossip.go:137: checking if gossip excludes dead node 5 (0s)
2024/11/15 11:17:35 gossip.go:99: 1: checking gossip
2024/11/15 11:17:35 gossip.go:103: 1: gossip not ok (dead node 5 present) (0s)
2024/11/15 11:17:35 gossip.go:152: sleeping for 1s (0s)
2024/11/15 11:17:36 gossip.go:137: checking if gossip excludes dead node 5 (1s)
2024/11/15 11:17:36 gossip.go:99: 1: checking gossip
2024/11/15 11:17:36 gossip.go:103: 1: gossip not ok (dead node 5 present) (1s)
2024/11/15 11:17:36 gossip.go:152: sleeping for 1s (1s)
2024/11/15 11:17:37 gossip.go:137: checking if gossip excludes dead node 5 (2s)
2024/11/15 11:17:37 gossip.go:99: 1: checking gossip
2024/11/15 11:17:37 gossip.go:110: 1: found 8 live nodes
2024/11/15 11:17:37 gossip.go:99: 2: checking gossip
2024/11/15 11:17:37 gossip.go:99: 3: checking gossip
2024/11/15 11:17:37 gossip.go:99: 4: checking gossip
2024/11/15 11:17:37 gossip.go:99: 6: checking gossip
2024/11/15 11:17:37 gossip.go:99: 7: checking gossip
2024/11/15 11:17:37 gossip.go:99: 8: checking gossip
2024/11/15 11:17:37 gossip.go:99: 9: checking gossip
2024/11/15 11:17:38 gossip.go:129: gossip ok (size: 8) (3s)
kvoli commented 1 week ago

Reading the above, I'm guessing this is similar to the failure on the other branch. I'll match the labels rather than mark this as a duplicate, just in case: https://github.com/cockroachdb/cockroach/issues/135059.