cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: rebalance/by-load/replicas failed #135811

Open cockroach-teamcity opened 5 days ago

cockroach-teamcity commented 5 days ago

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.rebalance/by-load/replicas failed with artifacts on release-24.1 @ ed52acc6329e0dfa20e7e8a13dc47e959a65548c:

(assertions.go:363).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/rebalance_load.go:122
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/rebalance_load.go:177
                                main/pkg/cmd/roachtest/test_runner.go:1223
                                src/runtime/asm_amd64.s:1695
    Error:          Received unexpected error:
                    CPU not evenly balanced after timeout: outside bounds mean=77.8 tolerance=20.0% (±15.6) bounds=[62.3, 93.4]
                    (1) attached stack trace
                      -- stack trace:
                      | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.rebalanceByLoad.func2
                      |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/rebalance_load.go:309
                      | golang.org/x/sync/errgroup.(*Group).Go.func1
                      |     golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
                      | runtime.goexit
                      |     src/runtime/asm_amd64.s:1695
                    Wraps: (2) CPU not evenly balanced after timeout: outside bounds mean=77.8 tolerance=20.0% (±15.6) bounds=[62.3, 93.4]
                      |     below  = []
                      |     within = [s1: 76 (-1.5%), s2: 75 (-2.6%), s3: 74 (-4.1%), s4: 75 (-3.2%), s5: 71 (-8.7%)]
                      |     above  = [s6: 93 (+20.0%)]
                    Error types: (1) *withstack.withStack (2) *errutil.leafError
    Test:           rebalance/by-load/replicas
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/rebalance/by-load/replicas/run_1

/cc @cockroachdb/kv-triage

Jira issue: CRDB-44730

kvoli commented 4 days ago

This is right at the threshold:

CPU not evenly balanced after timeout: outside bounds mean=77.8 tolerance=20.0% (±15.6) bounds=[62.3, 93.4]
                      |     below  = []
                      |     within = [s1: 76 (-1.5%), s2: 75 (-2.6%), s3: 74 (-4.1%), s4: 75 (-3.2%), s5: 71 (-8.7%)]
                      |     above  = [s6: 93 (+20.0%)]

I'll take a look; we should have a CPU profile somewhere.
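For readers following along, here is a minimal sketch of the kind of bounds check the error message describes: compute the mean, widen it by the tolerance, and bucket each store as below, within, or above. It is an illustration only, not the code in `rebalance_load.go`; the names `balanceReport` and `checkBalance`, the inclusive bounds, and the sample readings are all assumptions. The values printed in the log appear to be rounded, so s6 at "93 (+20.0%)" against an upper bound of 93.4 is consistent with a reading that sits just past the limit.

```go
package main

import "fmt"

// balanceReport groups store indexes by where their CPU reading falls
// relative to the tolerance band around the mean. Hypothetical type,
// not taken from the roachtest source.
type balanceReport struct {
	Mean, Lower, Upper   float64
	Below, Within, Above []int // 0-based indexes into the input slice
}

// checkBalance computes mean*(1-tolerance) and mean*(1+tolerance) and
// buckets each reading. Treating the bounds as inclusive is an assumption.
func checkBalance(cpu []float64, tolerance float64) balanceReport {
	var r balanceReport
	if len(cpu) == 0 {
		return r
	}
	var sum float64
	for _, v := range cpu {
		sum += v
	}
	r.Mean = sum / float64(len(cpu))
	r.Lower = r.Mean * (1 - tolerance)
	r.Upper = r.Mean * (1 + tolerance)
	for i, v := range cpu {
		switch {
		case v < r.Lower:
			r.Below = append(r.Below, i)
		case v > r.Upper:
			r.Above = append(r.Above, i)
		default:
			r.Within = append(r.Within, i)
		}
	}
	return r
}

func main() {
	// Hypothetical readings, not the values from this run.
	cpu := []float64{100, 100, 100, 100, 100, 130}
	r := checkBalance(cpu, 0.20)
	fmt.Printf("mean=%.1f bounds=[%.1f, %.1f]\n", r.Mean, r.Lower, r.Upper)
	fmt.Printf("below=%v within=%v above=%v\n", r.Below, r.Within, r.Above)
}
```

Because the outlier itself raises the mean, a single hot store has to exceed the others by noticeably more than the tolerance before it lands in the above bucket, which is part of why a result like this one ends up so close to the threshold.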

kvoli commented 4 days ago

The replica CPU was being controlled as expected:

[image]

But the actual process CPU was not:

[image]

A CPU profile from n6 shows why the replica CPU could be within bounds while the process CPU is not quite:

[image]

Presumably, the SQL load here is not balanced. Unfortunately, the other nodes didn't have CPU profiles available to diff against.

I don't see value in investigating this further, except as motivation for using a different metric in the test.