cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: failover/chaos/read-only failed #126927

Open cockroach-teamcity opened 1 month ago

cockroach-teamcity commented 1 month ago

roachtest.failover/chaos/read-only failed with artifacts on release-24.1.2-rc @ 7e81be6de75205c3d08b0d8dcc6ca188306abc27:

(assertions.go:363).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1466
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:344
                                main/pkg/cmd/roachtest/monitor.go:120
                                golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
                                src/runtime/asm_amd64.s:1695
    Error:          Received unexpected error:
                    pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight load-value:authinfo-roachprod-2-2: context deadline exceeded
    Test:           failover/chaos/read-only
(require.go:1357).NoError: FailNow called
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-only/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #126542 roachtest: failover/chaos/read-only failed [A-testing C-bug C-test-failure O-roachtest O-robot P-3 T-kv branch-release-23.1.24-rc]

/cc @cockroachdb/kv-triage


Jira issue: CRDB-40187

andrewbaptist commented 1 month ago

I'm removing the release blocker. This appears to be a case where the test injects two back-to-back failures, a combination that isn't supported with epoch leases. Specifically:

09:27:00 failover.go:293: chaos iteration 5
09:28:12 failover.go:343: failing n8 (blackhole-recv)
09:28:12 failover.go:343: failing n9 (deadlock)

The deadlock doesn't go through because the blackhole has left the cluster in a bad state with epoch leases. With epoch leases, the cluster doesn't always regain availability after a blackhole, so the attempt to induce the deadlock fails.

This is a test problem where we need to disallow this combination.

Assigning myself and setting P3: this isn't a real issue, but it is also hard to fix without either crippling the test for epoch leases (limiting it to a single failure) or manually working out which combinations can't be run together in a metamorphic-like test.

arulajmani commented 1 month ago

> manually figuring out the combinations that can't be done together

This sounds promising. It seems this issue has come up before; @andrewbaptist, do you think it would help to at least list the incompatible combinations? Even if we don't address the issue by automatically selecting only from the compatible operations, having a list would make for quick triage.

github-actions[bot] commented 4 days ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.