cockroachdb / cockroach


roachtest: kv/splits/nodes=3/quiesce=true/lease=epoch failed #127721

Closed — cockroach-teamcity closed this issue 1 month ago

cockroach-teamcity commented 3 months ago

roachtest.kv/splits/nodes=3/quiesce=true/lease=epoch failed with artifacts on master @ 7fb362dd5aa6e85d65c4c89f208c5bed51ab9692:

(cluster.go:2456).Run: full command output in run_060135.737011402_n4_workload-run-kv-init.log: COMMAND_PROBLEM: exit status 1
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/lease=epoch/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for Azure clusters

/cc @cockroachdb/kv-triage


Jira issue: CRDB-40568

kvoli commented 3 months ago

Looks overloaded, based on this Slack response (internal):

Error: executing ALTER TABLE kv SPLIT AT VALUES (-6290595461644587904): pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 11.653s (given timeout 10s): txn exec: context deadline exceeded

And:

I240726 06:39:57.913222 264 kv/kvserver/store_raft.go:696 ⋮ [T1,Vsystem,n1,s1,r112645/1:‹/Table/106/1/-6137{979…-856…}›,raft] 9029  raft ready handling: 0.75s [append=0.00s, apply=0.75s, , other=0.00s], wrote [apply=1.4 KiB (1)], state_assertions=1; node might be overloaded

Interestingly, we see:

I240726 06:36:12.867248 251 kv/kvserver/replica_proposal_buf.go:670 ⋮ [T1,Vsystem,n1,s1,r26728/1:‹/Table/106/1/-4709{468…-284…}›,raft] 5611  campaigning because Raft leader (id=2) not live in node liveness map
...
I240726 06:37:40.208126 250 kv/kvserver/replica_proposal_buf.go:670 ⋮ [T1,Vsystem,n1,s1,r158504/1:‹/Table/106/1/2069{7485…-8100…}›,raft] 6691  campaigning because Raft leader (id=3) not live in node liveness map

I'm going to chalk this up to overload. We could probably do something to reduce the metrics CPU spent updating the replication gauges:

[image attachment]

github-actions[bot] commented 2 months ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.