cockroachdb / cockroach

roachtest: failover/partial/lease-leader/lease=leader failed #136082

Open · cockroach-teamcity opened 3 days ago

cockroach-teamcity commented 3 days ago

roachtest.failover/partial/lease-leader/lease=leader failed with artifacts on master @ f717f6bd218121bb5e3376af658545f6bff30c22:

(failover.go:1815).sleepFor: sleep failed: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
unexpected node event: n6: cockroach process for system interface died (exit code 7)
(cluster.go:2456).Run: context canceled
test artifacts and logs in: /artifacts/failover/partial/lease-leader/lease=leader/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for Azure clusters

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-44862

blathers-crl[bot] commented 1 day ago

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

andrewbaptist commented 1 day ago

I'm going to close this as an infra-flake and dupe of #97968, but I'm not 100% sure we aren't doing something wrong.

storage/pebble.go:1585 ⋮ [T1,n6,s6,pebble] 7746 disk stall detected: disk slowness detected: syncdata on file 000081.log has been ongoing for 20.0s

It started reporting slowness at:
08:53:47.749490 454201
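For readers unfamiliar with that message: the storage engine times every WAL sync, warns once a sync has been in flight too long, and eventually terminates the process on a persistent stall, which lines up with the "cockroach process ... died (exit code 7)" node event in the failure output. The Go sketch below only illustrates that general watchdog shape; it is not Pebble's actual implementation, and the thresholds, names, and the use of exit code 7 for a fatal stall are assumptions made for the example.

```go
// Illustrative sketch of a disk-stall watchdog; not Pebble's real code.
package main

import (
	"log"
	"os"
	"time"
)

const (
	stallWarnAfter  = 20 * time.Second // report slowness (hypothetical threshold)
	stallFatalAfter = 60 * time.Second // give up and kill the process (hypothetical)
)

// syncWithWatchdog runs sync() while a watchdog watches the clock. If the sync
// is still in flight past the thresholds, it logs a stall and finally exits.
func syncWithWatchdog(name string, sync func() error) error {
	done := make(chan error, 1)
	go func() { done <- sync() }()

	start := time.Now()
	warnTick := time.NewTicker(stallWarnAfter)
	defer warnTick.Stop()

	for {
		select {
		case err := <-done:
			return err
		case <-warnTick.C:
			elapsed := time.Since(start)
			log.Printf("disk stall detected: syncdata on file %s has been ongoing for %.1fs",
				name, elapsed.Seconds())
			if elapsed >= stallFatalAfter {
				// A stalled disk is treated as fatal; exiting with a dedicated
				// code (7 here, matching the node event above, though that
				// mapping is an assumption) lets the harness tell a disk stall
				// apart from an ordinary crash.
				os.Exit(7)
			}
		}
	}
}

func main() {
	f, err := os.CreateTemp("", "000081-*.log")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	if err := syncWithWatchdog(f.Name(), f.Sync); err != nil {
		log.Fatal(err)
	}
	log.Printf("sync completed without stalling")
}
```

The usual rationale for escalating from a warning to a crash is that a wedged disk is worse than a dead node: once the process exits, the rest of the cluster can move leases and leadership away from the stalled store.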

Note that this test only partitions the network, so it should not affect disk connectivity. It's possible that heavy write volume is overloading the disk and causing the stalls.
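To make "only partitions the network" concrete: the fault in a partial failover test is injected at the IP layer between specific node pairs, so storage I/O is never touched directly. The sketch below shows that mechanism in its simplest form (iptables DROP rules between two nodes); it is not the test's actual partition helper, and the peer address and helper name are invented for the example.

```go
// Illustrative sketch of a partial network partition; not the roachtest helper.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// setPartition installs (enable=true) or removes (enable=false) iptables rules
// that drop all traffic to and from peerIP. Run on the node being partitioned;
// requires root.
func setPartition(peerIP string, enable bool) error {
	action := "-D"
	if enable {
		action = "-A"
	}
	rules := [][]string{
		{"iptables", action, "INPUT", "-s", peerIP, "-j", "DROP"},
		{"iptables", action, "OUTPUT", "-d", peerIP, "-j", "DROP"},
	}
	for _, r := range rules {
		if out, err := exec.Command(r[0], r[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v: %v: %s", r, err, out)
		}
	}
	return nil
}

func main() {
	const peer = "10.0.0.6" // hypothetical internal address of the peer node (e.g. n6)

	// Inject the partial partition, leave it in place for the failure window,
	// then heal it. Disk I/O on this node is never touched by these rules.
	if err := setPartition(peer, true); err != nil {
		log.Fatal(err)
	}
	log.Printf("partial partition against %s in place", peer)

	if err := setPartition(peer, false); err != nil {
		log.Fatal(err)
	}
	log.Printf("partition healed")
}
```

Because only the network path between the chosen nodes is degraded, a syncdata call that takes 20s points at load on the disk itself (e.g. from heavy writes) rather than at the injected fault.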

This failed in a similar way to #136081 as well (both are network partitions under leader leases that resulted in disk stalls).