cockroachdb / cockroach

roachtest: failover/partial/lease-leader/lease=leader failed #136082

Open · cockroach-teamcity opened 3 days ago

cockroach-teamcity commented 3 days ago

roachtest.failover/partial/lease-leader/lease=leader failed with artifacts on master @ f717f6bd218121bb5e3376af658545f6bff30c22:

(failover.go:1815).sleepFor: sleep failed: context canceled
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
unexpected node event: n6: cockroach process for system interface died (exit code 7)
(cluster.go:2456).Run: context canceled
test artifacts and logs in: /artifacts/failover/partial/lease-leader/lease=leader/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for Azure clusters

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-44862

blathers-crl[bot] commented 1 day ago

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

andrewbaptist commented 1 day ago

I'm going to close this as an infra-flake and dupe of #97968, but I'm not 100% sure we aren't doing something wrong.

storage/pebble.go:1585 ⋮ [T1,n6,s6,pebble] 7746 disk stall detected: disk slowness detected: syncdata on file 000081.log has been ongoing for 20.0s

It started reporting slowness at:
08:53:47.749490 454201
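For readers unfamiliar with that message: the storage engine times every WAL sync, warns once a sync has been in flight too long, and eventually terminates the process on a persistent stall, which lines up with the "cockroach process ... died (exit code 7)" node event in the failure output. The Go sketch below only illustrates that general watchdog shape; it is not Pebble's actual implementation, and the thresholds, names, and the use of exit code 7 for a fatal stall are assumptions made for the example.

```go
// Illustrative sketch of a disk-stall watchdog; not Pebble's real code.
package main

import (
	"log"
	"os"
	"time"
)

const (
	stallWarnAfter  = 20 * time.Second // report slowness (hypothetical threshold)
	stallFatalAfter = 60 * time.Second // give up and kill the process (hypothetical)
)

// syncWithWatchdog runs sync() while a watchdog watches the clock. If the sync
// is still in flight past the thresholds, it logs a stall and finally exits.
func syncWithWatchdog(name string, sync func() error) error {
	done := make(chan error, 1)
	go func() { done <- sync() }()

	start := time.Now()
	warnTick := time.NewTicker(stallWarnAfter)
	defer warnTick.Stop()

	for {
		select {
		case err := <-done:
			return err
		case <-warnTick.C:
			elapsed := time.Since(start)
			log.Printf("disk stall detected: syncdata on file %s has been ongoing for %.1fs",
				name, elapsed.Seconds())
			if elapsed >= stallFatalAfter {
				// A stalled disk is treated as fatal; exiting with a dedicated
				// code (7 here, matching the node event above, though that
				// mapping is an assumption) lets the harness tell a disk stall
				// apart from an ordinary crash.
				os.Exit(7)
			}
		}
	}
}

func main() {
	f, err := os.CreateTemp("", "000081-*.log")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())

	if err := syncWithWatchdog(f.Name(), f.Sync); err != nil {
		log.Fatal(err)
	}
	log.Printf("sync completed without stalling")
}
```

The usual rationale for escalating from a warning to a crash is that a wedged disk is worse than a dead node: once the process exits, the rest of the cluster can move leases and leadership away from the stalled store.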

Note that this test only partitions the network, so it should not affect disk connectivity. It's possible that heavy write volume is overloading the disk and causing the stalls.
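To make "only partitions the network" concrete: the fault in a partial failover test is injected at the IP layer between specific node pairs, so storage I/O is never touched directly. The sketch below shows that mechanism in its simplest form (iptables DROP rules between two nodes); it is not the test's actual partition helper, and the peer address and helper name are invented for the example.

```go
// Illustrative sketch of a partial network partition; not the roachtest helper.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// setPartition installs (enable=true) or removes (enable=false) iptables rules
// that drop all traffic to and from peerIP. Run on the node being partitioned;
// requires root.
func setPartition(peerIP string, enable bool) error {
	action := "-D"
	if enable {
		action = "-A"
	}
	rules := [][]string{
		{"iptables", action, "INPUT", "-s", peerIP, "-j", "DROP"},
		{"iptables", action, "OUTPUT", "-d", peerIP, "-j", "DROP"},
	}
	for _, r := range rules {
		if out, err := exec.Command(r[0], r[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v: %v: %s", r, err, out)
		}
	}
	return nil
}

func main() {
	const peer = "10.0.0.6" // hypothetical internal address of the peer node (e.g. n6)

	// Inject the partial partition, leave it in place for the failure window,
	// then heal it. Disk I/O on this node is never touched by these rules.
	if err := setPartition(peer, true); err != nil {
		log.Fatal(err)
	}
	log.Printf("partial partition against %s in place", peer)

	if err := setPartition(peer, false); err != nil {
		log.Fatal(err)
	}
	log.Printf("partition healed")
}
```

Because only the network path between the chosen nodes is degraded, a syncdata call that takes 20s points at load on the disk itself (e.g. from heavy writes) rather than at the injected fault.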

This failed in a similar way to #136081 as well (both are network partitions under leader leases that resulted in disk stalls).