cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: failover/chaos/read-write failed #125681

Closed cockroach-teamcity closed 1 month ago

cockroach-teamcity commented 2 months ago

roachtest.failover/chaos/read-write failed with artifacts on release-24.1 @ cdc755b976cc21251589de157e8180189c061f1b:

(disk_stall.go:479).Setup: full command output in run_125038.271143008_n1-10_echo-0-sudo-blockdev.log: COMMAND_PROBLEM: exit status 1
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-write/run_1

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

Jira issue: CRDB-39555

arulajmani commented 2 months ago

Failed with:

run_125038.271143008_n1-10_echo-0-sudo-blockdev: 12:50:38 cluster.go:2419: > echo "0 $(sudo blockdev --getsz /dev/sdb) linear /dev/sdb 0" | sudo dmsetup create data1
teamcity-15654067-1718344206-84-n10cpu2:[1 2 3 4 5 6 7 8 9 10]: echo "0 $(sudo blockdev --g...
   6:   <err> COMMAND_PROBLEM: exit status 1

Pattern matching on the failure mode, I'm surprised that this failed even though we've backported https://github.com/cockroachdb/cockroach/pull/123782 to 24.1. @itsbilal would you mind taking a look?
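
For readers unfamiliar with the failing command: it pipes a one-line device-mapper table into dmsetup create. The sketch below only builds and splits that table line, so it needs no root access or real device. The sector count is a made-up value; the real command takes it from sudo blockdev --getsz /dev/sdb, which reports the size in 512-byte sectors.

```shell
#!/bin/sh
# Hypothetical device size in 512-byte sectors (an assumption for illustration).
size_sectors=2097152

# Same shape as the line the test pipes into `dmsetup create data1`.
table="0 ${size_sectors} linear /dev/sdb 0"
echo "$table"

# Split the table into its fields:
# start sector, length in sectors, target type, backing device, offset.
set -- $table
start=$1 length=$2 target=$3 device=$4 offset=$5
echo "target=${target} device=${device} length=${length}"
```

A linear target maps the whole range one-to-one onto the backing device, so data1 here would simply be a pass-through layer over /dev/sdb.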

itsbilal commented 2 months ago

I think this is just a race in the dmsetup disk staller: umount -f /mnt/data1 || true returns immediately because of the || true, so the underlying umount may not run until after the dmsetup create data1 command.

https://github.com/cockroachdb/cockroach/blob/97de946a474bd3d0dc5ccbe40d8b9449ae14ff72/pkg/cmd/roachtest/roachtestutil/disk_stall.go#L193

We should remove the || true; I'm not sure why it's there.
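
As a rough illustration of what removing || true would change, the sketch below uses a hypothetical failing_unmount function in place of the real umount call. It is not the disk staller's code; it only shows that || true hides a non-zero exit status from whatever runs next.

```shell
#!/bin/sh
# Hypothetical stand-in for an unmount that fails (for example, target busy).
failing_unmount() { return 1; }

# With `|| true`, the failure is swallowed and the next step still runs.
next_step_ran_masked=no
{ failing_unmount || true; } && next_step_ran_masked=yes

# Without it, the failure is visible and the `&&` chain stops.
next_step_ran_strict=no
failing_unmount && next_step_ran_strict=yes
strict_status=$?

echo "masked=${next_step_ran_masked} strict=${next_step_ran_strict} status=${strict_status}"
```

In the masked form, a later step such as dmsetup create would run even though the unmount did not succeed.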

renatolabs commented 2 months ago

> returning immediately due to the || true

No context here, but a drive-by comment: || true does not make the command return immediately; a || b waits for a to return before running b.
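
A small sketch of that behavior, using a hypothetical slow_fail function that sleeps for one second and then fails. The right-hand side of || only runs once the left-hand side has exited, so the measured time includes the full sleep.

```shell
#!/bin/sh
# Hypothetical slow command: takes one second, then fails.
slow_fail() { sleep 1; return 1; }

fallback_ran=no
start=$(date +%s)
# The shell waits for slow_fail to exit before evaluating the right-hand side.
slow_fail || fallback_ran=yes
end=$(date +%s)
elapsed=$((end - start))

echo "elapsed=${elapsed}s fallback_ran=${fallback_ran}"
```

The || operator is sequential; running something in the background would need an explicit & instead.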

itsbilal commented 2 months ago

@renatolabs that's correct, thanks for correcting that.

github-actions[bot] commented 1 month ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.