Open cockroach-teamcity opened 1 month ago
The following disk stall caused the problem
13:42:42 disk_stall.go:128: test status: pausing 18.999452787s before next simulated disk stall on n1
13:43:01 cluster.go:2369: running cmd sudo dmsetup suspend --nofl...
on nodes [:1]; details in run_134301.673602484_n1_sudo-dmsetup-suspend.log
13:43:32 cluster.go:2369: running cmd sudo dmsetup resume data1
on nodes [:1]; details in run_134332.310261019_n1_sudo-dmsetup-resume-.log
13:43:42 disk_stall.go:128: test status: pausing 9m18.999550257s before next simulated disk stall on n1
The pmax of observed WAL fsync latency is not unexpectedly high.
Unlike https://github.com/cockroachdb/cockroach/issues/122364, this is not encrypted FS. KV p99 is 10+s and n1 lost its leases:
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:
(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
See: roachtest README
See: How To Investigate (internal)
See: Grafana
The p99 goroutine scheduling latency is low, and the number of runnable goroutines is also low, but we have slot exhaustion. Which must be a side-effect of request processing being stuck. We don't have latency histograms for the storage read path, which would allow us to narrow down.
We could enhance the test to monitor the used slots on n1 and if it exceeds 500, take a goroutine dump, so we know where those goroutines are stuck.
Never mind. We have a dump.
There are > 2500 goroutines stuck waiting on [1]
<-v.loaded
in tableCacheShard
, so they are waiting for the cache value (*Reader
etc.) to be loaded from disk.
There are 3 goroutines that are trying to do the load
goroutine 2557115 [runnable]:
runtime/pprof.Do({0x7d2ee08?, 0xca9b180?}, {{0xc00138ed00?, 0x0?, 0x0?}}, 0xc00a4d82e8)
GOROOT/src/runtime/pprof/runtime.go:52 +0xad
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x603b?}, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:931 +0x43c
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c
Which is this code https://github.com/cockroachdb/pebble/blob/c077435aef13d640fc599ad01d10455f68510d1a/table_cache.go#L931-L933
But they are in the runtime package here https://github.com/golang/go/blob/master/src/runtime/pprof/runtime.go#L47-L52 which (since this is at the closing brace of the function) looks like defer SetGoroutineLabels(ctx)
, which calls into https://github.com/golang/go/blob/master/src/runtime/proflabel.go#L12-L35. Unclear why this should be stuck here.
[1]
goroutine 2942846 [runnable]:
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x6a03?}, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:862 +0x10f
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c
github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc0036446e0, {0x7d30150, 0xc0e3361a70}, 0xc03fc99a00, 0xc0e301b170, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:505 +0x97
github.com/cockroachdb/pebble.(*tableCacheContainer).newIters(0x7f82ead0a064?, {0x7d30150?, 0xc0e3361a70?}, 0x115a6ca?, 0xc00f372b40?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:219 +0x9e
github.com/cockroachdb/pebble.tableNewIters.TODO(0xd?, {0x7d30150?, 0xc0e3361a70?}, 0x125b0e5?, 0xc0e301b2d8?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}})
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:90 +0x6e
github.com/cockroachdb/pebble.(*levelIter).loadFile(0xc0e301b108, 0xc03fc99a00, 0x1)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:647 +0x2d1
github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE(0xc0e301b108, {0xc0a5d23140, 0xd, 0xd}, {0xc0a5d23670, 0xd, 0xd}, 0x0)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:725 +0xb7
github.com/cockroachdb/pebble.(*mergingIter).seekGE(0xc0e301ac00, {0xc0a5d23670?, 0x0?, 0x17d0e7acb5255d37?}, 0x0?, 0x0)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1035 +0xf1
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGEStrict(0xc0e301ac00, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0x1000121ebcf?, 0x7f80187fa960?}, 0x10?)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1113 +0x65
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE(0xc01993b008?, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0xc008c0e9b8?, 0x4b6789?}, 0xd?)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1104 +0x1d
github.com/cockroachdb/pebble.(*lazyCombinedIter).SeekPrefixGE(0xc0e301aa68, {0xc0a5d23140?, 0x0?, 0xd?}, {0xc0a5d23670, 0xd, 0xd}, 0x10?)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/range_keys.go:617 +0x57
github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE(0xc0e301a608, {0xc0a5d23670, 0xd, 0xd})
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/iterator.go:1492 +0x564
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekGE(0xc08ae59858, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}})
github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:373 +0xc5
github.com/cockroachdb/cockroach/pkg/storage.mvccGetMetadata({0x7db7b60, 0xc08ae59858}, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}}, 0xc0e30db440)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1656 +0xa5
github.com/cockroachdb/cockroach/pkg/storage.mvccPutInternal({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2277 +0x3b4
github.com/cockroachdb/cockroach/pkg/storage.mvccPutUsingIter({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2032 +0x1f6
github.com/cockroachdb/cockroach/pkg/storage.MVCCPut({_, _}, {_, _}, {_, _, _}, {_, _}, {{0xc0e3428900, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1939 +0x494
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.Put({_, _}, {_, _}, {{0x7dd6d08, 0xc0e33070c0}, {{0x17d0e7afce337fe9, 0x0}, 0x0, {0x17d0e7afce378f6f, ...}, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_put.go:74 +0x391
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand({_, _}, {_, _}, {_, _}, _, _, {{0x17d0e7afce337fe9, 0x0}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:488 +0x3a9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch({_, _}, {_, _}, {_, _}, {_, _}, _, 0xc0e33698c0, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:305 +0xa5d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:736 +0x1d6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:703 +0x359
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluate1PC(_, {_, _}, {_, _}, _, _, _)
The test failure in https://github.com/cockroachdb/cockroach/issues/124399#issuecomment-2120569707 does not have a goroutine dump from the stall corresponding to the failure. I have not yet looked at the metrics.
14:24:38 test_runner.go:1098: ##teamcity[testFailed name='disk-stalled/wal-failover/among-stores' details='(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z|
I am going to remove the release blocker label since this is a rare case where WAL failover is not mitigating SQL-level latency, but narrowly speaking, the observed high latency is on the read path, so WAL failover itself is working. We will continue investigating, but this is not a release blocker since the same would occur on a block cache + page cache miss for a read.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 9cbd031ecc99039507957a6bbc273a4da6775397:
(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.166684829s at 2024-05-28T14:05:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
See: roachtest README
See: How To Investigate (internal)
See: Grafana
https://github.com/cockroachdb/cockroach/issues/124399#issuecomment-2135341288 does not have a goroutine dump from the time of the failure.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 0d7110a85b5bbf0eb68bafd15abca076948b434e:
(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.010474919s at 2024-06-12T14:08:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
See: roachtest README
See: How To Investigate (internal)
See: Grafana
- #124977 roachtest: disk-stalled/wal-failover/among-stores failed [C-test-failure O-roachtest O-robot T-storage branch-release-24.1.1-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 35c3e0c133c56c91399e730f49bc488cf6dd4c11:
(disk_stall.go:173).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.193619958s at 2024-06-16T13:43:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
See: roachtest README
See: How To Investigate (internal)
See: Grafana
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 35c3e0c133c56c91399e730f49bc488cf6dd4c11:
(disk_stall.go:173).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.186300147s at 2024-06-17T12:44:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
See: roachtest README
See: How To Investigate (internal)
See: Grafana
- #125773 roachtest: disk-stalled/wal-failover/among-stores failed [C-test-failure O-roachtest O-robot T-storage branch-release-24.1.1-rc release-blocker]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
/cc @cockroachdb/storageThis test on roachdash | Improve this report!
Jira issue: CRDB-38872