cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

roachtest: disk-stalled/wal-failover/among-stores failed #124399

Open cockroach-teamcity opened 1 month ago

cockroach-teamcity commented 1 month ago

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.214567s at 2024-05-19T13:43:00Z
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-38872

sumeerbhola commented 1 month ago

The following disk stall caused the problem:

13:42:42 disk_stall.go:128: test status: pausing 18.999452787s before next simulated disk stall on n1
13:43:01 cluster.go:2369: running cmd sudo dmsetup suspend --nofl... on nodes [:1]; details in run_134301.673602484_n1_sudo-dmsetup-suspend.log
13:43:32 cluster.go:2369: running cmd sudo dmsetup resume data1 on nodes [:1]; details in run_134332.310261019_n1_sudo-dmsetup-resume-.log
13:43:42 disk_stall.go:128: test status: pausing 9m18.999550257s before next simulated disk stall on n1
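For reference, the stall cycle visible in the log is just a dmsetup suspend of the store's volume followed ~30s later by a resume. Below is a minimal standalone sketch of that cycle (illustrative only, not the actual roachtest helper; the data1 device name comes from the log above, and --noflush is an assumption for the truncated --nofl... flag):

```go
// Minimal sketch of the simulated disk stall cycle on n1, based on the log
// above: suspend the data1 device-mapper volume, hold the stall for ~30s,
// then resume it. Illustrative only; not the actual roachtest code.
package main

import (
	"log"
	"os/exec"
	"time"
)

// run executes a command via sudo and fails fast on error.
func run(args ...string) {
	if out, err := exec.Command("sudo", args...).CombinedOutput(); err != nil {
		log.Fatalf("%v: %v\n%s", args, err, out)
	}
}

func main() {
	// Stall all I/O on the store's volume. --noflush (assumed to be the
	// truncated --nofl... flag in the log) suspends without flushing first.
	run("dmsetup", "suspend", "--noflush", "data1")

	// Hold the stall; the suspend/resume pair in the failed run is ~30s apart.
	time.Sleep(30 * time.Second)

	// Unstall the device.
	run("dmsetup", "resume", "data1")
}
```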

The pmax of observed WAL fsync latency is not unexpectedly high.

[Screenshot 2024-05-19 at 5:11:26 PM]

Unlike https://github.com/cockroachdb/cockroach/issues/122364, this is not an encrypted FS. The KV p99 is 10+ s and n1 lost its leases:

[Screenshot 2024-05-19 at 5:14:44 PM]
[Screenshot 2024-05-19 at 5:15:52 PM]

cockroach-teamcity commented 1 month ago

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

sumeerbhola commented 1 month ago

The p99 goroutine scheduling latency is low, and the number of runnable goroutines is also low, but we have slot exhaustion, which must be a side effect of request processing being stuck. We don't have latency histograms for the storage read path, which would allow us to narrow this down further.

[Screenshot 2024-05-21 at 1:05:16 PM]
[Screenshot 2024-05-21 at 1:05:53 PM]
[Screenshot 2024-05-21 at 1:06:03 PM]

sumeerbhola commented 1 month ago

We could enhance the test to monitor the used slots on n1 and, if the count exceeds 500, take a goroutine dump so we know where those goroutines are stuck.
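A minimal sketch of what such a monitor could look like: poll the used-slots count and pull a goroutine dump from the node's debug pprof endpoint once it crosses 500. The metric-scraping helper and the admin host/port below are placeholders, not real roachtest APIs:

```go
// Sketch of the proposed enhancement: watch the "used slots" on n1 and
// capture a goroutine dump when it exceeds 500, so we can see where the
// stuck goroutines are waiting. Illustrative only; not real roachtest code.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

const slotThreshold = 500

// usedSlotsOnN1 is a placeholder for reading the used-slots gauge on n1
// (e.g. by scraping and parsing the node's /_status/vars output).
func usedSlotsOnN1() (int, error) {
	// ... scrape the metric here ...
	return 0, nil
}

func main() {
	for range time.Tick(10 * time.Second) {
		n, err := usedSlotsOnN1()
		if err != nil {
			log.Printf("scraping used slots: %v", err)
			continue
		}
		if n <= slotThreshold {
			continue
		}
		// Slot exhaustion: pull a full goroutine dump from n1's debug
		// endpoint (adjust host/port for the cluster under test).
		resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
		if err != nil {
			log.Fatalf("fetching goroutine dump: %v", err)
		}
		defer resp.Body.Close()
		f, err := os.Create(fmt.Sprintf("goroutine_dump_%d.txt", time.Now().Unix()))
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		if _, err := io.Copy(f, resp.Body); err != nil {
			log.Fatal(err)
		}
		return
	}
}
```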

sumeerbhola commented 1 month ago

Never mind. We have a dump.

sumeerbhola commented 1 month ago

There are > 2500 goroutines stuck waiting on [1]

<-v.loaded

in tableCacheShard, so they are waiting for the cache value (*Reader, etc.) to be loaded from disk (a sketch of this wait-for-load pattern follows the stacks below). There are 3 goroutines that are trying to do the load:

goroutine 2557115 [runnable]:
runtime/pprof.Do({0x7d2ee08?, 0xca9b180?}, {{0xc00138ed00?, 0x0?, 0x0?}}, 0xc00a4d82e8)
    GOROOT/src/runtime/pprof/runtime.go:52 +0xad
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x603b?}, 0xc00175b958)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:931 +0x43c
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c

That corresponds to this code: https://github.com/cockroachdb/pebble/blob/c077435aef13d640fc599ad01d10455f68510d1a/table_cache.go#L931-L933

But they are in the runtime package here: https://github.com/golang/go/blob/master/src/runtime/pprof/runtime.go#L47-L52, which (since this is at the closing brace of the function) looks like the deferred SetGoroutineLabels(ctx), which calls into https://github.com/golang/go/blob/master/src/runtime/proflabel.go#L12-L35. It is unclear why this should be stuck here. [1]

goroutine 2942846 [runnable]:
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x6a03?}, 0xc00175b958)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:862 +0x10f
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c
github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc0036446e0, {0x7d30150, 0xc0e3361a70}, 0xc03fc99a00, 0xc0e301b170, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:505 +0x97
github.com/cockroachdb/pebble.(*tableCacheContainer).newIters(0x7f82ead0a064?, {0x7d30150?, 0xc0e3361a70?}, 0x115a6ca?, 0xc00f372b40?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:219 +0x9e
github.com/cockroachdb/pebble.tableNewIters.TODO(0xd?, {0x7d30150?, 0xc0e3361a70?}, 0x125b0e5?, 0xc0e301b2d8?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}})
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:90 +0x6e
github.com/cockroachdb/pebble.(*levelIter).loadFile(0xc0e301b108, 0xc03fc99a00, 0x1)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:647 +0x2d1
github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE(0xc0e301b108, {0xc0a5d23140, 0xd, 0xd}, {0xc0a5d23670, 0xd, 0xd}, 0x0)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:725 +0xb7
github.com/cockroachdb/pebble.(*mergingIter).seekGE(0xc0e301ac00, {0xc0a5d23670?, 0x0?, 0x17d0e7acb5255d37?}, 0x0?, 0x0)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1035 +0xf1
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGEStrict(0xc0e301ac00, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0x1000121ebcf?, 0x7f80187fa960?}, 0x10?)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1113 +0x65
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE(0xc01993b008?, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0xc008c0e9b8?, 0x4b6789?}, 0xd?)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1104 +0x1d
github.com/cockroachdb/pebble.(*lazyCombinedIter).SeekPrefixGE(0xc0e301aa68, {0xc0a5d23140?, 0x0?, 0xd?}, {0xc0a5d23670, 0xd, 0xd}, 0x10?)
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/range_keys.go:617 +0x57
github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE(0xc0e301a608, {0xc0a5d23670, 0xd, 0xd})
    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/iterator.go:1492 +0x564
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekGE(0xc08ae59858, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}})
    github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:373 +0xc5
github.com/cockroachdb/cockroach/pkg/storage.mvccGetMetadata({0x7db7b60, 0xc08ae59858}, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}}, 0xc0e30db440)
    github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1656 +0xa5
github.com/cockroachdb/cockroach/pkg/storage.mvccPutInternal({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
    github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2277 +0x3b4
github.com/cockroachdb/cockroach/pkg/storage.mvccPutUsingIter({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
    github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2032 +0x1f6
github.com/cockroachdb/cockroach/pkg/storage.MVCCPut({_, _}, {_, _}, {_, _, _}, {_, _}, {{0xc0e3428900, ...}, ...}, ...)
    github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1939 +0x494
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.Put({_, _}, {_, _}, {{0x7dd6d08, 0xc0e33070c0}, {{0x17d0e7afce337fe9, 0x0}, 0x0, {0x17d0e7afce378f6f, ...}, ...}, ...}, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_put.go:74 +0x391
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand({_, _}, {_, _}, {_, _}, _, _, {{0x17d0e7afce337fe9, 0x0}, ...}, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:488 +0x3a9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch({_, _}, {_, _}, {_, _}, {_, _}, _, 0xc0e33698c0, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:305 +0xa5d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:736 +0x1d6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
    github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:703 +0x359
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluate1PC(_, {_, _}, {_, _}, _, _, _)
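
For context on what the >2500 waiters are blocked on: the shape of the code is the usual one-loader/many-waiters pattern, where the first caller performs the load (under pprof.Do, which is why the load stack above passes through runtime/pprof) and closes a channel that every other caller is parked on. A simplified sketch under those assumptions, not the actual Pebble implementation:

```go
// Simplified illustration of the wait-for-load pattern the stuck goroutines
// are parked on: the first caller loads the value and closes a channel; every
// other caller blocks on <-loaded until the load finishes. If the load itself
// stalls (e.g. the disk is suspended), all waiters stall with it.
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

type value struct {
	loaded chan struct{} // closed once the load completes
	data   string
	err    error
}

type cache struct {
	mu sync.Mutex
	m  map[int]*value
}

func (c *cache) get(ctx context.Context, key int) (*value, error) {
	c.mu.Lock()
	if v, ok := c.m[key]; ok {
		c.mu.Unlock()
		<-v.loaded // many goroutines can block here waiting on a single load
		return v, v.err
	}
	v := &value{loaded: make(chan struct{})}
	c.m[key] = v
	c.mu.Unlock()

	// The loading goroutine does the slow work (the disk read). It is labeled
	// via pprof.Do, similar to the load stack above.
	pprof.Do(ctx, pprof.Labels("op", "table-load"), func(context.Context) {
		time.Sleep(100 * time.Millisecond) // stand-in for reading the sstable
		v.data = fmt.Sprintf("contents of table %d", key)
	})
	close(v.loaded) // unblocks all waiters
	return v, v.err
}

func main() {
	c := &cache{m: make(map[int]*value)}
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _ := c.get(context.Background(), 7)
			fmt.Println(v.data)
		}()
	}
	wg.Wait()
}
```
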
sumeerbhola commented 1 month ago

Here is the dump: goroutine_dump.2024-05-19T13_43_12.731.double_since_last_dump.000007660.txt.gz

sumeerbhola commented 1 month ago

The test failure in https://github.com/cockroachdb/cockroach/issues/124399#issuecomment-2120569707 does not have a goroutine dump from the stall corresponding to the failure. I have not yet looked at the metrics.

14:24:38 test_runner.go:1098: ##teamcity[testFailed name='disk-stalled/wal-failover/among-stores' details='(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z|

sumeerbhola commented 1 month ago

I am going to remove the release-blocker label. This is a rare case where WAL failover is not mitigating SQL-level latency, but, narrowly speaking, the observed high latency is on the read path, so WAL failover itself is working. We will continue investigating, but this is not a release blocker since the same behavior would occur on a block cache + page cache miss for a read.

cockroach-teamcity commented 1 month ago

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 9cbd031ecc99039507957a6bbc273a4da6775397:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.166684829s at 2024-05-28T14:05:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

sumeerbhola commented 4 weeks ago

https://github.com/cockroachdb/cockroach/issues/124399#issuecomment-2135341288 does not have a goroutine dump from the time of the failure.

cockroach-teamcity commented 3 weeks ago

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 0d7110a85b5bbf0eb68bafd15abca076948b434e:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.010474919s at 2024-06-12T14:08:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #124977 roachtest: disk-stalled/wal-failover/among-stores failed [C-test-failure O-roachtest O-robot T-storage branch-release-24.1.1-rc]

This test on roachdash | Improve this report!

cockroach-teamcity commented 2 weeks ago

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 35c3e0c133c56c91399e730f49bc488cf6dd4c11:

(disk_stall.go:173).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.193619958s at 2024-06-16T13:43:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity commented 2 weeks ago

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 35c3e0c133c56c91399e730f49bc488cf6dd4c11:

(disk_stall.go:173).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.186300147s at 2024-06-17T12:44:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #125773 roachtest: disk-stalled/wal-failover/among-stores failed [C-test-failure O-roachtest O-robot T-storage branch-release-24.1.1-rc release-blocker]

This test on roachdash | Improve this report!