roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 34caf9d2c267f95f6aef2713708806e6f14948d6:
(test_runner.go:1179).runTest: test timed out (30h0m0s)
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
There's smoke here. The number of retried AddSSTable requests seems very large:
❯ grep "restore" logs/*.unredacted/cockroach.teamcity*.log | grep "cannot be added spanning range bounds" | wc -l
50839
If we assume each retried SST is 16 MB, our target SST size, then 16 MB * 50839 ≈ 0.85 TB of data. I'm not sure if that's going to hamper things.
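A standalone sketch of that estimate (my own back-of-the-envelope code, not project code; it assumes every retried request carried a full target-size SST, so it's an upper bound):

```go
package main

import "fmt"

func main() {
	// Upper bound: assume every retried AddSSTable request carried a
	// full target-size (16 MiB) SST.
	const retried = 50839            // grep count above
	const sstBytes = int64(16 << 20) // 16 MiB target SST size
	total := retried * sstBytes
	fmt.Printf("~%.2f TB retried\n", float64(total)/1e12) // ~0.85 TB
}
```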
The distsql flow retried twice, six hours apart, due to disk budget monitoring, in the middle of the restore (the first restart occurred about 10 hours into the job). These were the only retryable errors recorded:
logs/1.unredacted/cockroach.teamcity-14417981-1710566964-05-n15cpu16-0001.ubuntu.2024-03-17T00_03_35Z.004276.log:W240317 01:35:46.332888 53420 ccl/backupccl/restore_job.go:211 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=951973372424585217›] 88526 encountered retryable error: importing 770240 ranges: running distributed restore: running distSQL flow: reading restore span entries: this query requires additional disk space: flow-disk-monitor: disk budget exceeded: 1048576 bytes requested, 34359738368 currently allocated, 0 bytes in budget
logs/1.unredacted/cockroach.teamcity-14417981-1710566964-05-n15cpu16-0001.ubuntu.2024-03-17T06_55_46Z.004276.log:W240317 12:00:48.460897 53420 ccl/backupccl/restore_job.go:211 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=951973372424585217›] 142504 encountered retryable error: importing 628183 ranges: running distributed restore: running distSQL flow: reading restore span entries: this query requires additional disk space: flow-disk-monitor: disk budget exceeded: 1048576 bytes requested, 34359738368 currently allocated, 0 bytes in budget
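Decoding those numbers: 34359738368 bytes is a fully allocated 32 GiB budget rejecting a 1 MiB (1048576-byte) request. A minimal sketch of that accounting, with illustrative names rather than CockroachDB's actual mon.BytesMonitor API:

```go
package main

import "fmt"

// budgetMonitor is an illustrative stand-in for a monitor with a hard
// byte budget (hypothetical; not CockroachDB's actual mon.BytesMonitor).
type budgetMonitor struct {
	name      string
	allocated int64
	budget    int64
}

// grow mirrors the error shape in the logs above: requested bytes,
// currently allocated bytes, and bytes remaining in the budget.
func (m *budgetMonitor) grow(n int64) error {
	if m.allocated+n > m.budget {
		return fmt.Errorf("%s: disk budget exceeded: %d bytes requested, %d currently allocated, %d bytes in budget",
			m.name, n, m.allocated, m.budget-m.allocated)
	}
	m.allocated += n
	return nil
}

func main() {
	// From the logs: a 32 GiB budget is fully allocated when a
	// 1 MiB request arrives.
	m := &budgetMonitor{name: "flow-disk-monitor", allocated: 32 << 30, budget: 32 << 30}
	if err := m.grow(1 << 20); err != nil {
		fmt.Println(err)
	}
}
```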
Small note: our target SST size is 16 MB.
Ah, sorry: 384 MB is our target restore span entry size. Hrm. So maybe this retried SST thing isn't the problem; we're only retrying ~1 TB of logical data.
I have a theory for this regression: after #119840 landed, a restore with 400 layers and a 200-file cap per restore span entry will create smaller restore spans that never split on size. Note that the 200-file cap is a soft cap: if the base span is [a-d), we still put all incremental files that intersect [a-d) into that span entry; we just never extend the span. Running this roachtest on the commit before the 200-file cap landed confirms that PR is responsible for this regression.
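A sketch of that soft-cap behavior (hypothetical types, not the actual backupccl span-generation code): under the cap an entry's span may grow; once the cap is hit the span is frozen, but every intersecting incremental file must still land in the entry, so with 400 layers an entry can hold far more than 200 files while its span never widens or splits.

```go
package main

import "fmt"

// file is a hypothetical stand-in for one backup SST's key span.
type file struct {
	start, end string // covers [start, end)
}

// spanEntry is a hypothetical restore span entry.
type spanEntry struct {
	start, end string
	files      []file
}

const fileCap = 200 // soft cap on files per restore span entry

// addFile: under the cap we may extend the entry's span; at the cap the
// span is frozen, but any file intersecting [start, end) still lands in
// the entry, so file counts can blow past the cap.
func (e *spanEntry) addFile(f file) {
	if len(e.files) < fileCap && f.end > e.end {
		e.end = f.end
	}
	if f.start < e.end && f.end > e.start {
		e.files = append(e.files, f)
	}
}

func main() {
	e := &spanEntry{start: "a", end: "d"}
	for i := 0; i < 400; i++ { // one intersecting incremental file per layer
		e.addFile(file{start: "a", end: "d"})
	}
	fmt.Println(len(e.files), "files in one frozen span entry") // 400
}
```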
roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 4a9385cacb82e7a8d6d37e5d9a26a6b7c845aab6:
(test_runner.go:1185).runTest: test timed out (30h0m0s)
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
Relabelling this as a GA blocker, as the bug exists on 23.1 through 24.1. It only affects a corner case of restores: restoring from a chain of more than 200 incremental backups.
roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 2a5e231716c436781f12452d800651f51c6383b7:
(test_runner.go:1185).runTest: test timed out (30h0m0s)
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
I still need to backport this to 23.1: https://github.com/cockroachdb/cockroach/pull/121804
roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 8a97a5edd98336e2dd04ef12f08628fba84b17dd:
(monitor.go:154).Wait: monitor failure: read tcp 172.17.0.3:46056 -> 18.218.200.34:26257: read: connection reset by peer
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
Closing; no longer going to backport to 23.1.
roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 72646a555214c0705781e440b9df585d5eea9511:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
/cc @cockroachdb/disaster-recovery
Jira issue: CRDB-36534