cockroachdb / cockroach


roachtest: restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed #120186

Closed: cockroach-teamcity closed this issue 2 months ago

cockroach-teamcity commented 7 months ago

roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 72646a555214c0705781e440b9df585d5eea9511:

(test_runner.go:1161).runTest: test timed out (30h0m0s)
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-36534

cockroach-teamcity commented 7 months ago

roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 34caf9d2c267f95f6aef2713708806e6f14948d6:

(test_runner.go:1179).runTest: test timed out (30h0m0s)
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

msbutler commented 7 months ago

There's smoke here. The number of retried SSTable requests seems very large:

❯ grep "restore" logs/*.unredacted/cockroach.teamcity*.log | grep "cannot be added spanning range bounds" | wc -l
   50839

If we assume each retried sst is 384 MB, our target sst size, 384 MB * 50839 ≈ 19 TB of data. I'm not sure if that's going to hamper things.

msbutler commented 7 months ago

The distsql flow retried twice, about 6 hours apart, due to the flow disk monitor exceeding its budget, in the middle of the restore (the first restart occurred about 10 hours into the job). These were the only retryable errors recorded.

logs/1.unredacted/cockroach.teamcity-14417981-1710566964-05-n15cpu16-0001.ubuntu.2024-03-17T00_03_35Z.004276.log:W240317 01:35:46.332888 53420 ccl/backupccl/restore_job.go:211 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=951973372424585217›] 88526  encountered retryable error: importing 770240 ranges: running distributed restore: running distSQL flow: reading restore span entries: this query requires additional disk space: flow-disk-monitor: disk budget exceeded: 1048576 bytes requested, 34359738368 currently allocated, 0 bytes in budget
logs/1.unredacted/cockroach.teamcity-14417981-1710566964-05-n15cpu16-0001.ubuntu.2024-03-17T06_55_46Z.004276.log:W240317 12:00:48.460897 53420 ccl/backupccl/restore_job.go:211 ⋮ [T1,Vsystem,n1,job=‹RESTORE id=951973372424585217›] 142504  encountered retryable error: importing 628183 ranges: running distributed restore: running distSQL flow: reading restore span entries: this query requires additional disk space: flow-disk-monitor: disk budget exceeded: 1048576 bytes requested, 34359738368 currently allocated, 0 bytes in budget
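
For context, the numbers in those errors: 1048576 bytes is 1 MiB and 34359738368 bytes is 32 GiB, so the flow had already reserved its entire 32 GiB disk budget and had 0 bytes remaining when the next 1 MiB reservation came in. A minimal sketch of that budget-accounting pattern, with hypothetical names rather than CockroachDB's actual mon package API:

package main

import "fmt"

// budgetMonitor is an illustrative stand-in for a per-flow byte budget
// (hypothetical names; the real accounting lives in pkg/util/mon). It tracks
// bytes reserved against a fixed limit and rejects requests that would
// exceed it.
type budgetMonitor struct {
	limit     int64 // total budget in bytes
	allocated int64 // bytes currently reserved
}

// grow reserves n more bytes, or fails once the budget is exhausted,
// mirroring the shape of the "disk budget exceeded" error in the logs above.
func (m *budgetMonitor) grow(n int64) error {
	if m.allocated+n > m.limit {
		return fmt.Errorf(
			"disk budget exceeded: %d bytes requested, %d currently allocated, %d bytes in budget",
			n, m.allocated, m.limit-m.allocated)
	}
	m.allocated += n
	return nil
}

func main() {
	// 32 GiB budget (34359738368 bytes), already fully reserved.
	m := &budgetMonitor{limit: 32 << 30, allocated: 32 << 30}
	// The next 1 MiB (1048576 bytes) reservation fails with an error shaped
	// like the one logged by the restore job above.
	if err := m.grow(1 << 20); err != nil {
		fmt.Println(err)
	}
}
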
stevendanna commented 7 months ago

Small note: our target SST size is 16MB.

msbutler commented 7 months ago

Ah, sorry, 384 MB is our target restore span entry size. Hrm. So maybe this retried sst thing isn't the problem: at 16 MB per SST we're only retrying ~1 TB of logical data.
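
Back-of-the-envelope with the corrected SST size: 50,839 retried SSTs × 16 MiB ≈ 813,000 MiB ≈ 0.8 TiB, i.e. only a couple percent of the 32 TB being restored, so the retried data volume alone probably doesn't explain the 30-hour timeout.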

msbutler commented 7 months ago

I have a theory for this regression: after #119840 landed, a restore with 400 incremental layers and a 200-file cap per restore span entry will produce smaller restore spans that never split on size. Note that the 200-file cap is a soft cap: if the base span is [a-d), we still put every incremental file that intersects [a-d) into that span entry; we just never extend the span past that point. Running this roachtest on the commit before the 200-file cap landed confirms that PR is responsible for the regression. A rough sketch of the behavior is below.
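
To make the soft cap concrete, here is a toy sketch of the span-entry generation described above, using hypothetical types and helper names rather than the actual pkg/ccl/backupccl code: an entry stops extending its key span once it holds the cap's worth of base files, but every incremental file overlapping the frozen span is still attached, so with hundreds of incremental layers the entries stay narrow yet file-heavy and never reach a size that would trigger a split.

package main

import "fmt"

type span struct{ start, end string }

type backupFile struct {
	sp span
}

type restoreSpanEntry struct {
	sp    span
	files []backupFile
}

// overlaps reports whether two half-open key spans intersect.
func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

// makeEntries walks base-backup files in key order, opening a new entry once
// the current one has reached maxFilesPerEntry (the soft cap), then attaches
// every incremental file that intersects each entry's now-frozen span.
func makeEntries(base, incs []backupFile, maxFilesPerEntry int) []restoreSpanEntry {
	var entries []restoreSpanEntry
	for _, f := range base {
		n := len(entries)
		if n == 0 || len(entries[n-1].files) >= maxFilesPerEntry {
			// Cap reached: stop extending the previous entry's span and
			// start a new, narrow one.
			entries = append(entries, restoreSpanEntry{sp: f.sp})
			n = len(entries)
		}
		e := &entries[n-1]
		if f.sp.end > e.sp.end {
			e.sp.end = f.sp.end // extend only while under the cap
		}
		e.files = append(e.files, f)
	}
	// The soft part of the cap: each frozen span still picks up every
	// overlapping incremental file, however many layers there are.
	for i := range entries {
		for _, f := range incs {
			if overlaps(entries[i].sp, f.sp) {
				entries[i].files = append(entries[i].files, f)
			}
		}
	}
	return entries
}

func main() {
	base := []backupFile{{span{"a", "b"}}, {span{"b", "c"}}, {span{"c", "d"}}}
	// Two wide incremental layers; the failing test has ~400.
	incs := []backupFile{{span{"a", "d"}}, {span{"a", "d"}}}
	// A cap of 1 base file per entry stands in for the real cap of 200.
	for _, e := range makeEntries(base, incs, 1) {
		fmt.Printf("span [%s-%s): %d files\n", e.sp.start, e.sp.end, len(e.files))
	}
}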

cockroach-teamcity commented 7 months ago

roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 4a9385cacb82e7a8d6d37e5d9a26a6b7c845aab6:

(test_runner.go:1185).runTest: test timed out (30h0m0s)
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

msbutler commented 7 months ago

Relabelling this as a GA blocker, as the bug exists on 23.1 through 24.1. The bug only affects a corner case of restores: restoring from a backup chain with more than 200 incremental backups.

cockroach-teamcity commented 7 months ago

roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 2a5e231716c436781f12452d800651f51c6383b7:

(test_runner.go:1185).runTest: test timed out (30h0m0s)
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

msbutler commented 5 months ago

I still need to backport this to 23.1: https://github.com/cockroachdb/cockroach/pull/121804

cockroach-teamcity commented 4 months ago

roachtest.restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16 failed with artifacts on master @ 8a97a5edd98336e2dd04ef12f08628fba84b17dd:

(monitor.go:154).Wait: monitor failure: read tcp 172.17.0.3:46056 -> 18.218.200.34:26257: read: connection reset by peer
test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/inc-count=400/nodes=15/cpus=16/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

msbutler commented 2 months ago

Closing: no longer going to backport to 23.1.