roachtest: backup/mvcc-range-tombstones failed: slow delete range requests

cockroach-teamcity commented 1 year ago

roachtest.backup/mvcc-range-tombstones failed with artifacts on release-23.2 @ 56fd045302ed2edc14831fb001ef77486ae8f62a:

(assertions.go:333).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:788
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:948
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:705
                                main/pkg/cmd/roachtest/test_runner.go:1090
                                src/runtime/asm_amd64.s:1598
    Error:          Received unexpected error:
                    expected running status waiting for MVCC GC, but got deleting data
                    (1) attached stack trace
                      -- stack trace:
                      | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runBackupMVCCRangeTombstones.func1.1
                      |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:806
                      | github.com/cockroachdb/cockroach/pkg/util/retry.Options.Do
                      |     github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:172
                      | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runBackupMVCCRangeTombstones.func1
                      |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:788
                      | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runBackupMVCCRangeTombstones
                      |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:948
                      | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerBackup.func4
                      |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:705
                      | main.(*testRunner).runTest.func2
                      |     main/pkg/cmd/roachtest/test_runner.go:1090
                      | runtime.goexit
                      |     src/runtime/asm_amd64.s:1598
                    Wraps: (2) expected running status waiting for MVCC GC, but got deleting data
                    Error types: (1) *withstack.withStack (2) *errutil.leafError
    Test:           backup/mvcc-range-tombstones
(require.go:1360).NoError: FailNow called
test artifacts and logs in: /artifacts/backup/mvcc-range-tombstones/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_metamorphicBuild=true , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12543086/backup-mvcc-range-tombstones/1699353144432/1699357788246)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-33277

msbutler commented 1 year ago

This roachtest failed because the RunningStatusDeletingData phase of the gc job took longer than 2 minutes, as the test asserts.

I don't know why this occurred, but this test was run with a metamorphic constants, which could lead to all sorts of fun and unexpected behavior.

cockroach-teamcity commented 1 year ago

roachtest.backup/mvcc-range-tombstones failed with artifacts on release-23.2 @ 56fd045302ed2edc14831fb001ef77486ae8f62a:

(assertions.go:333).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:823
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:857
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:705
                                main/pkg/cmd/roachtest/test_runner.go:1090
                                src/runtime/asm_amd64.s:1598
    Error:          Received unexpected error:
                    pq: hash-fingerprint: root: memory budget exceeded: 10240 bytes requested, 15982602240 currently allocated, 15982604288 bytes in budget
    Test:           backup/mvcc-range-tombstones
(require.go:1360).NoError: FailNow called
test artifacts and logs in: /artifacts/backup/mvcc-range-tombstones/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=true , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_metamorphicBuild=true , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12550589/backup-mvcc-range-tombstones/1699386479994/1699387007354)

This test on roachdash | Improve this report!

msbutler commented 1 year ago

Ignoring the latest failure for now.

It seems that gc job did transition to the waiting for MVCC GC state after the test failure, a little over 2 minutes after the job started:

 915168050689114113  SCHEMA CHANGE GC  GC for DROP TABLE tpch.public.orders    node  {106} running waiting for MVCC GC 2023-11-07 11:45:24.84167+00  2023-11-07 11:45:24.907578+00 NULL  2023-11-07 11:47:27.361675+00 0 NU    LL    1 6376057147768591389 2023-11-07 11:45:24.907579+00 2023-11-07 11:45:54.907579+00 1 NULL  NULL

We transition to this job phase once all delete range requests have returned.

msbutler commented 1 year ago

The second failure occurs because SHOW EXPERIMENTAL fingerprints exceeded the max sql memory. Unrelated issue.

msbutler commented 1 year ago

The job status update was slow because the DeleteRange rpc was slow:

1.unredacted/cockroach.log:I231107 11:45:25.631572 313907 sql/gcjob/gc_job_utils.go:296 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6232  updated running status: ‹deleting data›
1.unredacted/cockroach.log:I231107 11:45:25.634842 313907 sql/gcjob/table_garbage_collection.go:208 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6233  deleting data for table 106
1.unredacted/cockroach.log:W231107 11:47:04.910281 315997 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6244  slow range RPC: have been waiting 60.04s (1 attempts) for RPC DeleteRange [/Table/106/2/‹3796847›/‹224773281›,/Table/106/2/‹4264621›/‹60340067›) to r176:‹/Table/106/2/{3796847/224773281-4264621/60340067}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:04.979481 316011 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6245  slow range RPC: have been waiting 60.11s (1 attempts) for RPC DeleteRange [/Table/106/2/‹11058844›/‹375203175›,/Table/106/2/‹11525662›/‹562365638›) to r165:‹/Table/106/2/11{058844/375203175-525662/562365638}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12, sticky=1699354842.793976559,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:05.219989 316019 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6246  slow range RPC: have been waiting 60.35s (1 attempts) for RPC DeleteRange [/Table/106/2/‹14528260›/‹438869093›,/Table/106/3) to r242:‹/Table/106/{2/14528260/438869093-3}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=11]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:05.948562 316013 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6247  slow range RPC: have been waiting 61.08s (1 attempts) for RPC DeleteRange [/Table/106/2/‹12020413›/‹551798245›,/Table/106/2/‹12487585›/‹141619073›) to r172:‹/Table/106/2/12{020413/551798245-487585/141619073}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=13, sticky=1699354935.801525687,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:05.952152 316018 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6248  slow range RPC: have been waiting 61.08s (1 attempts) for RPC DeleteRange [/Table/106/2/‹14061320›/‹144369665›,/Table/106/2/‹14528260›/‹438869093›) to r151:‹/Table/106/2/14{061320/144369665-528260/438869093}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=11]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:06.309310 316036 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6249  slow range RPC: have been waiting 61.44s (1 attempts) for RPC DeleteRange [/Table/106/3/‹9737›/‹544744292›,/Table/106/3/‹9816›/‹341083713›) to r173:‹/Table/106/3/9{737/544744292-816/341083713}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=13, sticky=1699354969.697987675,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:07.159145 316001 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6250  slow range RPC: have been waiting 62.29s (1 attempts) for RPC DeleteRange [/Table/106/2/‹6188089›/‹83358660›,/Table/106/2/‹6677099›/‹60018851›) to r147:‹/Table/106/2/6{188089/83358660-677099/60018851}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=11, sticky=1699354899.574934082,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:07.385102 315998 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6251  slow range RPC: have been waiting 62.51s (1 attempts) for RPC DeleteRange [/Table/106/2/‹4264621›/‹60340067›,/Table/106/2/‹4744319›/‹212707523›) to r276:‹/Table/106/2/4{264621/60340067-744319/212707523}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:07.627190 316014 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6252  slow range RPC: have been waiting 62.76s (1 attempts) for RPC DeleteRange [/Table/106/2/‹12487585›/‹141619073›,/Table/106/2/‹12982367›/‹102781126›) to r277:‹/Table/106/2/12{487585/141619073-982367/102781126}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=13]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
...

msbutler commented 1 year ago

I'm handing this off to kv to understand why these delete range RPCs took so long to return, leading to a test assertion failure. I don't think this 2 minute timeout should be bumped yet, as we've never seen this failure mode before. It's worth noting that this was a metamorphic build, so perhaps some metamorphic setting contributed to this timeout.

I also don't see any signs of hardware exhaustion.

miraradeva commented 12 months ago

Took a quick look, but I don't have much insight other than trying to repro with and without metamorphic parameters.

I did notice we had a similar issue (#106672) over the summer, which was fixed by @sumeerbhola on the Pebble side.

renatolabs commented 11 months ago

Test fails when using metamorphic cockroach builds. Note that runs were disabled (https://github.com/cockroachdb/cockroach/pull/114618) until we have a stable release.

Since there's an ongoing investigation, I'm leaving the issue open. Teams can decide if they want to continue pursuing a resolution / understanding of the failure.

miraradeva commented 11 months ago

I tried reproducing it with the metamorphic values from the most recent failure (#114320) but no luck. This is what I ran on a gce worker on release-23.2 @ c702128fa39c4aa84e810b898064fd2d66ace765:

ROACHTEST_ASSERTIONS_ENABLED_SEED=2653413680303407099 ./pkg/cmd/roachtest/roachstress.sh backup/mvcc-range-tombstones -- --count=50 --cpu-quota=1000

I think it's ok to close this. We can get back to it if it fails without the metamorphic build.

cockroachdb / cockroach

roachtest: backup/mvcc-range-tombstones failed: slow delete range requests #113927