Opened by cockroach-teamcity; closed 11 months ago.
This roachtest failed because the RunningStatusDeletingData
phase of the GC job took longer than the 2 minutes the test allows.
I don't know why this occurred, but the test was run with metamorphic constants, which can lead to all sorts of fun and unexpected behavior.
roachtest.backup/mvcc-range-tombstones failed with artifacts on release-23.2 @ 56fd045302ed2edc14831fb001ef77486ae8f62a:
(assertions.go:333).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:823
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:857
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:705
main/pkg/cmd/roachtest/test_runner.go:1090
src/runtime/asm_amd64.s:1598
Error: Received unexpected error:
pq: hash-fingerprint: root: memory budget exceeded: 10240 bytes requested, 15982602240 currently allocated, 15982604288 bytes in budget
Test: backup/mvcc-range-tombstones
(require.go:1360).NoError: FailNow called
test artifacts and logs in: /artifacts/backup/mvcc-range-tombstones/run_1
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_metamorphicBuild=true, ROACHTEST_ssd=0
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12550589/backup-mvcc-range-tombstones/1699386479994/1699387007354)
Ignoring the latest failure for now.
It seems the GC job did transition to the "waiting for MVCC GC"
state after the test failure, a little over 2 minutes after the job started:
915168050689114113 SCHEMA CHANGE GC GC for DROP TABLE tpch.public.orders node {106} running waiting for MVCC GC 2023-11-07 11:45:24.84167+00 2023-11-07 11:45:24.907578+00 NULL 2023-11-07 11:47:27.361675+00 0 NULL 1 6376057147768591389 2023-11-07 11:45:24.907579+00 2023-11-07 11:45:54.907579+00 1 NULL NULL
We transition to this job phase once all delete range requests have returned.
The second failure occurred because SHOW EXPERIMENTAL_FINGERPRINTS exceeded the maximum SQL memory budget. Unrelated issue.
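The "memory budget exceeded" error is the standard shape produced when an allocation request would push a memory accountant past its limit. This is a toy sketch in the spirit of CockroachDB's memory monitor, not its actual implementation; the `budget` type and its fields are illustrative.

```go
package main

import "fmt"

// budget is a simplified memory accountant: Grow fails once the new
// allocation would exceed the configured limit, producing an error like
// the "memory budget exceeded" failure quoted above.
type budget struct {
	limit int64 // total bytes available
	used  int64 // bytes currently allocated
}

func (b *budget) Grow(n int64) error {
	if b.used+n > b.limit {
		return fmt.Errorf("memory budget exceeded: %d bytes requested, %d currently allocated, %d bytes in budget",
			n, b.used, b.limit)
	}
	b.used += n
	return nil
}
```

In the failure above, ~15.98 GiB of the budget was already allocated, so even a 10 KiB request tipped it over.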
The job status update was slow because the DeleteRange rpc was slow:
1.unredacted/cockroach.log:I231107 11:45:25.631572 313907 sql/gcjob/gc_job_utils.go:296 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6232 updated running status: ‹deleting data›
1.unredacted/cockroach.log:I231107 11:45:25.634842 313907 sql/gcjob/table_garbage_collection.go:208 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6233 deleting data for table 106
1.unredacted/cockroach.log:W231107 11:47:04.910281 315997 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6244 slow range RPC: have been waiting 60.04s (1 attempts) for RPC DeleteRange [/Table/106/2/‹3796847›/‹224773281›,/Table/106/2/‹4264621›/‹60340067›) to r176:‹/Table/106/2/{3796847/224773281-4264621/60340067}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:04.979481 316011 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6245 slow range RPC: have been waiting 60.11s (1 attempts) for RPC DeleteRange [/Table/106/2/‹11058844›/‹375203175›,/Table/106/2/‹11525662›/‹562365638›) to r165:‹/Table/106/2/11{058844/375203175-525662/562365638}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12, sticky=1699354842.793976559,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:05.219989 316019 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6246 slow range RPC: have been waiting 60.35s (1 attempts) for RPC DeleteRange [/Table/106/2/‹14528260›/‹438869093›,/Table/106/3) to r242:‹/Table/106/{2/14528260/438869093-3}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=11]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:05.948562 316013 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6247 slow range RPC: have been waiting 61.08s (1 attempts) for RPC DeleteRange [/Table/106/2/‹12020413›/‹551798245›,/Table/106/2/‹12487585›/‹141619073›) to r172:‹/Table/106/2/12{020413/551798245-487585/141619073}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=13, sticky=1699354935.801525687,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:05.952152 316018 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6248 slow range RPC: have been waiting 61.08s (1 attempts) for RPC DeleteRange [/Table/106/2/‹14061320›/‹144369665›,/Table/106/2/‹14528260›/‹438869093›) to r151:‹/Table/106/2/14{061320/144369665-528260/438869093}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=11]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:06.309310 316036 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6249 slow range RPC: have been waiting 61.44s (1 attempts) for RPC DeleteRange [/Table/106/3/‹9737›/‹544744292›,/Table/106/3/‹9816›/‹341083713›) to r173:‹/Table/106/3/9{737/544744292-816/341083713}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=13, sticky=1699354969.697987675,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:07.159145 316001 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6250 slow range RPC: have been waiting 62.29s (1 attempts) for RPC DeleteRange [/Table/106/2/‹6188089›/‹83358660›,/Table/106/2/‹6677099›/‹60018851›) to r147:‹/Table/106/2/6{188089/83358660-677099/60018851}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=11, sticky=1699354899.574934082,0]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:07.385102 315998 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6251 slow range RPC: have been waiting 62.51s (1 attempts) for RPC DeleteRange [/Table/106/2/‹4264621›/‹60340067›,/Table/106/2/‹4744319›/‹212707523›) to r276:‹/Table/106/2/4{264621/60340067-744319/212707523}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=12]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
1.unredacted/cockroach.log:W231107 11:47:07.627190 316014 kv/kvclient/kvcoord/dist_sender.go:1894 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=915168050689114113›] 6252 slow range RPC: have been waiting 62.76s (1 attempts) for RPC DeleteRange [/Table/106/2/‹12487585›/‹141619073›,/Table/106/2/‹12982367›/‹102781126›) to r277:‹/Table/106/2/12{487585/141619073-982367/102781126}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=13]; resp: ‹(err: <nil>), *kvpb.DeleteRangeResponse›
...
I'm handing this off to KV to understand why these DeleteRange RPCs took so long to return, leading to the test assertion failure. I don't think this 2-minute timeout should be bumped yet, as we've never seen this failure mode before. It's worth noting that this was a metamorphic build, so perhaps some metamorphic setting contributed to this timeout.
I also don't see any signs of hardware exhaustion.
Took a quick look, but I don't have much insight other than trying to repro with and without metamorphic parameters.
I did notice we had a similar issue (#106672) over the summer, which was fixed by @sumeerbhola on the Pebble side.
Test fails when using metamorphic cockroach builds. Note that runs were disabled (https://github.com/cockroachdb/cockroach/pull/114618) until we have a stable release.
Since there's an ongoing investigation, I'm leaving the issue open. Teams can decide if they want to continue pursuing a resolution / understanding of the failure.
I tried reproducing it with the metamorphic values from the most recent failure (#114320) but no luck. This is what I ran on a gce worker on release-23.2 @ c702128fa39c4aa84e810b898064fd2d66ace765:
ROACHTEST_ASSERTIONS_ENABLED_SEED=2653413680303407099 ./pkg/cmd/roachtest/roachstress.sh backup/mvcc-range-tombstones -- --count=50 --cpu-quota=1000
I think it's ok to close this. We can get back to it if it fails without the metamorphic build.
roachtest.backup/mvcc-range-tombstones failed with artifacts on release-23.2 @ 56fd045302ed2edc14831fb001ef77486ae8f62a:
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_metamorphicBuild=true, ROACHTEST_ssd=0
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12543086/backup-mvcc-range-tombstones/1699353144432/1699357788246)
/cc @cockroachdb/disaster-recovery
Jira issue: CRDB-33277