
kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [timed out while starting tenant] #119340

Open cockroach-teamcity opened 9 months ago

cockroach-teamcity commented 9 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ a58e89a2e54e3a5ad73edcce48a669e516cceedd:

Fatal error:

panic: test timed out after 4m57s
running tests:
    TestRangeFeedIntentResolutionRace (4m33s)

Stack:

goroutine 73770 [running]:
testing.(*M).startAlarm.func1()
    GOROOT/src/testing/testing.go:2259 +0x320
created by time.goFunc
    GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

```
=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace448879069
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
```

Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-36159

pav-kv commented 9 months ago

The test cluster start-up hangs?

goroutine 34902 [select, 4 minutes]:
github.com/cockroachdb/cockroach/pkg/server.(*serverController).startAndWaitForRunningServer(0x4007f2e1c0, {0x5f8b298, 0x92782e0}, {0x4b7297d?, 0x1bf08eb000?})
    github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:232 +0xcc
github.com/cockroachdb/cockroach/pkg/server.(*testServer).StartSharedProcessTenant(_, {_, _}, {{0x4b7297d, 0xb}, {0xa}, {{0x5f48680, 0x4006419c00}, {0x0, 0x0}, ...}, ...})
    github.com/cockroachdb/cockroach/pkg/server/testserver.go:1381 +0x474
github.com/cockroachdb/cockroach/pkg/server.(*testServer).startSharedProcessDefaultTestTenant(0x4003930000?, {0x5f8b298, 0x92782e0})
    github.com/cockroachdb/cockroach/pkg/server/testserver.go:660 +0x88
github.com/cockroachdb/cockroach/pkg/server.(*testServer).maybeStartDefaultTestTenant(0x4007be6e00, {0x5f8b298?, 0x92782e0?})
    github.com/cockroachdb/cockroach/pkg/server/testserver.go:710 +0x24c
github.com/cockroachdb/cockroach/pkg/server.(*testServer).Activate(0x4007be6e00, {0x5f8b298, 0x92782e0})
    github.com/cockroachdb/cockroach/pkg/server/testserver.go:836 +0xbc
github.com/cockroachdb/cockroach/pkg/testutils/serverutils.(*wrap).Activate(0x5f68d08?, {0x5f8b298, 0x92782e0})
    github.com/cockroachdb/cockroach/bazel-out/aarch64-fastbuild/bin/pkg/testutils/serverutils/ts_control_forwarder_generated.go:25 +0x54
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start(0x4007747800, {0x5fc1d80, 0x4008ceb380?})
    github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:484 +0x634
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster({_, _}, _, {{{{0x5f48680, 0x4006419880}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
    github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:238 +0x60
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test.TestRangeFeedIntentResolutionRace(0x4008ceb380)
    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1321 +0x338
testing.tRunner(0x4008ceb380, 0x4ee2c10)
    GOROOT/src/testing/testing.go:1595 +0xe8
created by testing.(*T).Run in goroutine 1
    GOROOT/src/testing/testing.go:1648 +0x33c
nvanbenschoten commented 8 months ago

We've seen the same failure in https://github.com/cockroachdb/cockroach/issues/115345, https://github.com/cockroachdb/cockroach/issues/115080, and https://github.com/cockroachdb/cockroach/issues/114210.

Since these stuck startups are in StartSharedProcessTenant, I'm going to move this over to T-multitenant and allow them to make the call on whether this should remain a release-blocker and what to do with it.

stevendanna commented 8 months ago

I am not sure if I've reproduced the exact same thing, but I'm able to reliably get this test to time out under stress with a 1 minute timeout.

What is happening here is interesting. I don't have the whole story yet.

This is a 3 node test. As the 3 tenant SQL servers start up, they all see that the permanent migrations need to run:

I240320 10:55:12.666645 4455 upgrade/upgrademanager/manager.go:237  [T10,Vtest-tenant,n3] 312  the last permanent upgrade (v0.0-upgrading-step-004) does not appear to have completed; attempting to run all upgrades
I240320 10:55:12.691847 4636 upgrade/upgrademanager/manager.go:237  [T10,Vtest-tenant,n1] 328  the last permanent upgrade (v0.0-upgrading-step-004) does not appear to have completed; attempting to run all upgrades
I240320 10:55:13.099549 4485 upgrade/upgrademanager/manager.go:237  [T10,Vtest-tenant,n2] 382  the last permanent upgrade (v0.0-upgrading-step-004) does not appear to have completed; attempting to run all upgrades

It appears n1 won the race to create the migration job, the other two nodes simply wait for n1's job:

I240320 10:55:13.046709 4455 upgrade/upgrademanager/manager.go:774  [T10,Vtest-tenant,n3] 379  found existing migration job 953095660148752385 for version 0.0-upgrading-step-004 in status running, waiting
I240320 10:55:13.147829 4485 upgrade/upgrademanager/manager.go:774  [T10,Vtest-tenant,n2] 386  found existing migration job 953095660148752385 for version 0.0-upgrading-step-004 in status running, waiting

Over on n1 we make it all the way to:

I240320 10:55:13.034107 5259 upgrade/upgrades/permanent_upgrades.go:80  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 378  executing bootstrap step "add default SQL schema telemetry schedule"

It appears that the transaction for that migration is stuck:

W240320 10:55:14.298830 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 397  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 10 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=9 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.315082 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 398  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 20 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=19 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.366451 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 402  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 30 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=29 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.379330 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 403  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 40 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=39 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.473752 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 407  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 50 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=49 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.514074 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 409  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 60 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=59 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.590555 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 411  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 70 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08239248 epo=69 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.609663 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 413  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 80 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=79 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.832972 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 422  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 90 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=89 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.866958 5259 kv/txn.go:1129  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 425  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 100 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=99 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Is the transaction stuck in a retry loop?
W240320 10:55:14.868632 5259 kv/txn.go:1123  [T10,Vtest-tenant,n1,job=MIGRATION id=953095660148752385,upgrade=0.0-upgrading-step-004] 426  have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 101 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=100 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Terminating retry loop and returning error due to cluster setting kv.transaction.internal.max_auto_retries (100). Rollback error: <nil>.

This bubbles up to the resumer:

E240320 10:55:14.869355 5259 jobs/adopt.go:461  [T10,Vtest-tenant,n1] 428  job 953095660148752385: adoption completed with error non-cancelable: running migration for 0.0-upgrading-step-004: bootstrap step "add default SQL schema telemetry schedule" failed: have retried transaction: unnamed (id: e046a36b-195f-464e-aa73-0de4068f6175) 101 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=80899ac1 key=/Tenant/10/Table/37/1/953095661709918209/0 iso=Serializable pri=0.02727732 epo=2 ts=1710932114.107368000,1 min=1710932112.672255000,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=e046a36b key=/Tenant/10/Table/37/1/953095661872152577/0 iso=Serializable pri=0.08378192 epo=100 ts=1710932114.135756000,0 min=1710932113.034121000,0 seq=0} lock=true stat=PENDING rts=1710932114.135756000,0 wto=false gul=1710932113.534121000,0. Terminating retry loop and returning error due to cluster setting kv.transaction.internal.max_auto_retries (100). Rollback error: <nil>.

But I believe that because this is a non-cancelable job, it will still be in a "running" state, waiting to be re-adopted. Now, it should get adopted again, but in this 1m test run we may not be running long enough for that to happen.
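To make the timing argument concrete, here is a toy Go sketch. The numbers are assumptions chosen purely for illustration; the real re-adoption cadence is governed by the jobs registry's settings, not these constants.

```go
// Package readopt is a toy timing model, not CockroachDB code.
package readopt

import "time"

const (
	assumedAdoptInterval = 30 * time.Second // hypothetical re-adoption cadence
	migrationFailsAfter  = 5 * time.Second  // the retry loop gives up shortly after startup
	testTimeout          = time.Minute
)

// retryLandsInTime reports whether a failed, non-cancelable migration job would
// be re-adopted before the stress run's timeout fires, ignoring all the other
// startup work (tenant creation, waiting on the retried job) that also has to fit.
func retryLandsInTime() bool {
	return migrationFailsAfter+assumedAdoptInterval < testTimeout
}
```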

I've been unable to reproduce this with a longer timeout, because I almost always hit:

    rangefeed_external_test.go:1428: 
                Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1416
                                                        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1428
                Error:          Expected value not to be nil.
                Test:           TestRangeFeedIntentResolutionRace
                Messages:       nil checkpoint

Before that happens.

Focusing for a moment on why we might have contention: I can see why there would be contention in the short term during startup. Namely, before the permanent migrations that create these new schedules run, we also start goroutines on every node that poll the schedules table and try to create the schedule. So it makes sense that all these queries trying to create the same schedule may contend. Before 79219f705fc76eae5753ea5e28b7077dfacfd213, the backoff in the loops that try to create these schedules didn't matter, because the retries happened in a very tight db.Txn retry loop. After that commit I think the system eventually unsticks itself.
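For illustration, here is a minimal Go sketch (assumed names, not the actual CockroachDB code) of the difference between the two behaviors described above: retrying the contended schedule-creation transaction in a tight loop versus backing off between attempts so one writer gets a chance to commit.

```go
// Package contention sketches tight vs. backed-off retry loops; createFn stands
// in for the schedule-creation transaction every node runs at startup.
package contention

import (
	"context"
	"errors"
	"time"
)

// errRetryable stands in for a retryable error such as
// TransactionRetryWithProtoRefreshError.
var errRetryable = errors.New("retryable txn error")

type createFn func(context.Context) error

// tightLoop retries immediately on a retryable error; under contention every
// node keeps re-issuing the same contended write, matching the retry-loop
// warnings in the logs above.
func tightLoop(ctx context.Context, create createFn) error {
	for {
		if err := create(ctx); err == nil || !errors.Is(err, errRetryable) {
			return err
		}
	}
}

// backoffLoop sleeps between attempts, giving one writer time to commit so the
// system can eventually unstick itself.
func backoffLoop(ctx context.Context, create createFn) error {
	backoff := 10 * time.Millisecond
	for {
		err := create(ctx)
		if err == nil || !errors.Is(err, errRetryable) {
			return err
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff < time.Second {
			backoff *= 2
		}
	}
}
```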

cockroach-teamcity commented 7 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 666fdb445868b6f27862313ee75d032687f1b3db:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2243622718
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1444: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1467: started rangefeed on [n3,s3,r71/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1521: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1509
                                        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1521
            Error:          Expected value not to be nil.
            Test:           TestRangeFeedIntentResolutionRace
            Messages:       nil checkpoint
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2243622718
--- FAIL: TestRangeFeedIntentResolutionRace (60.38s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

cockroach-teamcity commented 7 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 3c643b003890560b16c4fee1d1c18bea1871803b:

Fatal error:

panic: test timed out after 4m57s
running tests:
    TestRangeFeedIntentResolutionRace (4m5s)

Stack:

goroutine 89414 [running]:
testing.(*M).startAlarm.func1()
    GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
    GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

```
=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRangeFeedIntentResolutionRace717568773
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
```

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

cockroach-teamcity commented 7 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 934684d8134fd8bb34eae1a37f3aa83a4ac066b7:

Fatal error:

panic: test timed out after 4m57s
running tests:
    TestRangeFeedIntentResolutionRace (4m0s)

Stack:

goroutine 89234 [running]:
testing.(*M).startAlarm.func1()
    GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
    GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

```
=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRangeFeedIntentResolutionRace1421379124
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
```

Parameters:

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

cockroach-teamcity commented 7 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed on master @ 714dc4da48a3d2a07b5d097542808982f848f704:

Fatal error:

panic: test timed out after 4m57s
running tests:
    TestRangeFeedIntentResolutionRace (4m13s)

Stack:

goroutine 95968 [running]:
testing.(*M).startAlarm.func1()
    GOROOT/src/testing/testing.go:2366 +0x385
created by time.goFunc
    GOROOT/src/time/sleep.go:177 +0x2d
Log preceding fatal error

```
=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestRangeFeedIntentResolutionRace1687467905
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
```

Parameters:

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

- #122569 kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [C-test-failure O-robot T-kv-replication branch-release-24.1 release-blocker]

This test on roachdash | Improve this report!

cockroach-teamcity commented 5 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ f2e7709ee3912568de9e214560292844bf4e9f23:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4171001780
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1445: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1468: started rangefeed on [n3,s3,r71/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1522: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1510
                                        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1522
            Error:          Expected value not to be nil.
            Test:           TestRangeFeedIntentResolutionRace
            Messages:       nil checkpoint
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4171001780
--- FAIL: TestRangeFeedIntentResolutionRace (60.23s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Same failure on other branches

- #122569 kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [A-testing C-bug C-test-failure O-robot P-3 T-multitenant branch-release-24.1]

This test on roachdash | Improve this report!

cockroach-teamcity commented 5 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 1ca84f37a5c230267fe4c9b209983db62becce6a:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2771884319
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1445: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1467: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1467
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestRangeFeedIntentResolutionRace
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace2771884319
--- FAIL: TestRangeFeedIntentResolutionRace (57.49s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Same failure on other branches

- #122569 kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [A-testing C-bug C-test-failure O-robot P-3 T-multitenant branch-release-24.1]

This test on roachdash | Improve this report!

stevendanna commented 4 months ago

I think there are probably a few things going on here, but we really should try to dig into these slow tenant startup issues.

wenyihu6 commented 4 months ago

I was able to reproduce this reliably under stress on master, with the same error as https://github.com/cockroachdb/cockroach/issues/119340#issuecomment-2009350607.

It timed out before the rangefeed sink was started in https://github.com/cockroachdb/cockroach/blob/6807734f9e2d936cd7b60111606ce5610eb665af/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go#L1466.

./dev test pkg/kv/kvclient/rangefeed -f TestRangeFeedIntentResolutionRace --stress

--- FAIL: TestRangeFeedIntentResolutionRace (60.43s)
    test_log_scope.go:170: test logs captured to: /tmp/cockroach/_tmp/6ad7a96210225122a7001de7a6986a7e/logTestRangeFeedIntentResolutionRace2387946391
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1445: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1468: started rangefeed on [n3,s3,r71/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1522:
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1510
                                        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1522
            Error:          Expected value not to be nil.
            Test:           TestRangeFeedIntentResolutionRace
            Messages:       nil checkpoint
    panic.go:626: -- test log scope end --
cockroach-teamcity commented 4 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ a488caf28f80fb12c15a67522862773355ff0c1f:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace611157878
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1443: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1465: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1465
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestRangeFeedIntentResolutionRace
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace611157878
--- FAIL: TestRangeFeedIntentResolutionRace (65.77s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

cockroach-teamcity commented 3 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 195c5cddb8ac4b8e0ae0db9a05fd6ed4e4f3fbdd:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace3313843888
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1443: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1465: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1465
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestRangeFeedIntentResolutionRace
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace3313843888
--- FAIL: TestRangeFeedIntentResolutionRace (57.65s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

cockroach-teamcity commented 2 months ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on master @ 7512f81be1591cde8e3b5d34bb0b9b1101f22d24:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4152603468
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1443: split off range r72:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1466: started rangefeed on [n3,s3,r72/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1520: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1508
                                        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1520
            Error:          Expected value not to be nil.
            Test:           TestRangeFeedIntentResolutionRace
            Messages:       nil checkpoint
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4152603468
--- FAIL: TestRangeFeedIntentResolutionRace (61.71s)
Help

See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

This test on roachdash | Improve this report!

shubhamdhama commented 2 months ago

TL;DR: mostly just reproducing the issue; no new observations beyond what is already mentioned in https://github.com/cockroachdb/cockroach/issues/119340#issuecomment-2009350607.

Progress so far

I am able to reproduce all three failures mentioned here under stress. If I put a timeout of 1 min, it panics during startup:

❯ dev test pkg/kv/kvclient/rangefeed:rangefeed_test -f TestRangeFeedIntentResolutionRace --stress --count 20000 --timeout 1m --show-logs --verbose -- --test_output=errors

panic: test timed out after 57s
running tests:
    TestRangeFeedIntentResolutionRace (57s)
...
github.com/cockroachdb/cockroach/pkg/server.(*serverController).startAndWaitForRunningServer(0xc003fc8ee0, {0x6ac9c28, 0xa327c00}, {0x555f1ce?, 0x1bf08eb000?})
    github.com/cockroachdb/cockroach/pkg/server/server_controller.go:374 +0xf0

and the analysis in https://github.com/cockroachdb/cockroach/issues/119340#issuecomment-2009350607 seems correct:

> Focusing for a moment on why we might have contention: I can see why there would be contention in the short term during startup. Namely, before the permanent migrations that create these new schedules run, we also start goroutines on every node that poll the schedules table and try to create the schedule. So it makes sense that all these queries trying to create the same schedule may contend.

I verified this by putting a stack trace print in https://github.com/cockroachdb/cockroach/blob/aeec29996e7199575d7585a7851756d2331a81d2/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go#L212-L218
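For reference, this is the sort of instrumentation meant above; the helper name and message are hypothetical, not the actual patch used.

```go
// Package debugutil: drop a call like dumpCallers into the schedule-creation
// path to see which startup goroutines are racing to create the schedule.
package debugutil

import (
	"fmt"
	"runtime/debug"
)

// dumpCallers prints the current goroutine's stack at the given call site.
func dumpCallers(where string) {
	fmt.Printf("schedule creation attempt at %s:\n%s\n", where, debug.Stack())
}
```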

When --timeout is set to a larger value (2m), it fails inconsistently between

            Error:          Expected value not to be nil.

and

            Error:          Received unexpected error:
                            context deadline exceeded

and in both cases the failure occurs after successful adoption and completion of the migration:

I240903 14:58:07.550793 15874 1@jobs/structured_log.go:60  [T10,Vtest-tenant,n3] 505 ={"Timestamp":1725375487550789995,"EventType":"status_change","JobID":1000423535935651842,"JobType":"MIGRATION","Description":"status changed to: succeeded","PreviousStatus":"running","NewStatus":"succeeded","RunNum":2}

Just out of curiosity, with a --timeout of 2 minutes, I tried increasing the test's context timeout to a much larger value,

    ctx, cancel := context.WithTimeout(context.Background(), 300*time.Second)

but it failed unexpectedly even before giving up on transaction retries:

# this line just to show how soon it panicked
I240904 15:17:40.782599 16 1@server/server.go:2005  [T1,Vsystem,n1] 46  starting postgres server at 127.0.0.1:34155 (use: 127.0.0.1:34155)
...
I240904 15:17:43.906064 4795 1@sql/catalog/schematelemetry/schematelemetrycontroller/controller.go:217  [T10,Vtest-tenant,n3] 429 + github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x3fe
panic: error verifying key /Tenant/10/Table/37/1/1000710681101893635/0; lock table num=14
 lock: /Tenant/10/Table/37/1/1000710677910618115/0
  holder: txn: 08c2bf26-0a5a-44a6-8e31-809b8a2bf9f3 epoch: 2, iso: Serializable, ts: 1725463063.883275891,1, info: repl [Intent]
   waiting readers:
    req: 448, txn: e5c74f28-60ce-40ed-b2dd-15b0dab6f563
    req: 447, txn: 0eba4bd7-25f6-46a8-93fd-0483e1300e54
 lock: /Tenant/10/Table/37/1/1000710677910618115/1/1
  holder: txn: 08c2bf26-0a5a-44a6-8e31-809b8a2bf9f3 epoch: 2, iso: Serializable, ts: 1725463063.883275891,1, info: repl [Intent]

Anyway, my first goal here is to understand why startup is slow, so I'm sidelining the issues that show up when --timeout is 2 minutes.

To my knowledge, only this test case is facing such a slow startup, so my suspicion is that the test setup done here before starting the test server is causing the transaction contention.

tbg commented 2 months ago

I just investigated the same failure over in https://github.com/cockroachdb/cockroach/issues/130931#issuecomment-2360695688. I'll now mark that one as a duplicate of this one and assign it to T-db-server as well. In case it helps, the short-form analysis seems to be:

We can:

The former seems like a real solution and would solve a problem that seems likely to also affect production. The second is pure band aid. My suggestion is to do both, in the order in which they appear here.

stevendanna commented 2 months ago

Uff. Apologies, I dropped the ball on this and so many of you have spent time re-investigating this issue.

The first bit of contention I originally found was around this telemetry schedule creation.

1) Move this call out to a different method that we only start after the migrations are done. Other parts of server startup have a similar multi-part startup. (A rough sketch of this gating appears after this list.)

https://github.com/cockroachdb/cockroach/blob/285460a47358406e65df9ff3deb5a0764f58a2bf/pkg/sql/conn_executor.go#L620

2) Remove the permanent migration that adds the schedule completely. If we have this anti-entropy mechanism that is going to check on every startup, why not just rely on that?

3) Understand why the schedule creation contends for so long. Sure, they are both trying to create the same schedule, but it kinda seems to me like one of these writers should win eventually.
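For what it's worth, here is a rough Go sketch of option 1. All names are assumptions, not the actual server code: the idea is to gate the telemetry-schedule work behind a signal that the permanent migrations have completed, so it no longer races with the migration's own schedule-creating transaction.

```go
// Package startup sketches deferring schedule creation until after migrations.
package startup

import "context"

type sqlServer struct {
	migrationsDone chan struct{} // closed once permanent upgrades have run
}

// startSchemaTelemetrySchedule waits for migrations to finish (or for shutdown)
// before it begins creating/polling the telemetry schedule.
func (s *sqlServer) startSchemaTelemetrySchedule(
	ctx context.Context, createSchedule func(context.Context) error,
) error {
	select {
	case <-s.migrationsDone:
	case <-ctx.Done():
		return ctx.Err()
	}
	return createSchedule(ctx)
}
```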

spilchen commented 1 month ago

The second approach seems more logical to me. As you pointed out, we already have startup code that ensures the schedule is in place (see below), so it's unclear why a permanent upgrade task would also be needed for this.

https://github.com/cockroachdb/cockroach/blob/aeec29996e7199575d7585a7851756d2331a81d2/pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go#L113