cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed #130931

Open · cockroach-teamcity opened 1 month ago

cockroach-teamcity commented 1 month ago

kv/kvclient/rangefeed.TestRangeFeedIntentResolutionRace failed with artifacts on release-24.2 @ 8ed91623526524c1436eba5f1940aceec1711b47:

=== RUN   TestRangeFeedIntentResolutionRace
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/02bf21f45b4efa208143142b81d56d79/logTestRangeFeedIntentResolutionRace4063560172
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
    rangefeed_external_test.go:1445: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]
    rangefeed_external_test.go:1468: started rangefeed on [n3,s3,r71/3:/Tenant/1{0/Table/…-1}]
    rangefeed_external_test.go:1522: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1510
                                        github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangefeed_test/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go:1522
            Error:          Expected value not to be nil.
            Test:           TestRangeFeedIntentResolutionRace
            Messages:       nil checkpoint
    panic.go:626: -- test log scope end --
--- FAIL: TestRangeFeedIntentResolutionRace (66.41s)
Help

See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Same failure on other branches

- #119340 kv/kvclient/rangefeed: TestRangeFeedIntentResolutionRace failed [timed out while starting tenant] [A-testing C-bug C-test-failure O-robot P-3 T-db-server branch-master]

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-42299

tbg commented 1 month ago

Summary: there was an unexplained deadlock between some txns that seem to be part of tenant start-up, which forced the migration job running as part of that start-up to fail. The job was only retried a minute later, so the test hit timeouts that caused the error above.

This is essentially a duplicate of https://github.com/cockroachdb/cockroach/issues/119340#issuecomment-2009350607, and the analysis here corroborates that of @stevendanna at the link. We seem to see spurious job failures during early start-up, which then inject a delay of well over 30s and fail the test in the way observed both here and in that other issue.


We hit the nil checkpoint here:

/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go#L1504-L1522

    waitForCheckpoint := func(t *testing.T, ts hlc.Timestamp) hlc.Timestamp {
        t.Helper()
        timeoutC := time.After(3 * time.Second)
        for {
            select {
            case c := <-checkpointC:
                require.NotNil(t, c, "nil checkpoint")
                if ts.LessEq(c.ResolvedTS) {
                    t.Logf("rangefeed checkpoint at %s >= %s", c.ResolvedTS, ts)
                    return c.ResolvedTS
                }
            case <-timeoutC:
                require.Fail(t, "timed out waiting for checkpoint", "wanted %s", ts)
            }
        }
    }

    // Wait for the initial checkpoint.
    rts := waitForCheckpoint(t, clock.Now())
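
For reference, the reason this surfaces as the "nil checkpoint" assertion rather than the "timed out waiting for checkpoint" branch: a receive from a closed channel completes immediately with the element type's zero value, so the select above takes the channel case. A minimal standalone sketch (not from the test) of that behavior:

    package main

    import "fmt"

    type checkpoint struct{ resolved int64 }

    func main() {
        checkpointC := make(chan *checkpoint, 1)
        close(checkpointC)

        // A receive from a closed channel does not block: it yields the zero
        // value (nil for a pointer element type) and ok == false, so a select
        // with a timeout case takes this branch immediately.
        c, ok := <-checkpointC
        fmt.Println(c, ok) // prints: <nil> false
    }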

There's only one write to the channel, and it's a certified non-nil checkpoint:

/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go#L1484-L1490

                case e.Checkpoint != nil:
                    // Clear checkpointC such that it always contains the latest.
                    select {
                    case <-checkpointC:
                    default:
                    }
                    checkpointC <- e.Checkpoint
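
This drain-then-send idiom turns a buffered channel into a "latest value only" mailbox: the producer never blocks, and the consumer always sees the most recent checkpoint. A minimal sketch of the pattern, assuming (as the test presumably does) a channel with a buffer of one and a single producer:

    package main

    import "fmt"

    type checkpoint struct{ seq int }

    // publish replaces whatever is sitting in the 1-buffered channel with the
    // newest value, without ever blocking the (single) producer.
    func publish(latestC chan *checkpoint, c *checkpoint) {
        select {
        case <-latestC: // drop a stale, unconsumed value
        default:
        }
        latestC <- c
    }

    func main() {
        latestC := make(chan *checkpoint, 1)
        for i := 1; i <= 3; i++ {
            publish(latestC, &checkpoint{seq: i})
        }
        fmt.Println((<-latestC).seq) // prints 3: only the latest value survives
    }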

Since the only value ever sent is non-nil, what really happened must be that the channel was closed. The only place that does this is here, when the writer goroutine exits:

/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go#L1478

        defer close(checkpointC)

That happens only when the context is done:

/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go#L1498-L1500

            case <-ctx.Done():
                return
            }

The context has a 30s timeout:

/pkg/kv/kvclient/rangefeed/rangefeed_external_test.go#L1359-L1360

    // Use a timeout, to prevent a hung test.
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)

It ran for well longer than that:

--- FAIL: TestRangeFeedIntentResolutionRace (66.41s)
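
Putting the pieces together, here is a compressed standalone sketch (not the test's actual wiring) of the failure mechanism: a producer that closes its output channel once a WithTimeout context expires turns a blown time budget into exactly the nil receive asserted on above.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    type checkpoint struct{ seq int }

    func main() {
        // The test uses 30s; a short timeout keeps the sketch fast.
        ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
        defer cancel()

        checkpointC := make(chan *checkpoint, 1)
        go func() {
            defer close(checkpointC) // runs as soon as the context expires
            <-ctx.Done()
        }()

        time.Sleep(100 * time.Millisecond) // simulate slow startup eating the budget
        c := <-checkpointC                 // the channel is closed by now, so c == nil
        fmt.Println(c)                     // prints <nil>: the "nil checkpoint"
    }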

We (randomly) used a shared process virtual cluster in this test:

    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.

In the server logs, I see that txn starvation is preventing the tenant from fully starting:

W240918 11:51:26.922642 67974 kv/txn.go:1138 ⋮ [T10,Vtest-tenant,n2] 471  have retried transaction: ‹unnamed (id: 71bbba1f-9978-4a99-94bf-bbc7ff271178)› 70 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=9bde0274 key=/Tenant/10/Table/37/1/‹1004633734944292866›/‹0› iso=Serializable pri=0.04003989 epo=3 ts=1726660285.914413661,1 min=1726660284.522660260,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=71bbba1f key=/Tenant/10/Table/37/1/‹1004633736630468610›/‹0› iso=Serializable pri=0.14560758 epo=66 ts=1726660286.422224291,0 min=1726660284.713249116,0 seq=0} lock=true stat=PENDING rts=1726660286.422224291,0 wto=false gul=1726660285.213249116,0. Is the transaction stuck in a retry loop?

This affects a number of txns, but a few seconds in, a deadlock gets broken and then these errors disappear:

I240918 11:51:27.074986 73898 kv/kvserver/txnwait/queue.go:767 ⋮ [T1,Vsystem,n1,s1,r69/1:‹/Tenant/1{0-1}›] 472  ‹9bde0274› breaking deadlock by force push of ‹8cdb0304›; dependencies=[‹8cdb0304›]

so this can't account for enough of the delay.

We see tenant startup complete only around 11:52:26, while the test started at 11:51:23.

I240918 11:52:26.851703 66804 1@server/server_controller_new_server.go:395 ⋮ [T10,Vtest-tenant,n1,tenant-orchestration,tenant=‹test-tenant›,start-server] 727  tenant startup completed:

The split in the test for

rangefeed_external_test.go:1445: split off range r71:/Tenant/1{0/Table/Max-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7, sticky=9223372036.854775807,2147483647]

occurs only at ~11:52:29, over a minute into the test, and shortly after the tenant finally started up:

I240918 11:52:29.170167 62410 13@kv/kvserver/replica_command.go:487 ⋮ [T1,Vsystem,n1,s1,r69/1:‹/Tenant/1{0-1}›,*kvpb.AdminSplitRequest] 5  initiating a split of this range at key /Tenant/10/Table/Max [r71] (manual)

So I think we simply spent too much time getting to this point, and the context timed out.

Unfortunately, this is one of those tests that set up relatively complicated traps and triggers. Fortunately, they all log when they fire, and none of them did here.

The actual explanation seems to be more straightforward: the migration job (presumably) fails due to the txn contention at the beginning:

E240918 11:51:26.606802 68829 jobs/adopt.go:456 ⋮ [T10,Vtest-tenant,n1] 458 job 1004633732833247233: adoption completed with error non-cancelable: running migration for 0.0-upgrading-step-004: bootstrap step ‹"add default SQL schema telemetry schedule"› failed: have retried transaction: ‹unnamed (id: 1d837283-8452-4790-adbd-5fc1cabf8f36)› 101 times, most recently because of the retryable error: TransactionRetryWithProtoRefreshError: failed to push meta={id=9bde0274 key=/Tenant/10/Table/37/1/‹1004633734944292866›/‹0› iso=Serializable pri=0.04003989 epo=3 ts=1726660285.914413661,1 min=1726660284.522660260,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0: "unnamed" meta={id=1d837283 key=/Tenant/10/Table/37/1/‹1004633734939213825›/‹1›/‹1› iso=Serializable pri=0.08129468 epo=100 ts=1726660286.178790201,0 min=1726660284.533259750,0 seq=0} lock=true stat=PENDING rts=1726660286.178790201,0 wto=false gul=1726660285.033259750,0. Terminating retry loop and returning error due to cluster setting kv.transaction.internal.max_auto_retries (100). Rollback error: .

The job then isn't adopted again until exactly a minute later:

I240918 11:52:26.758785 82619 jobs/adopt.go:247 ⋮ [T10,Vtest-tenant,n1] 705 job 1004633732833247233: resuming execution

And it makes sense that tenant startup waits for the migration job to resume. I looked through the various jobs settings^1 and I don't see any that is exactly one minute, but a few are 30 seconds. Either way, I'm reasonably confident that this is what caused the test to slow down.
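
If we also want to make the test itself more tolerant in the meantime, one option would be to shorten the job registry's adoption/retry intervals so a failed migration job is retried well within the test's 30s budget. The sketch below is only a sketch: the test name is hypothetical, it assumes the standard jobs.NewTestingKnobsWithShortIntervals helper, and I have not verified that the knobs propagate to the shared-process virtual cluster's own job registry.

    package rangefeed_test

    import (
        "context"
        "testing"

        "github.com/cockroachdb/cockroach/pkg/base"
        "github.com/cockroachdb/cockroach/pkg/jobs"
        "github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
    )

    // TestWithShortJobIntervals is a hypothetical variant of the setup, showing
    // only the knob wiring; leaktest/log-scope boilerplate is omitted.
    func TestWithShortJobIntervals(t *testing.T) {
        ctx := context.Background()
        tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
            ServerArgs: base.TestServerArgs{
                Knobs: base.TestingKnobs{
                    // Adopt and retry jobs on an aggressive schedule instead of
                    // the default intervals.
                    JobsTestingKnobs: jobs.NewTestingKnobsWithShortIntervals(),
                },
            },
        })
        defer tc.Stopper().Stop(ctx)
        // ... rest of the test unchanged ...
    }

If the knobs don't reach the in-process tenant, bumping the tenant's jobs.registry.* adoption/retry cluster settings at the start of the test would be the fallback. Either way, this only papers over the contention, so the clean-up tracked in the other issue is still the real fix.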

shubhamdhama commented 2 weeks ago

Changing the priority to P-2. The plan is to do the clean-up mentioned here https://github.com/cockroachdb/cockroach/issues/119340#issuecomment-2360756246