cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: investigate why TestTraceAnalyzer does not work with a shared process virtual cluster #113598

Closed: herkolategan closed this issue 11 months ago

herkolategan commented 11 months ago

Describe the problem

When running TestTraceAnalyzer with a shared-process virtual cluster, the expected result is not always present.

--- FAIL: TestTraceAnalyzer (7.54s)
    test_log_scope.go:170: test logs captured to: /Users/herko/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/9a1aef2e646aed4bbaa53ef8a53ddeb0/logTestTraceAnalyzer1497392114
    test_log_scope.go:81: use -show-logs to present logs inline
    --- FAIL: TestTraceAnalyzer/RowExec (0.00s)
        traceanalyzer_test.go:175:
                Error Trace:    github.com/cockroachdb/cockroach/pkg/sql/execstats_test/pkg/sql/execstats/traceanalyzer_test.go:175
                Error:          Not equal:
                                expected: 3
                                actual  : 2
                Test:           TestTraceAnalyzer/RowExec
                Messages:       expected all nodes to have specified maximum disk usage
    --- FAIL: TestTraceAnalyzer/ColExec (0.00s)
        traceanalyzer_test.go:175:
                Error Trace:    github.com/cockroachdb/cockroach/pkg/sql/execstats_test/pkg/sql/execstats/traceanalyzer_test.go:175
                Error:          Not equal:
                                expected: 3
                                actual  : 2
                Test:           TestTraceAnalyzer/ColExec
                Messages:       expected all nodes to have specified maximum disk usage
    traceanalyzer_test.go:202: -- test log scope end --
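
For reference, the failing check at traceanalyzer_test.go:175 is, in spirit, an equality assertion over how many nodes reported the expected maximum disk usage in the trace. A paraphrase, not the actual test code (the variable name is hypothetical):

    // Paraphrase of the assertion behind the failures above: in a 3-node
    // cluster, all 3 nodes should report the specified maximum disk usage,
    // but under a shared-process virtual cluster only 2 do.
    require.Equal(t, 3, numNodesWithDiskUsage,
        "expected all nodes to have specified maximum disk usage")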

How to reproduce

Set the following in TestServerArgs to force a shared-process tenant, then stress the test to trigger the failure:

DefaultTestTenant: base.SharedTestTenantAlwaysEnabled,
./dev test pkg/sql/execstats -f=TestTraceAnalyzer --stress
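
For context, a minimal sketch of where that override goes; the surrounding test-cluster scaffolding is illustrative, and only the DefaultTestTenant field comes from this issue:

    // Force a shared-process virtual cluster on every server in the test.
    ctx := context.Background()
    tc := testcluster.StartTestCluster(t, 3 /* nodes */, base.TestClusterArgs{
        ServerArgs: base.TestServerArgs{
            DefaultTestTenant: base.SharedTestTenantAlwaysEnabled,
        },
    })
    defer tc.Stopper().Stop(ctx)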

Expected behavior

The test should pass when running with a shared-process virtual cluster, unless the investigation concludes that the test is not expected to work with one.

Ref: https://github.com/cockroachdb/cockroach/issues/112857
Epic: CRDB-26687

Jira issue: CRDB-33065

yuzefovich commented 11 months ago

I looked into this, and I believe the problem is a race when starting shared-process tenancy in a multi-node cluster. Namely, the race is between TestServerController.Activate (which explicitly starts a shared-process tenant on a server, with the correct testing knobs) and the serverController on n2 and n3 starting the tenant itself after being notified via the rangefeed; crucially, at that point the controller is not aware of the testing knobs. (I ran into a similar problem in c5d847f65f20b4193eee397369642ea7c8c5a0dc, where I had to tweak the server initialization order to remove the race.)

I briefly looked into how we could fix this race, and I didn't immediately see an easy way to do so. @stevendanna, curious if you have some ideas around this. I think we should fix this properly sooner rather than later, and I'm happy to do that, but I'd appreciate some pointers.

(BTW, an easy way to reliably repro this problem with TestTraceAnalyzer is to add time.Sleep(time.Second) right after each s.Activate call in TestCluster.Start.)
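
Concretely, that repro tweak looks something like this inside TestCluster.Start (the loop structure is paraphrased; only the Activate-then-sleep ordering is the point):

    // Illustrative paraphrase: sleeping after each Activate gives the
    // serverControllers on the not-yet-activated nodes time to notice the
    // new tenant via the rangefeed and start it without the testing knobs,
    // making the race lose reliably.
    for _, s := range tc.Servers {
        // ... existing per-server startup ...
        s.Activate(ctx)
        time.Sleep(time.Second)
    }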

stevendanna commented 11 months ago

@yuzefovich I agree we need to fix this. I suspect this will be one of the main causes of shared-process test issues.

TestServer has code to inject the testing knobs into the server controller in StartSharedProcessTenant:

    // Save the args for use if the server needs to be created.
    func() {
        ts.topLevelServer.serverController.mu.Lock()
        defer ts.topLevelServer.serverController.mu.Unlock()
        ts.topLevelServer.serverController.mu.testArgs[args.TenantName] = args
    }()

But this only happens on the first server, not on the other servers. I haven't thought about this much, but perhaps we could do two things. First, for the randomized test tenants, we always create the tenant with the same name:

func (ts *testServer) startSharedProcessDefaultTestTenant(
    ctx context.Context,
) (serverutils.ApplicationLayerInterface, error) {
    params := base.TestSharedProcessTenantArgs{
        TenantName:  "test-tenant",
        TenantID:    serverutils.TestTenantID(),
        Knobs:       ts.params.Knobs,
        UseDatabase: ts.params.UseDatabase,
    }
    // See comment above on separate process tenant regarding the testing knobs.
    params.Knobs.Server = &TestingKnobs{}
    if ts.params.Knobs.Server != nil {
        params.Knobs.Server.(*TestingKnobs).DiagnosticsTestingKnobs = ts.params.Knobs.Server.(*TestingKnobs).DiagnosticsTestingKnobs
    }

    tenant, _, err := ts.StartSharedProcessTenant(ctx, params)
    if err != nil {
        return nil, err
    }

    return tenant, err
}

So, maybe instead of waiting until we start the tenant to populate those tenant testing knobs, we could unconditionally add them to the server controller during test-server startup, so that if a test tenant is started, it gets the testing knobs by default.
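
A minimal sketch of that first idea, reusing the identifiers from the snippets above (the helper name and call site are hypothetical, not the actual implementation):

    // Hypothetical helper, called during testServer startup on every
    // server: pre-register the default test tenant's args with the server
    // controller so that a tenant started via the rangefeed path picks up
    // the testing knobs instead of racing past them.
    func (ts *testServer) registerDefaultTestTenantArgs() {
        args := base.TestSharedProcessTenantArgs{
            TenantName: "test-tenant", // same fixed name as above
            TenantID:   serverutils.TestTenantID(),
            Knobs:      ts.params.Knobs,
        }
        ts.topLevelServer.serverController.mu.Lock()
        defer ts.topLevelServer.serverController.mu.Unlock()
        ts.topLevelServer.serverController.mu.testArgs[args.TenantName] = args
    }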

Then perhaps we could also provide an API for starting tenants on a test cluster that adds the knobs to the controller on each server, for tests that need to manually start tenants with specific parameters.
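
A hypothetical shape for that second idea: a cluster-level entry point that seeds the args on every server before the tenant is created (the method name and plumbing are invented for illustration):

    // Seed the tenant args on every server's controller first, then create
    // the tenant once; a server that later starts the tenant via the
    // rangefeed already has the knobs.
    func (tc *TestCluster) StartSharedProcessTenantEverywhere(
        ctx context.Context, args base.TestSharedProcessTenantArgs,
    ) (serverutils.ApplicationLayerInterface, error) {
        for _, ts := range tc.Servers {
            ts.topLevelServer.serverController.mu.Lock()
            ts.topLevelServer.serverController.mu.testArgs[args.TenantName] = args
            ts.topLevelServer.serverController.mu.Unlock()
        }
        tenant, _, err := tc.Servers[0].StartSharedProcessTenant(ctx, args)
        return tenant, err
    }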

yuzefovich commented 11 months ago

Thanks for the suggestions, I'll explore them. Another idea might be to delay starting the controller goroutine until Activate has been called on all servers in the cluster, to simply eliminate the race.
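
One way to express that gating, as a rough sketch with entirely hypothetical names: hold the controller's rangefeed watcher on a channel that closes only after the last Activate.

    // Hypothetical gate (assumes "context" and "sync/atomic" imports): the
    // watcher goroutine does not run until Activate has been called on
    // every server, eliminating the window in which a tenant can be
    // started without its testing knobs.
    type activateGate struct {
        pending   atomic.Int32  // servers still awaiting Activate
        activated chan struct{} // closed once pending reaches zero
    }

    func (g *activateGate) markActivated() {
        if g.pending.Add(-1) == 0 {
            close(g.activated)
        }
    }

    func (g *activateGate) runWatcher(ctx context.Context, watch func(context.Context)) {
        select {
        case <-g.activated:
            watch(ctx) // knobs are registered everywhere by now
        case <-ctx.Done():
        }
    }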