Closed herkolategan closed 11 months ago
I looked into this, and I believe the problem is that we have a race when starting shared-process tenancy in a multi-node cluster. Namely, the race is between TestServerController.Activate (which explicitly starts a shared-process tenant on a server, with the correct testing knobs) and the serverController running on n2 and n3 starting the tenant itself because it was notified via the rangefeed; crucially, at that point the controller is not aware of the testing knobs. (I ran into a similar problem in c5d847f65f20b4193eee397369642ea7c8c5a0dc, where I had to tweak the server initialization order to remove the race.)
I briefly looked into how we can fix this race, and I didn't immediately see an easy way to do so. @stevendanna, curious if you have some ideas around this. I think we should fix this properly sooner rather than later, and I'm happy to do that, but I'd appreciate some pointers.
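To make the race concrete, here is a minimal, hypothetical sketch (toy types, not the actual serverController code) of the two paths starting the same tenant: an explicit Activate that carries testing knobs, and a rangefeed-driven watcher that does not. Whichever path starts the tenant first wins, and the second call is a no-op, so if the watcher wins, the knobs are silently dropped.

```go
package main

import (
	"fmt"
	"sync"
)

// controller is a toy stand-in for serverController; all names here are
// illustrative, not the real CockroachDB types.
type controller struct {
	mu       sync.Mutex
	testArgs map[string]string // tenant name -> knobs (simplified to a string)
	started  map[string]string // tenant name -> knobs it was started with
}

func newController() *controller {
	return &controller{testArgs: map[string]string{}, started: map[string]string{}}
}

// startTenant starts the named tenant with whatever knobs are registered
// at that moment. If the rangefeed watcher wins the race, testArgs is
// still empty and the tenant starts without knobs.
func (c *controller) startTenant(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.started[name]; ok {
		return // already started; later callers are no-ops
	}
	c.started[name] = c.testArgs[name] // "" if knobs were not registered yet
}

// activate registers the knobs and then starts the tenant, loosely
// mirroring what TestServerController.Activate does.
func (c *controller) activate(name, knobs string) {
	c.mu.Lock()
	c.testArgs[name] = knobs
	c.mu.Unlock()
	c.startTenant(name)
}

func main() {
	// The rangefeed watcher fires first: the tenant starts with no knobs,
	// and the later activate is a no-op because the tenant already exists.
	c := newController()
	c.startTenant("app") // watcher path, knobs not registered yet
	c.activate("app", "testing-knobs")
	fmt.Printf("knobs seen by tenant: %q\n", c.started["app"])
}
```

Running this prints `knobs seen by tenant: ""`: the tenant never sees the knobs even though Activate supplied them.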
(BTW, an easy way to reliably repro this problem on TestTraceAnalyzer is to add time.Sleep(time.Second) right after each s.Activate call in TestCluster.Start.)
@yuzefovich I agree we need to fix this. I suspect this will be one of the main causes of shared-process test issues.
TestServer has code to inject the testing knobs into the server controller in StartSharedProcessTenant:
// Save the args for use if the server needs to be created.
func() {
	ts.topLevelServer.serverController.mu.Lock()
	defer ts.topLevelServer.serverController.mu.Unlock()
	ts.topLevelServer.serverController.mu.testArgs[args.TenantName] = args
}()
But this only happens on the first server, not on the other servers. I haven't thought about this much, but perhaps we could do two things. For the randomized tenants, we always create the tenant with the same name:
func (ts *testServer) startSharedProcessDefaultTestTenant(
	ctx context.Context,
) (serverutils.ApplicationLayerInterface, error) {
	params := base.TestSharedProcessTenantArgs{
		TenantName:  "test-tenant",
		TenantID:    serverutils.TestTenantID(),
		Knobs:       ts.params.Knobs,
		UseDatabase: ts.params.UseDatabase,
	}
	// See comment above on separate process tenant regarding the testing knobs.
	params.Knobs.Server = &TestingKnobs{}
	if ts.params.Knobs.Server != nil {
		params.Knobs.Server.(*TestingKnobs).DiagnosticsTestingKnobs = ts.params.Knobs.Server.(*TestingKnobs).DiagnosticsTestingKnobs
	}
	tenant, _, err := ts.StartSharedProcessTenant(ctx, params)
	if err != nil {
		return nil, err
	}
	return tenant, err
}
So, maybe instead of waiting to populate those tenant testing knobs until we start the tenant, we could unconditionally add them to the server controller during test-server startup, so that if a test tenant is started, it will get the testing knobs by default.
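That suggestion can be sketched as follows (purely illustrative, using toy types rather than the real serverController): register the test args with the controller as part of test-server startup, before any rangefeed notification can start a tenant, so whichever path starts the tenant first finds the knobs already in place.

```go
package main

import (
	"fmt"
	"sync"
)

// Toy stand-ins; none of these are the real CockroachDB types.
type controller struct {
	mu       sync.Mutex
	testArgs map[string]string
	started  map[string]string
}

func newController() *controller {
	return &controller{testArgs: map[string]string{}, started: map[string]string{}}
}

// registerTestArgs is called during test-server startup, before the
// controller's watcher goroutine is allowed to start any tenant.
func (c *controller) registerTestArgs(name, knobs string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.testArgs[name] = knobs
}

// startTenant is then safe to call from either the watcher or Activate:
// both paths see the pre-registered knobs.
func (c *controller) startTenant(name string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if k, ok := c.started[name]; ok {
		return k
	}
	c.started[name] = c.testArgs[name]
	return c.started[name]
}

func main() {
	c := newController()
	c.registerTestArgs("test-tenant", "testing-knobs") // during startup
	// The watcher path fires first; it still gets the knobs.
	fmt.Println(c.startTenant("test-tenant"))
}
```

The key property is that registration happens-before any tenant start, so the order in which the two start paths fire no longer matters.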
Then perhaps we could also provide an API for starting tenants on a test cluster that adds the knobs to the controller on each server, for tests that need to manually start tenants with specific parameters.
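A cluster-level entry point for that could look roughly like this (hypothetical, heavily simplified types and names, not the real test-cluster API): register the args on every server's controller first, and only then start the tenant on one node, so the rangefeed-driven starts on the other nodes find the knobs waiting for them.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the test-cluster types.
type tenantArgs struct {
	TenantName string
	Knobs      string
}

type server struct {
	testArgs map[string]string // this server controller's knob registry
}

type testCluster struct {
	servers []*server
}

// StartSharedProcessTenant registers the testing knobs with every
// server's controller before any of them can start the tenant, then
// starts it on the first node.
func (tc *testCluster) StartSharedProcessTenant(args tenantArgs) string {
	for _, s := range tc.servers {
		s.testArgs[args.TenantName] = args.Knobs
	}
	// Start on node 1; n2/n3 will pick up the same args when their
	// rangefeed fires.
	return tc.servers[0].testArgs[args.TenantName]
}

func main() {
	tc := &testCluster{servers: []*server{
		{testArgs: map[string]string{}},
		{testArgs: map[string]string{}},
		{testArgs: map[string]string{}},
	}}
	got := tc.StartSharedProcessTenant(tenantArgs{TenantName: "app", Knobs: "k"})
	fmt.Println("started with knobs:", got)
	for i, s := range tc.servers {
		fmt.Printf("n%d has knobs: %v\n", i+1, s.testArgs["app"] == "k")
	}
}
```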
Thanks for the suggestions, I'll explore them. Another idea might be to delay starting the controller goroutine until Activate is called on all servers in the cluster, to simply eliminate the race.
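That gating idea could be sketched with a simple barrier (again, toy code, not the real controller): each server's watcher goroutine blocks on a gate that only opens once Activate has run on every server in the cluster.

```go
package main

import (
	"fmt"
	"sync"
)

// gate blocks watcher goroutines until every server has been activated.
type gate struct {
	wg sync.WaitGroup
}

func newGate(numServers int) *gate {
	g := &gate{}
	g.wg.Add(numServers)
	return g
}

// activated is called by each server's Activate after registering knobs.
func (g *gate) activated() { g.wg.Done() }

// wait is called by each watcher goroutine before it may start tenants.
func (g *gate) wait() { g.wg.Wait() }

func main() {
	const n = 3
	g := newGate(n)
	var out sync.Mutex
	var started []int

	var wg sync.WaitGroup
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(node int) {
			defer wg.Done()
			g.wait() // the watcher cannot start a tenant until all Activates ran
			out.Lock()
			started = append(started, node)
			out.Unlock()
		}(i)
	}
	for i := 1; i <= n; i++ {
		g.activated() // each Activate registers knobs, then opens the gate
	}
	wg.Wait()
	fmt.Println("watchers released:", len(started))
}
```

One trade-off worth noting: delaying the controller goroutine changes startup ordering for all callers, whereas pre-registering the knobs only changes the test path.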
Describe the problem
When running TestTraceAnalyzer with a shared process virtual cluster, the expected result is not always present.
How to reproduce
Set the following in TestServerArgs to force a shared process tenant and cause the failure to happen.
Expected behavior
The test should pass when running with a shared process virtual cluster, unless after investigation the test is expected not to work with one.
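For reference, one plausible way to force the shared-process configuration in TestServerArgs looks like the fragment below. The option name is an assumption from memory of the CockroachDB test framework's DefaultTestTenantOptions values, so verify it against your tree before relying on it.

```go
// Hypothetical sketch; check the DefaultTestTenantOptions values in
// pkg/base/test_server_args.go for the exact name in your version.
args := base.TestServerArgs{
	DefaultTestTenant: base.SharedTestTenantAlwaysEnabled,
}
```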
Ref: https://github.com/cockroachdb/cockroach/issues/112857 Epic: CRDB-26687
Jira issue: CRDB-33065