Open rafiss opened 11 months ago
Some notes with no conclusion.
Looking at the logs from the most recent failure posted by @msbutler we see that this test timed out in 1m48s
panic: test timed out after 1m48s
Just over 1 minute of the runtime is spent before we even start a tenant. It is spent on test cluster startup and wait for full replication:
I231129 04:11:59.021217 LOG STARTS
I231129 04:13:00.002924 12 testutils/testcluster/testcluster.go:1422 â‹® [-] 430 WaitForFullReplication took: 43.69734821s
It's another 5 seconds until we actually start trying to create the tenant:
I231129 04:13:05.414871 12 1@server/tenant.go:298 â‹® [-] 459 server starting for tenant "10"
The tenant startup gets to the permanent migrations about 5 seconds later:
I231129 04:13:11.454353 12 upgrade/upgrademanager/manager.go:169 â‹® [T10,Vcluster-10,nsql1] 495 running permanent upgrades up to version: 23.1-32
The first of these migrations took 7 seconds. With subsequent migrations taking about the same amount of time until the timeout.
I231129 04:13:15.159947 12 upgrade/upgrademanager/manager.go:742 ⋮ [T10,Vcluster-10,nsql1] 518 running ‹Upgrade to 0.0-2: "add users and roles"›
I231129 04:13:15.160355 17885 jobs/adopt.go:262 â‹® [T10,Vcluster-10,nsql1] 519 job 921307687705804801: resuming execution
I231129 04:13:22.747641 12 jobs/wait.go:153 â‹® [T10,Vcluster-10,nsql1] 548 waited for 1 [921307687705804801] queued jobs to complete 7.587508413s
Two things standout to me: (1) we are slow enough that the TestServer start up itself ate up over half of our allotted time. 1 minute to start up a 3 node startup cluster. (2) 5-7 seconds per migration job is very bad. It is between 5-10x slower than the migrations took in the system tenant in the same run.
Also of interest is that this test has some curious config:
ServerArgs: base.TestServerArgs{
Knobs: base.TestingKnobs{
SpanConfig: &spanconfig.TestingKnobs{
ManagerDisableJobCreation: true, // TODO(irfansharif): #74919.
}},
DefaultTestTenant: base.TestControlsTenantsExplicitly,
},
That was added in response to https://github.com/cockroachdb/cockroach/issues/74919
I've been working on a series of changes to reduce the overhead of the job system that might be interesting to see in use on this test. We might also see the impact of the DontUseJobs testing knob.
Saw it here under stressrace too. I'll send a patch to skip under stressrace.
Oh, it's already skipped under stress - I'll backport #115832 to 23.2.
Lowering to P-3 as this is now skipped.
Seen here on a 23.2 PR: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/12558118?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true
Jira issue: CRDB-33281