cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.98k stars 3.79k forks source link

statusccl: TestTenantCannotSeeNonTenantStats fails under stressrace #113984

Open rafiss opened 11 months ago

rafiss commented 11 months ago

Seen here on a 23.2 PR: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/12558118?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

=== RUN   TestTenantCannotSeeNonTenantStats
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/8bc2de686563d04c3522b0df244fc129/logTestTenantCannotSeeNonTenantStats1233868257
    test_log_scope.go:81: use -show-logs to present logs inline
panic: test timed out after 1m30s

Jira issue: CRDB-33281

msbutler commented 10 months ago

hit this again https://teamcity.cockroachdb.com/viewLog.html?buildId=12768800&buildTypeId=Cockroach_BazelExtendedCi

msbutler commented 10 months ago

and again https://teamcity.cockroachdb.com/viewLog.html?buildId=12889108&buildTypeId=Cockroach_BazelExtendedCi

stevendanna commented 10 months ago

Some notes with no conclusion.

Looking at the logs from the most recent failure posted by @msbutler we see that this test timed out in 1m48s

panic: test timed out after 1m48s

Just over 1 minute of the runtime is spent before we even start a tenant. It is spent on test cluster startup and wait for full replication:

I231129 04:11:59.021217 LOG STARTS
I231129 04:13:00.002924 12 testutils/testcluster/testcluster.go:1422 â‹® [-] 430  WaitForFullReplication took: 43.69734821s

It's another 5 seconds until we actually start trying to create the tenant:

I231129 04:13:05.414871 12 1@server/tenant.go:298 â‹® [-] 459  server starting for tenant "10"

The tenant startup gets to the permanent migrations about 5 seconds later:

I231129 04:13:11.454353 12 upgrade/upgrademanager/manager.go:169 â‹® [T10,Vcluster-10,nsql1] 495  running permanent upgrades up to version: 23.1-32

The first of these migrations took 7 seconds. With subsequent migrations taking about the same amount of time until the timeout.

I231129 04:13:15.159947 12 upgrade/upgrademanager/manager.go:742 ⋮ [T10,Vcluster-10,nsql1] 518  running ‹Upgrade to 0.0-2: "add users and roles"›
I231129 04:13:15.160355 17885 jobs/adopt.go:262 â‹® [T10,Vcluster-10,nsql1] 519  job 921307687705804801: resuming execution
I231129 04:13:22.747641 12 jobs/wait.go:153 â‹® [T10,Vcluster-10,nsql1] 548  waited for 1 [921307687705804801] queued jobs to complete 7.587508413s

Two things standout to me: (1) we are slow enough that the TestServer start up itself ate up over half of our allotted time. 1 minute to start up a 3 node startup cluster. (2) 5-7 seconds per migration job is very bad. It is between 5-10x slower than the migrations took in the system tenant in the same run.

Also of interest is that this test has some curious config:

        ServerArgs: base.TestServerArgs{
            Knobs: base.TestingKnobs{
                SpanConfig: &spanconfig.TestingKnobs{
                    ManagerDisableJobCreation: true, // TODO(irfansharif): #74919.
                }},
            DefaultTestTenant: base.TestControlsTenantsExplicitly,
        },

That was added in response to https://github.com/cockroachdb/cockroach/issues/74919

I've been working on a series of changes to reduce the overhead of the job system that might be interesting to see in use on this test. We might also see the impact of the DontUseJobs testing knob.

andrewbaptist commented 9 months ago

Seen here as well: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/13197454?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

yuzefovich commented 8 months ago

Saw it here under stressrace too. I'll send a patch to skip under stressrace.

yuzefovich commented 8 months ago

Oh, it's already skipped under stress - I'll backport #115832 to 23.2.

stevendanna commented 8 months ago

Lowering to P-3 as this is now skipped.