cockroachdb / cockroach

roachtest: tpcc/mixed-headroom/n5cpu16 failed #133007

Open cockroach-teamcity opened 2 days ago

cockroach-teamcity commented 2 days ago

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on release-24.3 @ 4cbedefd790c75cb0f21f77ed8d917c8528a7d15:

(mixedversion.go:732).Run: preparing to run step 8: failed to get binary version for node 1 (mixed-version-tenant-yjgv1): pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight load-value:authinfo-roachprod-2-2: context deadline exceeded
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #131695 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-master]

/cc @cockroachdb/test-eng

Jira issue: CRDB-43404

DarrylWong commented 1 day ago

Noting that we're hitting operation "get-user-session" timed out after 10.001s without a preceding node restart. This differs from the other failures, where the timeout is always preceded by a restart.

(this one repros too!)

DarrylWong commented 4 hours ago

Spent a while investigating this (and still am). Appears to be the same issue as here: https://github.com/cockroachdb/cockroach/issues/131695#issuecomment-2394404634

Interestingly, it seems like this seed always hits pq: internal error while retrieving user account memberships, while other seeds always time out.

Here is this test's plan:

├── install fixtures for version "v24.2.2" (1)
├── start cluster at version "v24.2.2" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on system tenant (3)
├── start separate process virtual cluster mixed-version-tenant-yjgv1 with binary version v24.2.2 (4)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on mixed-version-tenant-yjgv1 tenant (5)
├── set cluster setting "spanconfig.tenant_limit" to '50000' on mixed-version-tenant-yjgv1 tenant (6)
├── set cluster setting "server.secondary_tenants.authorization.mode" to 'allow-all' on system tenant (7)
├── run startup hooks concurrently
│   ├── set cluster setting "storage.ingest_split.enabled" to 'false' on system tenant, after 5s delay (8)
│   ├── set cluster setting "kv.snapshot_receiver.excise.enabled" to 'false' on system tenant, after 100ms delay (9)
│   ├── run "maybe enable tenant features", after 500ms delay (10)
│   ├── run "load TPCC dataset", after 5s delay (11)
│   ├── set cluster setting "kv.snapshot_receiver.excise.enabled" to 'true' on system tenant, after 30s delay (12)
│   └── run "load bank dataset", after 100ms delay (13)

Here is another plan I generated that always times out. (The seed is -3495143878866629985, but I also set maxUpgrade(1) and separate-process only.)

├── install fixtures for version "v24.2.2" (1)
├── start cluster at version "v24.2.2" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on system tenant (3)
├── start separate process virtual cluster mixed-version-tenant-2vtpm with binary version v24.2.2 (4)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on mixed-version-tenant-2vtpm tenant (5)
├── set cluster setting "spanconfig.tenant_limit" to '50000' on mixed-version-tenant-2vtpm tenant (6)
├── set cluster setting "server.secondary_tenants.authorization.mode" to 'allow-all' on system tenant (7)
├── run startup hooks concurrently
│   ├── run "maybe enable tenant features", after 0s delay (8)
│   ├── run "load TPCC dataset", after 100ms delay (9)
│   └── run "load bank dataset", after 100ms delay (10)

I changed steps 8, 9, and 12 (cluster setting mutators) to be noops in the first plan. I still hit pq: internal error while retrieving user account memberships. That means the only difference between this test timing out and it erroring is the timing of the steps.

Not really sure if that means anything, but I think I am comfortable with saying this is probably the same issue as before.