cockroach-teamcity opened 2 days ago
Noting that we're hitting the timeout (operation "get-user-session" timed out after 10.001s) without a node restart. This differs from the other failures, where the timeout is always preceded by a restart.
(This one repros too!)
I spent a while investigating this (and still am). It appears to be the same issue as here: https://github.com/cockroachdb/cockroach/issues/131695#issuecomment-2394404634
Interestingly, this seed seems to always hit pq: internal errror while retrieving user account memberships, while other seeds always time out.
Here is this test's plan:
├── install fixtures for version "v24.2.2" (1)
├── start cluster at version "v24.2.2" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on system tenant (3)
├── start separate process virtual cluster mixed-version-tenant-yjgv1 with binary version v24.2.2 (4)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on mixed-version-tenant-yjgv1 tenant (5)
├── set cluster setting "spanconfig.tenant_limit" to '50000' on mixed-version-tenant-yjgv1 tenant (6)
├── set cluster setting "server.secondary_tenants.authorization.mode" to 'allow-all' on system tenant (7)
├── run startup hooks concurrently
│ ├── set cluster setting "storage.ingest_split.enabled" to 'false' on system tenant, after 5s delay (8)
│ ├── set cluster setting "kv.snapshot_receiver.excise.enabled" to 'false' on system tenant, after 100ms delay (9)
│ ├── run "maybe enable tenant features", after 500ms delay (10)
│ ├── run "load TPCC dataset", after 5s delay (11)
│ ├── set cluster setting "kv.snapshot_receiver.excise.enabled" to 'true' on system tenant, after 30s delay (12)
│ └── run "load bank dataset", after 100ms delay (13)
Here is another plan I generated that always times out. (The seed is -3495143878866629985, but I also set maxUpgrade(1) and restricted it to separate-process only.)
├── install fixtures for version "v24.2.2" (1)
├── start cluster at version "v24.2.2" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on system tenant (3)
├── start separate process virtual cluster mixed-version-tenant-2vtpm with binary version v24.2.2 (4)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on mixed-version-tenant-2vtpm tenant (5)
├── set cluster setting "spanconfig.tenant_limit" to '50000' on mixed-version-tenant-2vtpm tenant (6)
├── set cluster setting "server.secondary_tenants.authorization.mode" to 'allow-all' on system tenant (7)
├── run startup hooks concurrently
│ ├── run "maybe enable tenant features", after 0s delay (8)
│ ├── run "load TPCC dataset", after 100ms delay (9)
│ └── run "load bank dataset", after 100ms delay (10)
I changed steps 8, 9, and 12 (the cluster setting mutators) to be no-ops in the first plan and still hit pq: internal errror while retrieving user account memberships. That means the only difference between this test timing out and erroring is the timing of the steps.
I'm not sure whether that means anything, but I'm comfortable saying this is probably the same issue as before.
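To make the timing difference between the two plans concrete, here is a small Python sketch that groups the startup hooks by the delay listed in each plan, with steps 8, 9, and 12 of the first plan dropped as in the no-op experiment. It is not the mixedversion framework's code: PLAN_A, PLAN_B, MUTATORS, and start_waves are hypothetical names, and it assumes each hook starts exactly when its listed delay elapses, ignoring scheduling jitter and hook duration.

```python
from itertools import groupby

# (hook name, delay in seconds) as listed under "run startup hooks concurrently".
PLAN_A = [  # first plan, steps 8-13
    ("set storage.ingest_split.enabled=false", 5.0),
    ("set kv.snapshot_receiver.excise.enabled=false", 0.1),
    ("maybe enable tenant features", 0.5),
    ("load TPCC dataset", 5.0),
    ("set kv.snapshot_receiver.excise.enabled=true", 30.0),
    ("load bank dataset", 0.1),
]
PLAN_B = [  # second plan, steps 8-10
    ("maybe enable tenant features", 0.0),
    ("load TPCC dataset", 0.1),
    ("load bank dataset", 0.1),
]
# Steps 8, 9, and 12 of the first plan (the cluster setting mutators).
MUTATORS = {
    "set storage.ingest_split.enabled=false",
    "set kv.snapshot_receiver.excise.enabled=false",
    "set kv.snapshot_receiver.excise.enabled=true",
}


def start_waves(hooks, skip=()):
    """Return [(delay, [hook names])] in start order.

    Hooks sharing a delay land in the same wave and race each other.
    Assumption: start time == listed delay (no jitter, durations ignored).
    """
    active = [h for h in hooks if h[0] not in skip]
    ordered = sorted(active, key=lambda h: h[1])  # stable: ties keep plan order
    return [
        (delay, [name for name, _ in group])
        for delay, group in groupby(ordered, key=lambda h: h[1])
    ]


if __name__ == "__main__":
    print("plan A (mutators as no-ops):", start_waves(PLAN_A, skip=MUTATORS))
    print("plan B:", start_waves(PLAN_B))
```

Under that assumption, the first plan starts "load bank dataset" about 400ms before "maybe enable tenant features", while the second plan starts the tenant features hook first and then begins both dataset loads together.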
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on release-24.3 @ 4cbedefd790c75cb0f21f77ed8d917c8528a7d15:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
Same failure on other branches
- #131695 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-master]
/cc @cockroachdb/test-eng
Jira issue: CRDB-43404