Open cockroach-teamcity opened 2 weeks ago
cc @cockroachdb/test-eng
EDIT: Ignore what I say below, see @DarrylWong's comment.
This is a metamorphic issue, where the test is setting the number of CPUs to 2 (and corresponding mem). I don't imagine it is very surprising that things failed (every node crashed due to OOM):
"machineType": "https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-east1-c/machineTypes/n2-standard-2",
and:
params: 2024/11/15 13:35:16 test_runner.go:2085: Roachtest Parameters:
{
"arch": "amd64",
"cloud": "gce",
"coverageBuild": "false",
"cpu": "1", <--------------- probably not going to succeed anytime soon
"encrypted": "false",
"fs": "ext4",
"localSSD": "true",
"metamorphicLeases": "leader",
"runtimeAssertionsBuild": "false",
"ssd": "0"
}
@cockroachdb/test-eng how can a test opt out of these configurations, specifically tests which are "heavy" and have little hope of passing with small CPU/mem?
As far as I'm aware, we don't randomize cpu at all. The test itself is setting 1 CPU here. replicate/wide has been running on 1 CPU for at least a couple years now [commit from 2021]. So I don't think this is a metamorphic issue, unless there's a reason you believe otherwise?
Oh, in that case ignore me entirely. I'll reassign myself. Apologies for the confusion.
I'm really not too sure what's going on here. The processes exit at 12:57, as the test wants to stop them anyway, but with a 137 (OOM). The test itself doesn't fail until 13:23 (approx.). The profiles aren't that interesting. One thing that stood out is that this run has leader leases enabled and we see:
W241115 12:57:29.355987 195 kv/kvserver/replica_store_liveness.go:66 ⋮ [T1,Vsystem,n9,s9,r45/14:{-}] 378 store not found for replica 1 in SupportFor
I don't have much to go off of, so I'm going to assume it is somehow related to leader leases and re-assign to some experts @miraradeva and @arulajmani.
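For readers unfamiliar with the "137" mentioned above: by shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). That is what the kernel OOM killer sends, but SIGKILL alone does not prove an OOM; the kernel log would be needed to confirm it. A minimal sketch of that decoding follows. The helper names `signalFromExitStatus` and `describeExit` are hypothetical and are not part of roachtest or roachprod.

```go
package main

import "fmt"

// sigkill is the signal number of SIGKILL on Linux.
const sigkill = 9

// signalFromExitStatus interprets a shell-style exit status.
// By convention, a status in the range 129..255 means the process was
// terminated by signal (status - 128). Hypothetical helper for illustration.
func signalFromExitStatus(status int) (sig int, ok bool) {
	if status > 128 && status < 256 {
		return status - 128, true
	}
	return 0, false
}

// describeExit renders a short, human-readable explanation of an exit status.
func describeExit(status int) string {
	sig, ok := signalFromExitStatus(status)
	if !ok {
		return fmt.Sprintf("exit status %d: not a signal exit", status)
	}
	if sig == sigkill {
		// SIGKILL is consistent with the OOM killer but can have other senders.
		return fmt.Sprintf("exit status %d: killed by SIGKILL (possible OOM kill)", status)
	}
	return fmt.Sprintf("exit status %d: terminated by signal %d", status, sig)
}

func main() {
	fmt.Println(describeExit(137))
	fmt.Println(describeExit(1))
}
```

Since the test was stopping the nodes at that time anyway, a SIGKILL from the test harness itself would produce the same status, which is one reason the 137 on its own is ambiguous here.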
roachtest.replicate/wide failed with artifacts on master @ eb2d2e19eb29d2747d9e267bd0612a69d066adad:
(cluster.go:2343).Start: ~ COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?options=-ccluster%3Dsystem&sslcert=.%2Fcerts%2Fclient.root.crt&sslkey=.%2Fcerts%2Fclient.root.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-17868467-1732171355-117-n9cpu1/system/1732194263682404152?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
failed to connect to ``host=localhost user=root database=``: tls error (read tcp 127.0.0.1:49080->127.0.0.1:26257: i/o timeout)
Failed running "sql": COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/replicate/wide/run_1
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=1
encrypted=false
fs=ext4
localSSD=true
metamorphicLeases=leader
runtimeAssertionsBuild=false
ssd=0
See: roachtest README
See: How To Investigate (internal)
See: Grafana
roachtest.replicate/wide failed with artifacts on master @ 67caf19d3998bb3ca1ada7e3c14486d505b68012:
(cluster.go:2343).Start: ~ COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?options=-ccluster%3Dsystem&sslcert=.%2Fcerts%2Fclient.root.crt&sslkey=.%2Fcerts%2Fclient.root.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-17918524-1732603257-132-n9cpu1/system/1732625644035447369?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
failed to connect to ``host=localhost user=root database=``: tls error (read tcp 127.0.0.1:35254->127.0.0.1:26257: i/o timeout)
Failed running "sql": COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/replicate/wide/run_1
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=1
encrypted=false
fs=ext4
localSSD=true
metamorphicLeases=leader
runtimeAssertionsBuild=false
ssd=0
roachtest.replicate/wide failed with artifacts on master @ 6610d705724a21c836f3521f75972e65d9e9e2d4:
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=1
encrypted=false
fs=ext4
localSSD=true
metamorphicLeases=leader
runtimeAssertionsBuild=false
ssd=0
/cc @cockroachdb/kv-triage
Jira issue: CRDB-44430