cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: replicate/wide failed #135270

Open cockroach-teamcity opened 2 weeks ago

cockroach-teamcity commented 2 weeks ago

roachtest.replicate/wide failed with artifacts on master @ 6610d705724a21c836f3521f75972e65d9e9e2d4:

(cluster.go:2343).Start: ~ COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?options=-ccluster%3Dsystem&sslcert=.%2Fcerts%2Fclient.root.crt&sslkey=.%2Fcerts%2Fclient.root.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-17772379-1731652882-136-n9cpu1/system/1731675828273968027?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
failed to connect to ``host=localhost user=root database=``: tls error (read tcp 127.0.0.1:52616->127.0.0.1:26257: i/o timeout)
Failed running "sql": COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/replicate/wide/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-44430

blathers-crl[bot] commented 2 weeks ago

cc @cockroachdb/test-eng

kvoli commented 2 weeks ago

EDIT: Ignore what I say below, see @DarrylWong's comment.

This is a metamorphic issue, where the test is setting the number of CPUs to 2 (and corresponding mem). I don't imagine it is very surprising that things failed (every node crashed due to OOM):

"machineType": "https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-east1-c/machineTypes/n2-standard-2",

and:

params: 2024/11/15 13:35:16 test_runner.go:2085: Roachtest Parameters:
{
 "arch": "amd64",
 "cloud": "gce",
 "coverageBuild": "false",
 "cpu": "1", <--------------- probably not going to succeed anytime soon
 "encrypted": "false",
 "fs": "ext4",
 "localSSD": "true",
 "metamorphicLeases": "leader",
 "runtimeAssertionsBuild": "false",
 "ssd": "0"
}

@cockroachdb/test-eng how can a test opt out of these configurations, specifically tests which are "heavy" and have little hope of passing with small CPU/mem?

DarrylWong commented 2 weeks ago

As far as I'm aware, we don't randomize cpu at all. The test itself is setting 1 CPU here. replicate/wide has been running on 1 CPU for at least a couple years now [commit from 2021]. So I don't think this is a metamorphic issue, unless there's a reason you believe otherwise?

kvoli commented 2 weeks ago

Oh, in that case ignore me entirely. I'll reassign myself. Apologies for the confusion.

kvoli commented 2 weeks ago

I'm really not too sure what's going on here. The processes exit at 12:57, as the test wants to stop them anyway, but with exit code 137 (OOM). The test itself doesn't fail until 13:23 (approx.). The profiles aren't that interesting. One thing that stood out is that this run has leader leases enabled, and we see:

W241115 12:57:29.355987 195 kv/kvserver/replica_store_liveness.go:66 ⋮ [T1,Vsystem,n9,s9,r45/14:{-}] 378  store not found for replica 1 in SupportFor

I don't have much to go off of, so I'm going to assume it is somehow related to leader leases and re-assign to some experts @miraradeva and @arulajmani.

cockroach-teamcity commented 1 week ago

roachtest.replicate/wide failed with artifacts on master @ eb2d2e19eb29d2747d9e267bd0612a69d066adad:

(cluster.go:2343).Start: ~ COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?options=-ccluster%3Dsystem&sslcert=.%2Fcerts%2Fclient.root.crt&sslkey=.%2Fcerts%2Fclient.root.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-17868467-1732171355-117-n9cpu1/system/1732194263682404152?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
failed to connect to ``host=localhost user=root database=``: tls error (read tcp 127.0.0.1:49080->127.0.0.1:26257: i/o timeout)
Failed running "sql": COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/replicate/wide/run_1


cockroach-teamcity commented 5 days ago

roachtest.replicate/wide failed with artifacts on master @ 67caf19d3998bb3ca1ada7e3c14486d505b68012:

(cluster.go:2343).Start: ~ COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true COCKROACH_CONNECT_TIMEOUT=1200 ./cockroach sql --url 'postgres://root@localhost:26257?options=-ccluster%3Dsystem&sslcert=.%2Fcerts%2Fclient.root.crt&sslkey=.%2Fcerts%2Fclient.root.key&sslmode=verify-full&sslrootcert=.%2Fcerts%2Fca.crt' -e "CREATE SCHEDULE IF NOT EXISTS test_only_backup FOR BACKUP INTO 'gs://cockroachdb-backup-testing/roachprod-scheduled-backups/teamcity-17918524-1732603257-132-n9cpu1/system/1732625644035447369?AUTH=implicit' RECURRING '*/15 * * * *' FULL BACKUP '@hourly' WITH SCHEDULE OPTIONS first_run = 'now'"
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.
failed to connect to ``host=localhost user=root database=``: tls error (read tcp 127.0.0.1:35254->127.0.0.1:26257: i/o timeout)
Failed running "sql": COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/replicate/wide/run_1
