workload: gracefully handle down'd nodes during init

nicktrav commented 1 year ago

Describe the problem

Currently, if the first node provided to the workload generator is unavailable, the workload command errors out while running its init commands (i.e., when pre-splitting ranges):

Apr 19 21:40:35 grinaker-231-0009 values.sh[1928882]: I230419 21:40:35.926145 1 workload/cli/run.go:397  [-] 16  retrying after error during init: executing ALTER TABLE kv SPLIT AT VALUES (-5592993832538310608): EOF

To Reproduce

Run the workload command against a cluster while the node corresponding to the first URI in the list passed to the command is unavailable (down).

There is an example KV command in this internal doc.

Expected behavior

The workload command should be able to pick another node to use as the "admin" node.

Jira issue: CRDB-27170

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/test-eng

srosenberg commented 1 week ago

We haven't seen this failure mode in any of the nightlies or DRT clusters in a (long) while. For future reference, we may need to look into a load-balanced pgurl, which long-running clusters already support (via server-side LB).

cockroachdb / cockroach

workload: gracefully handle down'd nodes during init #101879