Single-node clusters use non-trivial discovery config when they shouldn't

danielmitterdorfer commented 1 year ago

Rally version (get with esrally --version): esrally 2.7.1.dev0 (git revision: 4e1335ee07e16a4af0de1a0b543a86595ca37803) (current master branch)

Description of the problem including expected versus actual behavior:

When running a benchmark with Rally against a single node cluster, the Elasticsearch logs contain repeated warning messages like the following:

[2023-01-04T09:39:10,994][WARN ][o.e.c.c.Coordinator      ] [rally-node-0] This node is a fully-formed single-node cluster with cluster UUID [o1zdVpJiSMOVfeSz9nye_A], but it is configured as if to discover other nodes and form a multi-node cluster via the [discovery.seed_hosts=[127.0.0.1]] setting. Fully-formed clusters do not attempt to discover other nodes, and nodes with different cluster UUIDs cannot belong to the same cluster. The cluster UUID persists across restarts and can only be changed by deleting the contents of the node's data path(s). Remove the discovery configuration to suppress this message.

This indicates that the discovery configuration is not correct, see https://github.com/elastic/elasticsearch/issues/85222 for full details.

Steps to reproduce:

esrally race --track=geonames --challenge=append-no-conflicts-index-only --distribution-version=8.5.3
Inspect the corresponding Elasticsearch log

This can be fixed by not rendering the following line in rally-teams when there is only one node:

https://github.com/elastic/rally-teams/blob/74f96e3fab247e0c31d253408e4affd696a50eff/cars/v1/vanilla/templates/config/elasticsearch.yml#L72

However, as we pass the node ips as a pre-rendered string in the provisioner we cannot use Jinja's length filter to determine the number of items (would return the string length instead):

https://github.com/elastic/rally/blob/4c7141aac0179c38c08c134c25f7a23e6f07d2d9/esrally/mechanic/provisioner.py#L320

We could instead either provide the node ips as list or instead add an additional variable in the provisioner that contains the node count, which can then be used in the template in rally-teams. The latter is even backwards-compatible if we check for the existence of the template variable before evaluating it.

inqueue commented 1 year ago

Thanks, @danielmitterdorfer. I will submit a PR proposal soon.

pquentin commented 1 year ago

This can be fixed by not rendering the following line in rally-teams when there is only one node:

We discovered the hard way that it's not about the number of nodes since https://github.com/elastic/rally-teams/pull/77 broke our nightly benchmarks with this error:

[2023-01-18T11:56:56,877][INFO ][o.e.t.TransportService ] [rally-node-0] publish_address {192.168.20.29:9300}, bound_addresses {192.168.20.29:9300}
[2023-01-18T11:56:56,985][INFO ][o.e.b.BootstrapChecks ] [rally-node-0] bound or publishing to a non-loopback address, enforcing bootstrap checks
    bootstrap check failure [1] of [1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured

It took me a lot of time to realize that this was an error even though this is an INFO log. I reverted in https://github.com/elastic/rally-teams/pull/78 until we find a better solution.

Indeed, the way Elasticsearch differentiates development vs. production mode is the address that Elasticsearch is bound to, not the number of nodes in the cluster: https://www.elastic.co/guide/en/elasticsearch/reference/8.6/bootstrap-checks.html#dev-vs-prod-mode. And on nightly benchmark that address is not localhost. So I suppose we could look at the ips and see if it's a loopback address, but we would have to get that check right (including 127.0.0.1, ::1, localhost and probably others? not sure what Elasticsearch does).

Or we could live with this warning.

elastic / rally

Single-node clusters use non-trivial discovery config when they shouldn't #1644