stats telemetry stops collecting after it encounters a server error

Rally version (get with esrally --version): esrally 2.9.0.dev0 (git revision: 50ebcb68d9f09de545a1bfb217fc9840b97a367e)

esrally race --pipeline=benchmark-only --track-repository="default" --track="nyc_taxis" --challenge="autoscale" --telemetry='["node-stats", "shard-stats", "blob-store-stats"]' --on-error="continue" --target-hosts=target-hosts.json --client-options=client-options.json --track-params=track-params.json --telemetry-params=telemetry-params.json --user-tags=user-tags.json --race-id=c5420fb2-d073-4a6f-a54a-f98244e9b74b --load-driver-hosts=127.0.0.1

Description of the problem including expected versus actual behavior:

Rally stops trying to collect stats telemetry once collection has failed too many times.

At the time of the last stats collection attempt, the benchmark showed a steady and prolonged increase in average bulk indexing latency.
Rally recorded 0 bulk indexing failures, though indexing throughput dropped significantly.
Subsequent manual stats calls to the cluster were successful.
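
To illustrate the behavior I would expect, here is a minimal sketch of a sampling loop that logs a failed attempt and tries again at the next interval instead of giving up. It is not Rally's actual implementation; `run_sampler` and its parameters are hypothetical names used only for this example.

```python
import logging
import time

logger = logging.getLogger("telemetry-sketch")


def run_sampler(record, interval_s=1.0, max_samples=None, sleep=time.sleep):
    """Call record() every interval_s seconds and keep going after failures.

    Hypothetical helper, not part of Rally. Returns (attempts, failures).
    With max_samples=None the loop runs until the process is stopped.
    """
    attempts = 0
    failures = 0
    while max_samples is None or attempts < max_samples:
        attempts += 1
        try:
            record()
        except Exception:
            # A transient error (such as a 503) is logged, and the next
            # interval gets a fresh attempt.
            failures += 1
            logger.exception("Could not record sample %d; retrying at next interval", attempts)
        sleep(interval_s)
    return attempts, failures
```

With a loop like this, a temporary 503 from the cluster would leave a gap in the collected stats rather than ending collection for the rest of the race.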

Provide logs (if relevant):

2023-08-25 16:30:33,699 ActorAddr-(T|:45481)/PID:7942 esrally.telemetry ERROR Could not determine master node stats
Traceback (most recent call last):
  File "~/rally/esrally/telemetry.py", line 172, in run
    self.recorder.record()
  File "~/rally/esrally/telemetry.py", line 2249, in record
    info = self.client.nodes.info(node_id=state["master_node"], metric="os")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
    return api(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/nodes.py", line 249, in info
    return self.perform_request(  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py", line 390, in perform_request
    return self._client.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/rally/esrally/client/synchronous.py", line 226, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(message=message, meta=meta, body=resp_body)
elasticsearch.ApiError: ApiError(503, "{'ok': False, 'message': 'The requested resource is currently unavailable.'}")

The benchmark was using the default node-stats-sample-interval of 1s. One second seems aggressive, and I will try with a value of 10s. We might consider a new default.
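
For anyone who wants to try the same workaround, the sketch below sets node-stats-sample-interval to 10 in the telemetry-params.json file passed via --telemetry-params. It assumes the value is given in seconds and that the file sits in the current directory. The other entries in my file are not shown here, so the script keeps whatever is already present.

```python
import json
from pathlib import Path

# Assumed location: the file passed to --telemetry-params, in the current directory.
path = Path("telemetry-params.json")

# Keep any parameters already in the file; start empty if the file does not exist.
params = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

# Sample node stats every 10 seconds instead of the default of 1 second.
params["node-stats-sample-interval"] = 10

path.write_text(json.dumps(params, indent=2) + "\n", encoding="utf-8")
```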

elastic / rally

stats telemetry stops collecting after it encounters a server error #1771