Description of the problem including expected versus actual behavior:
Rally will stop retrying to collect stats telemetry once it has failed too many times.
At the time of the last stats collection attempt, the benchmark showed a steady and prolonged increase in average bulk indexing latency.
Rally recorded 0 bulk indexing failures, though indexing throughput dropped significantly.
Subsequent manual stats calls to the cluster were successful.
Provide logs (if relevant):
2023-08-25 16:30:33,699 ActorAddr-(T|:45481)/PID:7942 esrally.telemetry ERROR Could not determine master node stats
Traceback (most recent call last):
File "~/rally/esrally/telemetry.py", line 172, in run
self.recorder.record()
File "~/rally/esrally/telemetry.py", line 2249, in record
info = self.client.nodes.info(node_id=state["master_node"], metric="os")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
return api(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/nodes.py", line 249, in info
return self.perform_request( # type: ignore[return-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py", line 390, in perform_request
return self._client.perform_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/rally/esrally/client/synchronous.py", line 226, in perform_request
raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(message=message, meta=meta, body=resp_body)
elasticsearch.ApiError: ApiError(503, "{'ok': False, 'message': 'The requested resource is currently unavailable.'}")
The benchmark was using the default node-stats-sample-interval of 1s. One second seems aggressive, and I will try with a value of 10s. We might consider a new default.
Rally version (get with
esrally --version
):esrally 2.9.0.dev0 (git revision: 50ebcb68d9f09de545a1bfb217fc9840b97a367e)
Description of the problem including expected versus actual behavior:
Rally will stop retrying to collect stats telemetry once it has failed too many times.
Provide logs (if relevant):
The benchmark was using the default
node-stats-sample-interval
of1s
. One second seems aggressive, and I will try with a value of10s
. We might consider a new default.