dliappis opened this issue 2 years ago
Would love to have an environment config that reliably exacerbates this problem. I'm very curious about the impact of http_compression, for instance, on the performance of the metrics store activity and the resulting backpressure on the metrics messages.
The issue reproduces reliably with two large-ish load drivers during the standalone-query-string-search operation of the wikipedia track. On every occurrence I've observed:

`standalone_search_clients: 31`
In various cases, e.g. when using lots of clients or collecting lots of metrics via telemetry devices, we have witnessed hanging benchmarks. In summary: when too many log messages or too many metrics are sent, the benchmark may end up hanging.
An example error message (from `~/.rally/logs/actor-system-internal.log`):

whereas the actual `~/.rally/logs/rally.log` for the same actor shows:

So the worker did reach the join point, but it could not inform the Driver actor about it because the message could not be sent (it timed out five minutes later, at 20:48:38). One reason for this behavior is that the actor system's internal message queue was full. Unfortunately, the queue length is not configurable (see `MAX_QUEUED_TRANSMITS`).

This issue is about what we can do to improve resiliency while still using the actor system.
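For illustration, here is a minimal sketch of one possible direction, assuming Thespian (the actor library Rally's actor system is built on): coalescing many small metrics messages into fewer, larger ones so the bounded transmit queue is less likely to fill up and starve control messages such as the join-point notification. The `Worker` class, the message classes, and the batch threshold below are hypothetical and only illustrate the idea; they are not Rally's actual implementation.

```python
from thespian.actors import Actor


class MetricsBatch:
    """A single actor message carrying many metrics samples."""

    def __init__(self, samples):
        self.samples = samples


class JoinPointReached:
    """Control message telling the driver a worker finished its current task."""

    def __init__(self, worker_id, task_index):
        self.worker_id = worker_id
        self.task_index = task_index


class Worker(Actor):
    # Hypothetical threshold: send one message per N samples instead of one
    # message per sample.
    BATCH_SIZE = 500

    def __init__(self):
        super().__init__()
        self.driver = None
        self.samples = []

    def receiveMessage(self, message, sender):
        if isinstance(message, dict) and message.get("type") == "start":
            # Remember the coordinating actor so we can report back later.
            self.driver = sender
        elif isinstance(message, dict) and message.get("type") == "sample":
            # Buffer locally instead of sending one actor message per sample,
            # which keeps the transport's outbound queue short.
            self.samples.append(message["data"])
            if len(self.samples) >= self.BATCH_SIZE:
                self.flush()
        elif isinstance(message, dict) and message.get("type") == "task_finished":
            # Drain buffered metrics first, then send the small control message
            # so it is not stuck behind thousands of queued metrics transmits.
            self.flush()
            self.send(self.driver, JoinPointReached(message["worker_id"],
                                                    message["task_index"]))

    def flush(self):
        if self.samples and self.driver is not None:
            self.send(self.driver, MetricsBatch(self.samples))
            self.samples = []
```

Batching does not remove the hard cap on queued transmits, but it slows how quickly the queue fills and keeps control messages from queueing behind a flood of per-sample metrics messages.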
A few ideas we could explore are: