elastic / rally

Macrobenchmarking framework for Elasticsearch
Apache License 2.0

Prevent benchmark locks due to actor system internal message queue getting full #1448

Open dliappis opened 2 years ago

dliappis commented 2 years ago

In various cases, e.g. when using a large number of clients or collecting many metrics via telemetry devices, we have witnessed hanging benchmarks. In summary: when too many log messages or too many metrics are sent, the benchmark may end up hanging.

An example error message (from ~/.rally/logs/actor-system-internal.log):

...
2021-08-04 20:48:38.779581 p7907 Warn Transmit attempt from ActorAddr-(T|:41611) to ActorAddr-(T|:40337) timed out, returning PoisonPacket
2021-08-04 20:48:38.779778 p7907 I    completion error: ************* TransportIntent(ActorAddr-(T|:40337)-=-SENDSTS_FAILED-WAITSLOT-<class 'esrally.driver.driver.UpdateSamples'>-<esrally.driver.driver.UpdateSamples object at 0x7fb337558d30>-quit_0:00:00)
2021-08-04 20:48:38.779914 p7907 I    bad internal file descriptor!
2021-08-04 20:48:38.779968 p7907 I    wrecv 1168 is bad
2021-08-04 20:48:38.780125 p7907 I    bad internal file descriptor!
2021-08-04 20:48:38.780180 p7907 I    wrecv 1168 is bad
2021-08-04 20:48:38.780243 p7907 Warn Transmit attempt from ActorAddr-(T|:41611) to ActorAddr-(T|:40337) timed out, returning PoisonPacket
2021-08-04 20:48:38.780369 p7907 I    completion error: ************* TransportIntent(ActorAddr-(T|:40337)-=-SENDSTS_FAILED-WAITSLOT-<class 'esrally.driver.driver.JoinPointReached'>-<esrally.driver.driver.JoinPointReached object at 0x7fb337dd2610>-quit_0:00:00)

whereas the actual ~/.rally/logs/rally.log for the same actor shows:

2021-08-04 20:43:36,649 ActorAddr-(T|:41611)/PID:7907 esrally.driver.driver INFO Total run duration: 16205.111997 seconds.
2021-08-04 20:43:36,650 ActorAddr-(T|:41611)/PID:7907 esrally.driver.driver INFO Total time to shutdown asyncgens: 0.000340 seconds.
2021-08-04 20:43:36,652 ActorAddr-(T|:41611)/PID:7907 esrally.driver.driver INFO Total time to close transports: -0.001832 seconds.
2021-08-04 20:43:38,779 ActorAddr-(T|:41611)/PID:7907 esrally.actor INFO Worker[22] is ready for the next task.
2021-08-04 20:43:38,779 ActorAddr-(T|:41611)/PID:7907 esrally.actor INFO Worker[22] reached join point at index [8].

So the worker did reach the join point, but it could not inform the Driver actor because the message could not be sent (the transmit timed out five minutes later, at 20:48:38). One reason for this behavior is that the actor system's internal message queue was full. Unfortunately, the queue length is not configurable (see MAX_QUEUED_TRANSMITS).
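For illustration only, here is a minimal sketch of what overriding that limit could look like today, assuming MAX_QUEUED_TRANSMITS is a plain module-level constant in thespian.system.transport.asyncTransportBase (the exact module path and default value may differ between Thespian versions); this is not something Rally currently does:

```python
# Hypothetical workaround: raise Thespian's internal transmit queue limit
# before the actor system is started. Assumes the constant lives in
# thespian.system.transport.asyncTransportBase (may vary by Thespian version).
from thespian.system.transport import asyncTransportBase

# Once the fixed limit is exceeded, further sends queue up and may eventually
# time out, as seen in the actor-system-internal.log excerpt above.
asyncTransportBase.MAX_QUEUED_TRANSMITS = 10_000
```

A monkey-patch like this is brittle (it must run before the constant is read and can break on Thespian upgrades), which is why idea 1 below proposes making the value configurable upstream instead.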

This issue is about what we can do to improve resiliency while still using the actor system.

A few ideas we could explore are:

  1. Create a PR in the upstream project to support configurable internal queue size (MAX_QUEUED_TRANSMITS)
  2. Explore whether we can retry messages that time out (see the retry sketch after this list)
  3. Explore whether we can be more lenient with messages that aren't important and allow the benchmark to proceed (e.g. messages related to telemetry device metrics)
  4. Improve logging so that we can clearly understand whether a failed message jammed a benchmark at a specific task
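As a rough illustration of idea 2, the sketch below shows an application-level retry/acknowledgement pattern on top of Thespian's standard Actor API (self.send, self.wakeupAfter, WakeupMessage). The TrackedMessage and Ack classes and the RetryingSender actor are hypothetical and are not part of Rally or Thespian:

```python
# Hedged sketch of idea 2: keep retrying important messages until the
# receiver acknowledges them, instead of relying on a single send that
# may be dropped when the transport queue is saturated.
import uuid
from datetime import timedelta

from thespian.actors import Actor, WakeupMessage


class TrackedMessage:
    """A message wrapper carrying a unique id so it can be acknowledged."""
    def __init__(self, payload):
        self.msg_id = uuid.uuid4().hex
        self.payload = payload


class Ack:
    """Sent back by the receiver once a TrackedMessage has been processed."""
    def __init__(self, msg_id):
        self.msg_id = msg_id


class RetryingSender(Actor):
    RETRY_INTERVAL = timedelta(seconds=30)

    def __init__(self):
        super().__init__()
        self.pending = {}  # msg_id -> (target_address, TrackedMessage)

    def send_tracked(self, target, payload):
        # Remember the message and schedule a retry check.
        msg = TrackedMessage(payload)
        self.pending[msg.msg_id] = (target, msg)
        self.send(target, msg)
        self.wakeupAfter(self.RETRY_INTERVAL)

    def receiveMessage(self, msg, sender):
        if isinstance(msg, Ack):
            # Receiver confirmed delivery; stop retrying this message.
            self.pending.pop(msg.msg_id, None)
        elif isinstance(msg, WakeupMessage):
            # Resend anything still unacknowledged and check again later.
            for target, tracked in self.pending.values():
                self.send(target, tracked)
            if self.pending:
                self.wakeupAfter(self.RETRY_INTERVAL)
```

The receiver would reply with Ack(msg.msg_id) for each TrackedMessage it processes. The trade-off is additional message traffic, so this would likely only make sense for critical control messages (e.g. JoinPointReached), possibly combined with idea 3 so non-essential telemetry messages do not add to the load.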
DJRickyB commented 2 years ago

Would love to have an environment config that reliably exacerbates this problem. I'm very curious about the impact of http_compression, for instance, on the performance of the metrics store activity and the resultant backpressure on the metrics messages.

inqueue commented 5 months ago

The issue reproduces reliably with two large-ish load drivers during the standalone-query-string-search operation of the wikipedia track. On every occurrence I've observed: