elastic / rally

Macrobenchmarking framework for Elasticsearch
Apache License 2.0
1.91k stars 314 forks source link

Service time measurement stops at first response byte #1822

Closed danielmitterdorfer closed 5 months ago

danielmitterdorfer commented 5 months ago

Rally version (get with esrally --version): esrally 2.10.0.dev0 (git revision: df49d4d2c45d3dbcd2e6791524b0c5812430105e)

Invoked command: N/A

Configuration file (located in ~/.rally/rally.ini)): N/A

JVM version: N/A

OS version: N/A

Description of the problem including expected versus actual behavior:

Rally uses aiohttp tracing callbacks to measure service time, specifically:

The state diagram in the aiohttp docs suggests that on_request_end is the callback that is called when the response has been completely received, however, looking at the implementation in aiohttp, this is not correct. Instead, this callback is invoked when the first byte (or initial chunk) of the response is received, see (initial await and callback invocation) but not when the response body has been read. This happens in higher-level code (see aio implementation of elastic-transport that reads the body).

Consequently, Rally stops the timer when the first byte of the response is received, instead of the entire response. As any code that uses the Elasticsearch client cannot make any progress before the body has been received, service time measurement is too "optimistic" at the moment. In many cases the difference will be small but if the response is large this can make a difference of several hundred milliseconds.

Steps to reproduce:

  1. Run a Rally track that produces large responses (e.g. Universal Profiling, only available internally at the moment)
  2. Observe the time difference between using the current callback and the fixed version (see below)

Fix

The following snippet fixes the issue:

diff --git a/esrally/client/factory.py b/esrally/client/factory.py
index 4a4c0179..daaaeeee 100644
--- a/esrally/client/factory.py
+++ b/esrally/client/factory.py
@@ -225,7 +225,7 @@ class EsClientFactory:

         trace_config = aiohttp.TraceConfig()
         trace_config.on_request_start.append(on_request_start)
-        trace_config.on_request_end.append(on_request_end)
+        trace_config.on_response_chunk_received.append(on_request_end)
         # ensure that we also stop the timer when a request "ends" with an exception (e.g. a timeout)
         trace_config.on_request_exception.append(on_request_end)

The fix works because Rally continuously updates the end time on each callback invocation so in the end, the timestamp corresponds to the timestamp when the last chunk has been received.