elastic / rally

Macrobenchmarking framework for Elasticsearch
Apache License 2.0
1.91k stars 314 forks source link

Unsupported operand leading to benchmark failure #1860

Closed gbanasiak closed 5 days ago

gbanasiak commented 6 days ago

Noticed with latest Rally version:

$ esrally --version
esrally 2.11.0.dev0 (git revision: f72ea624c7539f2f6672a97c3646423817aa45ed)

When running nyc_taxis track with update challenge the following stack trace was observed:

2024-06-26 06:37:26,439 ActorAddr-(T|:33319)/PID:3235 esrally.driver.driver ERROR Could not execute schedule
Traceback (most recent call last):

  File "/home/esbench/rally/esrally/driver/driver.py", line 1942, in __call__
    service_time = request_end - request_start
                   ~~~~~~~~~~~~^~~~~~~~~~~~~~~

TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'

2024-06-26 06:37:26,440 ActorAddr-(T|:33319)/PID:3235 esrally.driver.driver INFO Worker[1] finished executing tasks ['update'] in 870.963661 seconds
2024-06-26 06:37:30,440 ActorAddr-(T|:46501)/PID:3234 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode([https://REDACTED.elastic.cloud:443](https://REDACTED.elastic.cloud/))> has failed for 1 times in a row, putting on 1 second timeout
2024-06-26 06:37:31,410 ActorAddr-(T|:33319)/PID:3235 esrally.actor ERROR Worker[1] has detected a benchmark failure. Notifying master...
Traceback (most recent call last):

  File "/home/esbench/.pyenv/versions/3.11.7/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1782, in __call__
    loop.run_until_complete(self.run())

  File "/home/esbench/.pyenv/versions/3.11.7/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1834, in run
    _ = await asyncio.gather(*awaitables)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1999, in __call__
    raise exceptions.RallyError(f"Cannot run task [{self.task}]: {e}") from None

esrally.exceptions.RallyError: Cannot run task [update]: unsupported operand type(s) for -: 'NoneType' and 'float'

2024-06-26 06:37:31,412 ActorAddr-(T|:38959)/PID:3091 esrally.actor ERROR Main driver received a fatal exception from a load generator. Shutting down.
gbanasiak commented 6 days ago

After adding the following ad-hoc logs:

$ git diff
diff --git a/esrally/client/factory.py b/esrally/client/factory.py
index 2b5b99e3..950718b0 100644
--- a/esrally/client/factory.py
+++ b/esrally/client/factory.py
@@ -219,9 +219,18 @@ class EsClientFactory:
                     return super().loads(data)

         async def on_request_start(session, trace_config_ctx, params):
+            self.logger.debug("on_request_start")
             RallyAsyncElasticsearch.on_request_start()

         async def on_request_end(session, trace_config_ctx, params):
+            self.logger.debug("on_request_end")
+
+        async def on_response_chunk_received(session, trace_config_ctx, params):
+            self.logger.debug("on_response_chunk_received")
+            RallyAsyncElasticsearch.on_request_end()
+
+        async def on_request_exception(session, trace_config_ctx, params):
+            self.logger.debug("on_request_exception")
             RallyAsyncElasticsearch.on_request_end()

         trace_config = aiohttp.TraceConfig()
@@ -235,9 +244,11 @@ class EsClientFactory:
         # Rally's implementation of the `on_request_end` callback handler updates the timestamp on every call, Rally
         # will ultimately record the time when it received the *last* chunk. This is what we want because any code
         # that is using the Elasticsearch client library can only act on the response once it is fully received.
-        trace_config.on_response_chunk_received.append(on_request_end)
+        trace_config.on_response_chunk_received.append(on_response_chunk_received)
+        # for logging purposes
+        trace_config.on_request_end.append(on_request_end)
         # ensure that we also stop the timer when a request "ends" with an exception (e.g. a timeout)
-        trace_config.on_request_exception.append(on_request_end)
+        trace_config.on_request_exception.append(on_request_exception)

         # override the builtin JSON serializer
         self.client_options["serializer"] = LazyJSONSerializer()

And configuring logging.json as follows:

  "loggers": {
    "elasticsearch": {
      "handlers": [
        "rally_log_handler"
      ],
      "level": "DEBUG",
      "propagate": false
    },
    "rally.profile": {
      "handlers": [
        "rally_profile_handler"
      ],
      "level": "INFO",
      "propagate": false
    },
    "elastic_transport": {
      "handlers": [
        "rally_log_handler"
      ],
      "level": "DEBUG",
      "propagate": false
    },
    "esrally.client.factory": {
      "handlers": [
        "rally_log_handler"
      ],
      "level": "DEBUG",
      "propagate": false
    }
  }

The following sequences were captured:

# healthy request
2024-06-26 12:54:19,209 ActorAddr-(T|:38747)/PID:26852 esrally.client.factory DEBUG on_request_start
2024-06-26 12:54:26,457 ActorAddr-(T|:38747)/PID:26852 esrally.client.factory DEBUG on_request_end
2024-06-26 12:54:26,473 ActorAddr-(T|:38747)/PID:26852 esrally.client.factory DEBUG on_response_chunk_received
2024-06-26 12:54:26,474 -not-actor-/PID:26852 elastic_transport.node DEBUG > PUT /_bulk HTTP/1.1
[.. request / response ..]
2024-06-26 12:54:26,485 -not-actor-/PID:26852 elastic_transport.transport INFO PUT [https://REDACTED.elastic.cloud:443/_bulk](https://REDACTED.elastic.cloud/_bulk) [status:200 duration:7.276s]

# with timeout and on_request_exception
2024-06-26 12:54:26,517 ActorAddr-(T|:38747)/PID:26852 esrally.client.factory DEBUG on_request_start
2024-06-26 12:54:37,440 ActorAddr-(T|:38747)/PID:26852 esrally.client.factory DEBUG on_request_exception
2024-06-26 12:54:37,441 -not-actor-/PID:26852 elastic_transport.node DEBUG > PUT /_bulk HTTP/?.?
[.. request ..]
elastic_transport.ConnectionTimeout: Connection timeout caused by: TimeoutError()
2024-06-26 12:54:37,447 -not-actor-/PID:26852 elastic_transport.transport INFO PUT [https://REDACTED.elastic.cloud:443/_bulk](https://REDACTED.elastic.cloud/_bulk) [status:N/A duration:10.930s]

# with timeout but without on_request_exception
2024-06-26 12:54:48,479 ActorAddr-(T|:38747)/PID:26852 esrally.client.factory DEBUG on_request_start
2024-06-26 12:54:59,438 ActorAddr-(T|:38747)/PID:26852 esrally.client.factory DEBUG on_request_end
2024-06-26 12:54:59,440 -not-actor-/PID:26852 elastic_transport.node DEBUG > PUT /_bulk HTTP/?.?
[.. request ..]
elastic_transport.ConnectionTimeout: Connection timeout caused by: TimeoutError()
2024-06-26 12:54:59,445 -not-actor-/PID:26852 elastic_transport.transport INFO PUT [https://REDACTED.elastic.cloud:443/_bulk](https://REDACTED.elastic.cloud/_bulk) [status:N/A duration:10.966s]

This suggests that some timeouts lead to end state instead of exception state in aiohttp (see here). After https://github.com/elastic/rally/pull/1823 end state is no longer handled by Rally as callback for on_request_end was removed in favor of on_response_chunk_received callback.

gbanasiak commented 6 days ago

I'm checking whether on_request_end callback restoration while keeping on_response_chunk_received is enough to eliminate the bug.

gbanasiak commented 5 days ago

Test results looked fine, so I've raised a fix in https://github.com/elastic/rally/pull/1861.