Fix exception in task executor where task fails but doesn't set request_context.request_end

ty-elastic commented 4 months ago

Under "high" rally load (~10Gbit/s using the elastic/logs track with 1000 bulk indexing clients), I can consistently get Rally to crash within some minutes. Looking at the logs, it seems to happen when the target ES momentarily cannot accommodate the load (e.g., returns 429s). The logs prior to the exception will be like:

2024-06-26 21:45:12,264 ActorAddr-(T|:41695)/PID:252988 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has failed for 1 times in a row, putting on 1 second timeout
2024-06-26 21:45:12,790 ActorAddr-(T|:40017)/PID:252978 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:12,798 ActorAddr-(T|:46313)/PID:252969 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:12,853 ActorAddr-(T|:41695)/PID:252988 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:12,861 ActorAddr-(T|:41695)/PID:252988 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:13,373 ActorAddr-(T|:46313)/PID:252969 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has been marked alive after a successful request
2024-06-26 21:45:13,645 ActorAddr-(T|:37579)/PID:252985 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has failed for 1 times in a row, putting on 1 second timeout
2024-06-26 21:45:13,680 ActorAddr-(T|:43345)/PID:252987 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://221.229.133.34.bc.googleusercontent.com:9200)> has failed for 1 times in a row, putting on 1 second timeout

thereafter, Rally will stop with an exception:

TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
esrally.driver.driver ERROR Could not execute schedule

Looking at a stack trace, this exception is generated because:

  File "/rally/esrally/driver/driver.py", line 1948, in __call__
    service_time = request_end - request_start
                   ~~~~~~~~~~~~^~~~~~~~~~~~~~~

where request_end is unexpectedly None.

Looking at the code, I think the intent is that request_context.request_end should always be set, even on error scenarios. Perhaps there is some race condition with the HTTP request lib, though, where for some reason on_request_end() is never called, leaving request_context.request_end unset.

This PR doesn't try to address the root cause (why is request_context.request_end unset), but rather acknowledges that it is unset (only in already acknowledged error scenarios, i.e., request_meta_data["success"] == False), and sets it. My assumption is that because request_meta_data["success"] == False and because it can be False at this point (with request_end set to something), that some other code down the line will handle counting the error and what we need to focus on here is just not stopping overall execution (due to an unplanned exception) on an otherwise non-fatal error..

rally_log.txt rally_config.txt

ty-elastic commented 4 months ago

alternatively, you could perhaps set request_context.request_end the first time you set request_context.request_start, since obviously the intent is that request_end can be continually updated.

ty-elastic commented 4 months ago

nm. this is correctly addressed here, I believe.

ty-elastic commented 4 months ago

fixed in https://github.com/elastic/rally/pull/1861

elastic / rally

Fix exception in task executor where task fails but doesn't set request_context.request_end #1864