Hmm, actually it turns out there's some scenario where we get a successful request, but the response body doesn't contain the nodes key at all (not even an empty one). I can't repro this locally:
$ curl -k -u elastic:changeme -X GET "https://localhost:9200/_tasks?actions=indices:admin/forcemerge&pretty"
{
"nodes" : { }
}
So I've got a reproduction environment set up and I'm re-running with the elastic_transport logger set to DEBUG.
Stack trace that triggered this PR:
2023-11-19 20:04:38,50 ActorAddr-(T|:39957)/PID:9595 esrally.driver.driver ERROR Cannot execute runner [user-defined context-manager enabled runner for [force-merge]]; most likely due to missing parameters.
Traceback (most recent call last):
File "/home/esbench/rally/esrally/driver/driver.py", line 2014, in execute_single
return_value = await runner(es, params)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 297, in __call__
return await self.delegate(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 412, in __call__
return_value = await self.delegate(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 345, in __call__
return await self.delegate(self.client_extractor(args[0]), *args[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 697, in __call__
if len(tasks["nodes"]) == 0:
~~~~~^^^^^^^^^
File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_response.py", line 188, in __getitem__
return self.body[item] # type: ignore[index]
~~~~~~~~~^^^^^^
KeyError: 'nodes'
2023-11-19 20:04:38,52 ActorAddr-(T|:39957)/PID:9595 esrally.driver.driver ERROR Could not execute schedule
Traceback (most recent call last):
File "/home/esbench/rally/esrally/driver/driver.py", line 2014, in execute_single
return_value = await runner(es, params)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 297, in __call__
return await self.delegate(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 412, in __call__
return_value = await self.delegate(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 345, in __call__
return await self.delegate(self.client_extractor(args[0]), *args[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 697, in __call__
if len(tasks["nodes"]) == 0:
~~~~~^^^^^^^^^
File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_response.py", line 188, in __getitem__
return self.body[item] # type: ignore[index]
~~~~~~~~~^^^^^^
KeyError: 'nodes'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/esbench/rally/esrally/driver/driver.py", line 1920, in __call__
total_ops, total_ops_unit, request_meta_data = await execute_single(runner, self.es, params, self.on_error)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/driver.py", line 2081, in execute_single
raise exceptions.SystemSetupError(msg)
esrally.exceptions.SystemSetupError: Cannot execute [user-defined context-manager enabled runner for [force-merge]]. Provided parameters are: ['index', 'max-num-segments', 'mode', 'poll-period', 'request-timeout', 'headers', 'opaque-id', 'operation-type']. Error: ['nodes'].
2023-11-19 20:04:38,53 ActorAddr-(T|:39957)/PID:9595 esrally.driver.driver INFO Worker[0] finished executing tasks ['force-merge-data-streams'] in 1204.683508 seconds
2023-11-19 20:04:39,686 ActorAddr-(T|:39957)/PID:9595 esrally.actor ERROR Worker[0] has detected a benchmark failure. Notifying master...
Traceback (most recent call last):
File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/driver.py", line 1765, in __call__
loop.run_until_complete(self.run())
File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/driver.py", line 1817, in run
_ = await asyncio.gather(*awaitables)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/driver.py", line 1982, in __call__
raise exceptions.RallyError(f"Cannot run task [{self.task}]: {e}") from None
esrally.exceptions.RallyError: Cannot run task [force-merge-data-streams]: Cannot execute [user-defined context-manager enabled runner for [force-merge]]. Provided parameters are: ['index', 'max-num-segments', 'mode', 'poll-period', 'request-timeout', 'headers', 'opaque-id', 'operation-type']. Error: ['nodes'].
The exception is raised here and caught here. Based on this we know the exception is a KeyError, so my interpretation matches the description in https://github.com/elastic/rally/pull/1804#issuecomment-1818341840. If that's correct, the added exception handlers will not catch anything. Do you agree @b-deam?
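For illustration, a minimal sketch of that point (the exception types the handlers target are my assumption here, not necessarily what the PR adds):

import elasticsearch
import elastic_transport


async def poll_forcemerge_tasks(es):
    try:
        tasks = await es.tasks.list(actions="indices:admin/forcemerge")
        # Raises a plain builtins.KeyError if the body has no "nodes" key at all.
        return len(tasks["nodes"]) == 0
    except (elasticsearch.ApiError, elastic_transport.TransportError):
        # A missing "nodes" key is neither an API nor a transport error, so the
        # KeyError sails straight past handlers like these.
        raise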
I should have left this in draft because I've been trying to reproduce the issue, but I agree, and I came to the same conclusion as you did. I just couldn't see how a successful response would be missing the nodes property, and since I saw this error in the logs, I threw out a hail mary to see if we were somehow swallowing a client exception:
2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://10.132.0.63:9200)> has been marked alive after a successful request
2023-11-21 03:16:14,821 ActorAddr-(T|:38803)/PID:8358 esrally.driver.driver ERROR Cannot execute runner [user-defined context-manager enabled runner for [force-merge]]; most likely due to missing parameters.
Traceback (most recent call last):
But I've now managed to reproduce the issue, and I'm squarely back in the ❓ camp... as it looks like it's possible the API response has changed?
2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.node DEBUG > GET /_tasks?actions=indices%3Aadmin%2Fforcemerge HTTP/1.1
> Accept: application/vnd.elasticsearch+json; compatible-with=8
> Authorization: Basic <hidden>
> Connection: keep-alive
> User-Agent: elasticsearch-py/8.6.1 (Python/3.11.4; elastic-transport/8.4.1)
> X-Elastic-Client-Meta: es=8.6.1,py=3.11.4,t=8.4.1,ai=3.9.0
< HTTP/1.1 200 OK
< Content-Length: 55
< Content-Type: application/vnd.elasticsearch+json;compatible-with=8
< X-Elastic-Product: Elasticsearch
< {"_shards":{"total":3250,"successful":3250,"failed":0}}
2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.transport INFO GET https://10.132.0.63:9200/_tasks?actions=indices%3Aadmin%2Fforcemerge [status:200 duration:12.160s]
2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://10.132.0.63:9200)> has been marked alive after a successful request
2023-11-21 03:16:14,821 ActorAddr-(T|:38803)/PID:8358 esrally.driver.driver ERROR Cannot execute runner [user-defined context-manager enabled runner for [force-merge]]; most likely due to missing parameters.
Traceback (most recent call last):
It looks like the force merge API response (and not the tasks list API response) is somehow being returned:
$ curl -s -k -u esbench:super-secret-password -XPOST "https://10.132.0.63:9200/_forcemerge?pretty"
{
"_shards" : {
"total" : 3250,
"successful" : 3250,
"failed" : 0
}
}
I'm becoming increasingly convinced we have a bug in RallyAsyncElasticsearch. I wrote a small script that uses the vanilla AsyncElasticsearch client to call the tasks API in a loop:
from elasticsearch import AsyncElasticsearch
import asyncio
import ssl
import certifi
from datetime import datetime, timezone

ssl_context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=certifi.where())
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

es = AsyncElasticsearch(
    hosts=["https://10.132.0.63:9200"],
    ssl_context=ssl_context,
    verify_certs=False,
    basic_auth=("elastic", "changeme"),
)


async def tasks():
    while True:
        tasks = await es.tasks.list(params={"actions": "indices:admin/forcemerge"})
        print(datetime.now(timezone.utc), tasks)


async def main():
    await tasks()


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
I run this script concurrently with the Rally reproduction and don't see this behaviour.
Standalone AsyncElasticsearch:
2023-11-21 05:57:47.732064+00:00 {'nodes': {'7RXtokVtQciED8QZpiJ4Vw': {'name': 'elasticsearch-1', 'transport_address': '10.132.0.63:9300', 'host': '10.132.0.63', 'ip': '10.132.0.63:9300', 'roles': ['data', 'data_cold', 'data_content', 'data_frozen', 'data_hot', 'data_warm', 'ingest', 'master', 'ml', 'remote_cluster_client', 'transform'], 'attributes': {'xpack.installed': 'true', 'transform.config_version': '10.0.0'}, 'tasks': {'7RXtokVtQciED8QZpiJ4Vw:10080922': {'node': '7RXtokVtQciED8QZpiJ4Vw', 'id': 10080922, 'type': 'transport', 'action': 'indices:admin/forcemerge', 'start_time_in_millis': 1700546229760, 'running_time_in_nanos': 37971235878, 'cancellable': False, 'headers': {}}}}}}
2023-11-21 05:57:47.733552+00:00 {'nodes': {}}
Rally logs:
2023-11-21 05:57:47,733 -not-actor-/PID:5295 elastic_transport.node DEBUG > GET /_tasks?actions=indices%3Aadmin%2Fforcemerge HTTP/1.1
> Accept: application/vnd.elasticsearch+json; compatible-with=8
> Authorization: Basic <hidden>
> Connection: keep-alive
> User-Agent: elasticsearch-py/8.6.1 (Python/3.11.4; elastic-transport/8.4.1)
> X-Elastic-Client-Meta: es=8.6.1,py=3.11.4,t=8.4.1,ai=3.9.0
< HTTP/1.1 200 OK
< Content-Length: 55
< Content-Type: application/vnd.elasticsearch+json;compatible-with=8
< X-Elastic-Product: Elasticsearch
< {"_shards":{"total":3250,"successful":3250,"failed":0}}
It's definitely looking like a client-side issue. I've had mixed luck reproducing the issue with a standalone script using both the vanilla AsyncElasticsearch client and the RallyAsyncElasticsearch client, but I have a 100% success rate in producing this error when invoking the Rally track.
I enabled TRACE-level HTTP request and response body logging on the server side, and also set some client-side x-opaque-id headers to help map requests to responses. This is what I got.
Setting TRACE logging:
# network tracing in jvm.options
-Des.insecure_network_trace_enabled=true

curl -k -u elastic:changeme -X PUT "https://10.132.0.63:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "logger.org.elasticsearch.http.HttpTracer" : "TRACE",
    "logger.org.elasticsearch.http.HttpBodyTracer" : "TRACE"
  }
}
'

curl -k -u elastic:changeme -XPUT "https://10.132.0.63:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "http.tracer.include" : [ "*/_tasks*", "*forcemerge*" ],
    "http.tracer.exclude" : ""
  }
}
'
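And on the client side, the x-opaque-id header can be attached per request. A minimal sketch with the vanilla client (Rally itself passes the opaque-id via operation params, so this is purely illustrative):

import asyncio

from elasticsearch import AsyncElasticsearch


async def main():
    es = AsyncElasticsearch(
        hosts=["https://10.132.0.63:9200"],
        basic_auth=("elastic", "changeme"),
        verify_certs=False,
    )
    # .options(opaque_id=...) adds an X-Opaque-Id header, which then shows up in both the
    # client-side DEBUG log and the server-side HttpTracer entries, making it easy to pair
    # requests with responses.
    tasks = await es.options(opaque_id="rally-tasks").tasks.list(actions="indices:admin/forcemerge")
    print(tasks)
    await es.close()


asyncio.run(main())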
What Elasticsearch reports:
[2023-11-22T00:08:50,429][TRACE][o.e.h.HttpTracer ] [elasticsearch-1] [3559][rally-forcemerge][POST][/logs-kafka.log-default,logs-nginx.access-default,logs-nginx.error-default,logs-k8-application.log-default,logs-mysql.slowlog-default,logs-mysql.error-default,logs-system.auth-default,logs-system.syslog-default,logs-postgresql.log-default,logs-apache.access-default,logs-apache.error-default,logs-redis.log-default,logs-redis.slowlog-default/_forcemerge?max_num_segments=1] received request from [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}]
[2023-11-22T00:08:50,429][TRACE][o.e.h.HttpBodyTracer ] [elasticsearch-1] [3559] request body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/wMAAAAAAAAAAAA=
[2023-11-22T00:09:05,665][TRACE][o.e.h.HttpTracer ] [elasticsearch-1] [3560][rally-tasks][GET][/_tasks?actions=indices:admin/forcemerge] received request from [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}]
[2023-11-22T00:09:05,666][TRACE][o.e.h.HttpBodyTracer ] [elasticsearch-1] [3560] request body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/wMAAAAAAAAAAAA=
[2023-11-22T00:09:15,290][WARN ][o.e.c.c.Coordinator ] [elasticsearch-1] This node is a fully-formed single-node cluster with cluster UUID [HR1fLqrpRBqfsV4ePrU94A], but it is configured as if to discover other nodes and form a multi-node cluster via the [discovery.seed_hosts=[10.132.0.63:9300]] setting. Fully-formed clusters do not attempt to discover other nodes, and nodes with different cluster UUIDs cannot belong to the same cluster. The cluster UUID persists across restarts and can only be changed by deleting the contents of the node's data path(s). Remove the discovery configuration to suppress this message.
[2023-11-22T00:09:17,217][TRACE][o.e.h.HttpBodyTracer ] [elasticsearch-1] [3559] response body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/6tWii/OSCxKKVayqlYqyS9JzFGyMjYyNdBRKi5NTk4tLk4rhYukJWbmpKYoWRnU1gIAV44k8zcAAAA=
[2023-11-22T00:09:17,218][TRACE][o.e.h.HttpTracer ] [elasticsearch-1] [3559][rally-forcemerge][OK][application/vnd.elasticsearch+json;compatible-with=8][55] sent response to [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}] success [true]
[2023-11-22T00:09:17,218][TRACE][o.e.h.HttpBodyTracer ] [elasticsearch-1] [3560] response body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/31RTU/DMAz9LzlvpR9bx3qGA1wQHBACocok3hYtdUri8in+O86+AAloD4nznu3n53dF3mBUzbuaXd2wX1/zpbanJ8eXt709n1w/J4SgQ9UodBDZ6ogQ9GpcqJHiABR7H7gFYwJGqaOKPCuqMsuzumrmVZ4LbeUj/0Tk0fa/k4N3Sc+dMsAgcTpa7Z35uhMj8T5cBP+GtI9W/gA8Q+hSH1piTI+dqMeQLi61wc4zttoN6VVOu625GWnhJfV+pIA52IeBtwYdoEwkLOyyfcIQraftHPJL+ksPep1ZigzOoRGIw4DqQwpDXP/jc1NW8+N5Pd/4LSuRzD8WIiNJ3R1d6r72uGmzW4XgoHkry5KxGmMDprN0JNI1dhiWKBwRKGtj22Frqe2sc1bUFbM8r4tiVuWTUmqHgUj8O9AIyCfWtKxqaT+d1iOlgTQ6Bw9OZCzARZSFIxjxJs1yM77o4XHA8VkyI4grr+NvQj423yfrFW6LhgIAAA==
[2023-11-22T00:09:17,218][TRACE][o.e.h.HttpTracer ] [elasticsearch-1] [3560][rally-tasks][OK][application/vnd.elasticsearch+json;compatible-with=8][null] sent response to [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}] success [true]
If we then decode those responses:
# all responses
cat trace.log | grep 'HttpBodyTracer' | sed -e 's/.*: //' | base64 --decode | gzip --decompress | jq .
{
"_shards": {
"total": 3250,
"successful": 3250,
"failed": 0
}
}
{
"nodes": {
"7RXtokVtQciED8QZpiJ4Vw": {
"name": "elasticsearch-1",
"transport_address": "10.132.0.63:9300",
"host": "10.132.0.63",
"ip": "10.132.0.63:9300",
"roles": [
"data",
"data_cold",
"data_content",
"data_frozen",
"data_hot",
"data_warm",
"ingest",
"master",
"ml",
"remote_cluster_client",
"transform"
],
"attributes": {
"transform.config_version": "10.0.0",
"xpack.installed": "true"
},
"tasks": {
"7RXtokVtQciED8QZpiJ4Vw:2398969": {
"node": "7RXtokVtQciED8QZpiJ4Vw",
"id": 2398969,
"type": "transport",
"action": "indices:admin/forcemerge",
"start_time_in_millis": 1700611730429,
"running_time_in_nanos": 15236989556,
"cancellable": false,
"headers": {
"X-Opaque-Id": "rally-forcemerge"
}
}
}
}
}
}
# request id 3559 which is forcemerge
$ cat trace.log | grep 'HttpBodyTracer' | grep '3559' | sed -e 's/.*: //' | base64 --decode | gzip --decompress | jq .
{
"_shards": {
"total": 3250,
"successful": 3250,
"failed": 0
}
}
# request id 3560 which is tasks
$ cat trace.log | grep 'HttpBodyTracer' | grep '3560' | sed -e 's/.*: //' | base64 --decode | gzip --decompress | jq .
{
"nodes": {
"7RXtokVtQciED8QZpiJ4Vw": {
"name": "elasticsearch-1",
"transport_address": "10.132.0.63:9300",
"host": "10.132.0.63",
"ip": "10.132.0.63:9300",
"roles": [
"data",
"data_cold",
"data_content",
"data_frozen",
"data_hot",
"data_warm",
"ingest",
"master",
"ml",
"remote_cluster_client",
"transform"
],
"attributes": {
"transform.config_version": "10.0.0",
"xpack.installed": "true"
},
"tasks": {
"7RXtokVtQciED8QZpiJ4Vw:2398969": {
"node": "7RXtokVtQciED8QZpiJ4Vw",
"id": 2398969,
"type": "transport",
"action": "indices:admin/forcemerge",
"start_time_in_millis": 1700611730429,
"running_time_in_nanos": 15236989556,
"cancellable": false,
"headers": {
"X-Opaque-Id": "rally-forcemerge"
}
}
}
}
}
}
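For reference, the same decode can be done in Python. A small sketch using the request id 3559 response payload from the HttpBodyTracer entry above:

import base64
import gzip
import json

# Response body for request id 3559 (the force merge call), copied verbatim from the
# HttpBodyTracer entry above.
payload = "H4sIAAAAAAAA/6tWii/OSCxKKVayqlYqyS9JzFGyMjYyNdBRKi5NTk4tLk4rhYukJWbmpKYoWRnU1gIAV44k8zcAAAA="

body = json.loads(gzip.decompress(base64.b64decode(payload)))
# Per the decode above, this prints the {"_shards": ...} force merge response.
print(json.dumps(body, indent=2))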
What the Python client's elastic transport package logged:
2023-11-22 00:09:17,219 -not-actor-/PID:21969 elastic_transport.node DEBUG > GET /_tasks?actions=indices%3Aadmin%2Fforcemerge HTTP/1.1
> Accept: application/vnd.elasticsearch+json; compatible-with=8
> Authorization: Basic <hidden>
> Connection: keep-alive
> User-Agent: elasticsearch-py/8.6.1 (Python/3.11.4; elastic-transport/8.4.1)
> X-Elastic-Client-Meta: es=8.6.1,py=3.11.4,t=8.4.1,ai=3.9.0
> X-Opaque-Id: rally-tasks
< HTTP/1.1 200 OK
< Content-Length: 55
< Content-Type: application/vnd.elasticsearch+json;compatible-with=8
< X-Opaque-Id: rally-forcemerge
< X-Elastic-Product: Elasticsearch
< {"_shards":{"total":3250,"successful":3250,"failed":0}}
2023-11-22 00:09:17,219 -not-actor-/PID:21969 elastic_transport.transport INFO GET https://10.132.0.63:9200/_tasks?actions=indices%3Aadmin%2Fforcemerge [status:200 duration:11.555s]
Talk about convenient timing! I've confirmed we're hitting this bug: https://github.com/aio-libs/aiohttp/issues/7864. aiohttp is a transitive dependency of esrally, and 3.9.0 was released on 18 November, which is right around when we first noticed this bug being caught by CI on the 19th. I think I'll raise a PR to pin us to 3.8.6, which was the last release before 3.9.0.
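For the shape of the pin I have in mind, something like this hypothetical setup.py-style dependency entry (illustrative only; the actual change is whatever lands in the PR):

install_requires = [
    # ...
    # Stay on the last release before 3.9.0 to avoid aio-libs/aiohttp#7864, where the
    # response from an earlier request can be returned for a later one.
    "aiohttp==3.8.6",
]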
No longer needed, https://github.com/elastic/rally/pull/1806 is the fix.
Apparently it's possible that we just swallow the exception here. In hundreds of runs we haven't seen this before, but I'd like to properly capture the error if it happens again; then we can decide how (or whether) we add any resiliency.
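A rough sketch of what I mean by capturing it properly (names like RallyAssertionError just follow Rally conventions; this is not the actual change):

import logging

from esrally import exceptions

logger = logging.getLogger(__name__)


async def all_forcemerge_tasks_finished(es):
    tasks = await es.tasks.list(actions="indices:admin/forcemerge")
    if "nodes" not in tasks:
        # Log the raw body so the next occurrence can actually be diagnosed instead of
        # surfacing as a bare KeyError.
        logger.error("Unexpected tasks API response body: %s", tasks.body)
        raise exceptions.RallyAssertionError(f"Tasks API response has no 'nodes' key: {tasks.body}")
    return len(tasks["nodes"]) == 0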