Capture 'ForceMerge' task list exceptions

b-deam commented 1 year ago

Apparently it's possible that we just swallow the exception here. In 100s of runs we haven't seen this before, but I'd like to properly capture the error if it happens again, then we can decide how (or if) we add any resiliency.

b-deam commented 1 year ago

Hmm, actually turns out there's some scenario where we get a successful request, but the response body doesn't contain an empty nodes key. I can't repro this locally:

$ curl -k -u elastic:changeme -X GET "https://localhost:9200/_tasks?actions=indices:admin/forcemerge&pretty"
{
  "nodes" : { }
}

So I've got a reproduction env setup and re-running with the elastic_transport logger set to DEBUG.

gbanasiak commented 1 year ago

Stack trace that triggered this PR:

2023-11-19 20:04:38,50 ActorAddr-(T|:39957)/PID:9595 esrally.driver.driver ERROR Cannot execute runner [user-defined context-manager enabled runner for [force-merge]]; most likely due to missing parameters.
Traceback (most recent call last):

  File "/home/esbench/rally/esrally/driver/driver.py", line 2014, in execute_single
    return_value = await runner(es, params)
                   ^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 297, in __call__
    return await self.delegate(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 412, in __call__
    return_value = await self.delegate(*args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 345, in __call__
    return await self.delegate(self.client_extractor(args[0]), *args[1:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 697, in __call__
    if len(tasks["nodes"]) == 0:
           ~~~~~^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_response.py", line 188, in __getitem__
    return self.body[item]  # type: ignore[index]
           ~~~~~~~~~^^^^^^

KeyError: 'nodes'

2023-11-19 20:04:38,52 ActorAddr-(T|:39957)/PID:9595 esrally.driver.driver ERROR Could not execute schedule
Traceback (most recent call last):

  File "/home/esbench/rally/esrally/driver/driver.py", line 2014, in execute_single
    return_value = await runner(es, params)
                   ^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 297, in __call__
    return await self.delegate(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 412, in __call__
    return_value = await self.delegate(*args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 345, in __call__
    return await self.delegate(self.client_extractor(args[0]), *args[1:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 697, in __call__
    if len(tasks["nodes"]) == 0:
           ~~~~~^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_response.py", line 188, in __getitem__
    return self.body[item]  # type: ignore[index]
           ~~~~~~~~~^^^^^^

KeyError: 'nodes'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/home/esbench/rally/esrally/driver/driver.py", line 1920, in __call__
    total_ops, total_ops_unit, request_meta_data = await execute_single(runner, self.es, params, self.on_error)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 2081, in execute_single
    raise exceptions.SystemSetupError(msg)

esrally.exceptions.SystemSetupError: Cannot execute [user-defined context-manager enabled runner for [force-merge]]. Provided parameters are: ['index', 'max-num-segments', 'mode', 'poll-period', 'request-timeout', 'headers', 'opaque-id', 'operation-type']. Error: ['nodes'].

2023-11-19 20:04:38,53 ActorAddr-(T|:39957)/PID:9595 esrally.driver.driver INFO Worker[0] finished executing tasks ['force-merge-data-streams'] in 1204.683508 seconds
2023-11-19 20:04:39,686 ActorAddr-(T|:39957)/PID:9595 esrally.actor ERROR Worker[0] has detected a benchmark failure. Notifying master...
Traceback (most recent call last):

  File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1765, in __call__
    loop.run_until_complete(self.run())

  File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1817, in run
    _ = await asyncio.gather(*awaitables)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1982, in __call__
    raise exceptions.RallyError(f"Cannot run task [{self.task}]: {e}") from None

esrally.exceptions.RallyError: Cannot run task [force-merge-data-streams]: Cannot execute [user-defined context-manager enabled runner for [force-merge]]. Provided parameters are: ['index', 'max-num-segments', 'mode', 'poll-period', 'request-timeout', 'headers', 'opaque-id', 'operation-type']. Error: ['nodes'].

Exception is raised in here and caught here. Based on this we know the exception is KeyError so my interpretation matches description from https://github.com/elastic/rally/pull/1804#issuecomment-1818341840. If that's correct the added exception handlers will not catch anything. Do you agree @b-deam ?

b-deam commented 12 months ago

Exception is raised in here and caught here. Based on this we know the exception is KeyError so my interpretation matches description from #1804 (comment). If that's correct the added exception handlers will not catch anything. Do you agree @b-deam ?

I should have left this in draft because I've been trying to reproduce the issue, but I agree, and I came to the same conclusion as you did - I just couldn't see how a successful response would be missing nodes property, and I saw this error in the logs, so I threw out a hail mary to see if we were somehow swallowing a client exception:

2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://10.132.0.63:9200)> has been marked alive after a successful request
2023-11-21 03:16:14,821 ActorAddr-(T|:38803)/PID:8358 esrally.driver.driver ERROR Cannot execute runner [user-defined context-manager enabled runner for [force-merge]]; most likely due to missing parameters.
Traceback (most recent call last):

But, I've now managed to reproduce the issue and I'm squarely back in the ❓ camp.. as it looks like it's possible the API response has changed?

2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.node DEBUG > GET /_tasks?actions=indices%3Aadmin%2Fforcemerge HTTP/1.1
> Accept: application/vnd.elasticsearch+json; compatible-with=8
> Authorization: Basic <hidden>
> Connection: keep-alive
> User-Agent: elasticsearch-py/8.6.1 (Python/3.11.4; elastic-transport/8.4.1)
> X-Elastic-Client-Meta: es=8.6.1,py=3.11.4,t=8.4.1,ai=3.9.0
< HTTP/1.1 200 OK
< Content-Length: 55
< Content-Type: application/vnd.elasticsearch+json;compatible-with=8
< X-Elastic-Product: Elasticsearch
< {"_shards":{"total":3250,"successful":3250,"failed":0}}
2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.transport INFO GET https://10.132.0.63:9200/_tasks?actions=indices%3Aadmin%2Fforcemerge [status:200 duration:12.160s]
2023-11-21 03:16:14,821 -not-actor-/PID:8358 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://10.132.0.63:9200)> has been marked alive after a successful request
2023-11-21 03:16:14,821 ActorAddr-(T|:38803)/PID:8358 esrally.driver.driver ERROR Cannot execute runner [user-defined context-manager enabled runner for [force-merge]]; most likely due to missing parameters.
Traceback (most recent call last):

It looks like somehow the force merge API response (and not the tasks list API response) is returned:

$ curl -s -k -u esbench:super-secret-password -XPOST "https://10.132.0.63:9200/_forcemerge?pretty"
{
  "_shards" : {
    "total" : 3250,
    "successful" : 3250,
    "failed" : 0
  }
}

b-deam commented 12 months ago

I'm becoming increasingly more convinced we have a bug in RallyAsyncElasticsearch. I wrote a small script that uses the vanilla AsyncElasticsearch client to call the tasks API in a loop:

from elasticsearch import AsyncElasticsearch
import asyncio
import ssl
import certifi
from datetime import datetime, timezone

ssl_context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=certifi.where())
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

es = AsyncElasticsearch(
  hosts=["https://10.132.0.63:9200"],
  ssl_context=ssl_context,
  verify_certs=False,
  basic_auth=("elastic", "changeme"),
)

async def tasks():
  while True:
    tasks = await es.tasks.list(params={"actions": "indices:admin/forcemerge"})
    print(datetime.now(timezone.utc), tasks)

async def main():
  await tasks()

if __name__ == "__main__":
  loop = asyncio.get_event_loop()
  loop.run_until_complete(main())

I run this script concurrently with the Rally reproduction and don't see this behaviour.

Standalone AsyncElasticsearch:

2023-11-21 05:57:47.732064+00:00 {'nodes': {'7RXtokVtQciED8QZpiJ4Vw': {'name': 'elasticsearch-1', 'transport_address': '10.132.0.63:9300', 'host': '10.132.0.63', 'ip': '10.132.0.63:9300', 'roles': ['data', 'data_cold', 'data_content', 'data_frozen', 'data_hot', 'data_warm', 'ingest', 'master', 'ml', 'remote_cluster_client', 'transform'], 'attributes': {'xpack.installed': 'true', 'transform.config_version': '10.0.0'}, 'tasks': {'7RXtokVtQciED8QZpiJ4Vw:10080922': {'node': '7RXtokVtQciED8QZpiJ4Vw', 'id': 10080922, 'type': 'transport', 'action': 'indices:admin/forcemerge', 'start_time_in_millis': 1700546229760, 'running_time_in_nanos': 37971235878, 'cancellable': False, 'headers': {}}}}}}
2023-11-21 05:57:47.733552+00:00 {'nodes': {}}

Rally logs:

2023-11-21 05:57:47,733 -not-actor-/PID:5295 elastic_transport.node DEBUG > GET /_tasks?actions=indices%3Aadmin%2Fforcemerge HTTP/1.1
> Accept: application/vnd.elasticsearch+json; compatible-with=8
> Authorization: Basic <hidden>
> Connection: keep-alive
> User-Agent: elasticsearch-py/8.6.1 (Python/3.11.4; elastic-transport/8.4.1)
> X-Elastic-Client-Meta: es=8.6.1,py=3.11.4,t=8.4.1,ai=3.9.0
< HTTP/1.1 200 OK
< Content-Length: 55
< Content-Type: application/vnd.elasticsearch+json;compatible-with=8
< X-Elastic-Product: Elasticsearch
< {"_shards":{"total":3250,"successful":3250,"failed":0}}

b-deam commented 12 months ago

It's definitely looking like a client side issue. I've had mixed luck reproducing the issue with a standalone script using both the vanilla AsyncElasticsearch client and RallyAsyncElasticsearch client, but I have a 100% success rate in producing this error when invoking the Rally track.

I enabled TRACE level HTTP request and response body logging at the server side, as well as setting some client side x-opaque-id headers to further assist mapping request and responses, and this is what I got.

Setting TRACE logging:

# network tracing in jvm.options
-Des.insecure_network_trace_enabled=true

curl -k -u elastic:changeme -X PUT "https://10.132.0.63:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
   "transient" : {
      "logger.org.elasticsearch.http.HttpTracer" : "TRACE",
      "logger.org.elasticsearch.http.HttpBodyTracer" : "TRACE"
   }
}
'

curl -k u elastic:changeme -XPUT "https://10.132.0.63:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
   "transient" : {
      "http.tracer.include" :[ "*/_tasks*" , "*forcemerge*" ],
      "http.tracer.exclude" : ""

   }
}
'

What Elasticsearch reports:

[2023-11-22T00:08:50,429][TRACE][o.e.h.HttpTracer         ] [elasticsearch-1] [3559][rally-forcemerge][POST][/logs-kafka.log-default,logs-nginx.access-default,logs-nginx.error-default,logs-k8-application.log-default,logs-mysql.slowlog-default,logs-mysql.error-default,logs-system.auth-default,logs-system.syslog-default,logs-postgresql.log-default,logs-apache.access-default,logs-apache.error-default,logs-redis.log-default,logs-redis.slowlog-default/_forcemerge?max_num_segments=1] received request from [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}]
[2023-11-22T00:08:50,429][TRACE][o.e.h.HttpBodyTracer     ] [elasticsearch-1] [3559] request body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/wMAAAAAAAAAAAA=
[2023-11-22T00:09:05,665][TRACE][o.e.h.HttpTracer         ] [elasticsearch-1] [3560][rally-tasks][GET][/_tasks?actions=indices:admin/forcemerge] received request from [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}]
[2023-11-22T00:09:05,666][TRACE][o.e.h.HttpBodyTracer     ] [elasticsearch-1] [3560] request body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/wMAAAAAAAAAAAA=
[2023-11-22T00:09:15,290][WARN ][o.e.c.c.Coordinator      ] [elasticsearch-1] This node is a fully-formed single-node cluster with cluster UUID [HR1fLqrpRBqfsV4ePrU94A], but it is configured as if to discover other nodes and form a multi-node cluster via the [discovery.seed_hosts=[10.132.0.63:9300]] setting. Fully-formed clusters do not attempt to discover other nodes, and nodes with different cluster UUIDs cannot belong to the same cluster. The cluster UUID persists across restarts and can only be changed by deleting the contents of the node's data path(s). Remove the discovery configuration to suppress this message.
[2023-11-22T00:09:17,217][TRACE][o.e.h.HttpBodyTracer     ] [elasticsearch-1] [3559] response body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/6tWii/OSCxKKVayqlYqyS9JzFGyMjYyNdBRKi5NTk4tLk4rhYukJWbmpKYoWRnU1gIAV44k8zcAAAA=
[2023-11-22T00:09:17,218][TRACE][o.e.h.HttpTracer         ] [elasticsearch-1] [3559][rally-forcemerge][OK][application/vnd.elasticsearch+json;compatible-with=8][55] sent response to [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}] success [true]
[2023-11-22T00:09:17,218][TRACE][o.e.h.HttpBodyTracer     ] [elasticsearch-1] [3560] response body (gzip compressed and base64-encoded; for details see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-network.html#http-rest-request-tracer): H4sIAAAAAAAA/31RTU/DMAz9LzlvpR9bx3qGA1wQHBACocok3hYtdUri8in+O86+AAloD4nznu3n53dF3mBUzbuaXd2wX1/zpbanJ8eXt709n1w/J4SgQ9UodBDZ6ogQ9GpcqJHiABR7H7gFYwJGqaOKPCuqMsuzumrmVZ4LbeUj/0Tk0fa/k4N3Sc+dMsAgcTpa7Z35uhMj8T5cBP+GtI9W/gA8Q+hSH1piTI+dqMeQLi61wc4zttoN6VVOu625GWnhJfV+pIA52IeBtwYdoEwkLOyyfcIQraftHPJL+ksPep1ZigzOoRGIw4DqQwpDXP/jc1NW8+N5Pd/4LSuRzD8WIiNJ3R1d6r72uGmzW4XgoHkry5KxGmMDprN0JNI1dhiWKBwRKGtj22Frqe2sc1bUFbM8r4tiVuWTUmqHgUj8O9AIyCfWtKxqaT+d1iOlgTQ6Bw9OZCzARZSFIxjxJs1yM77o4XHA8VkyI4grr+NvQj423yfrFW6LhgIAAA==
[2023-11-22T00:09:17,218][TRACE][o.e.h.HttpTracer         ] [elasticsearch-1] [3560][rally-tasks][OK][application/vnd.elasticsearch+json;compatible-with=8][null] sent response to [Netty4HttpChannel{localAddress=/10.132.0.63:9200, remoteAddress=/10.132.0.67:57740}] success [true]

If we then decode those responses:

# all responses
cat trace.log | grep 'HttpBodyTracer' | sed -e 's/.*: //' | base64 --decode | gzip --decompress | jq .
{
  "_shards": {
    "total": 3250,
    "successful": 3250,
    "failed": 0
  }
}
{
  "nodes": {
    "7RXtokVtQciED8QZpiJ4Vw": {
      "name": "elasticsearch-1",
      "transport_address": "10.132.0.63:9300",
      "host": "10.132.0.63",
      "ip": "10.132.0.63:9300",
      "roles": [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "transform.config_version": "10.0.0",
        "xpack.installed": "true"
      },
      "tasks": {
        "7RXtokVtQciED8QZpiJ4Vw:2398969": {
          "node": "7RXtokVtQciED8QZpiJ4Vw",
          "id": 2398969,
          "type": "transport",
          "action": "indices:admin/forcemerge",
          "start_time_in_millis": 1700611730429,
          "running_time_in_nanos": 15236989556,
          "cancellable": false,
          "headers": {
            "X-Opaque-Id": "rally-forcemerge"
          }
        }
      }
    }
  }
}

# request id 3559 which is forcemerge
$ cat trace.log | grep 'HttpBodyTracer' | grep '3559' | sed -e 's/.*: //' | base64 -
-decode | gzip --decompress | jq .
{
  "_shards": {
    "total": 3250,
    "successful": 3250,
    "failed": 0
  }
}

# request id 3560 which is tasks
$ cat trace.log | grep 'HttpBodyTracer' | grep '3560' | sed -e 's/.*: //' | base64 -
-decode | gzip --decompress | jq .
{
  "nodes": {
    "7RXtokVtQciED8QZpiJ4Vw": {
      "name": "elasticsearch-1",
      "transport_address": "10.132.0.63:9300",
      "host": "10.132.0.63",
      "ip": "10.132.0.63:9300",
      "roles": [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "transform.config_version": "10.0.0",
        "xpack.installed": "true"
      },
      "tasks": {
        "7RXtokVtQciED8QZpiJ4Vw:2398969": {
          "node": "7RXtokVtQciED8QZpiJ4Vw",
          "id": 2398969,
          "type": "transport",
          "action": "indices:admin/forcemerge",
          "start_time_in_millis": 1700611730429,
          "running_time_in_nanos": 15236989556,
          "cancellable": false,
          "headers": {
            "X-Opaque-Id": "rally-forcemerge"
          }
        }
      }
    }
  }
}

What the Python client's elastic transport package logged:

2023-11-22 00:09:17,219 -not-actor-/PID:21969 elastic_transport.node DEBUG > GET /_tasks?actions=indices%3Aadmin%2Fforcemerge HTTP/1.1
> Accept: application/vnd.elasticsearch+json; compatible-with=8
> Authorization: Basic <hidden>
> Connection: keep-alive
> User-Agent: elasticsearch-py/8.6.1 (Python/3.11.4; elastic-transport/8.4.1)
> X-Elastic-Client-Meta: es=8.6.1,py=3.11.4,t=8.4.1,ai=3.9.0
> X-Opaque-Id: rally-tasks
< HTTP/1.1 200 OK
< Content-Length: 55
< Content-Type: application/vnd.elasticsearch+json;compatible-with=8
< X-Opaque-Id: rally-forcemerge
< X-Elastic-Product: Elasticsearch
< {"_shards":{"total":3250,"successful":3250,"failed":0}}
2023-11-22 00:09:17,219 -not-actor-/PID:21969 elastic_transport.transport INFO GET https://10.132.0.63:9200/_tasks?actions=indices%3Aadmin%2Fforcemerge [status:200 duration:11.555s]

b-deam commented 12 months ago

Talk about convenient timing! I've confirmed we're hitting this bug https://github.com/aio-libs/aiohttp/issues/7864.

aiohttp is a transitive dependency of esrally and 3.9.0 was released on the 18th Nov, which is around the first time we noticed this bug caught by CI on the 19th.

I think I'll raise a PR to pin us to 3.8.6, which was the last release before 3.9.0.

b-deam commented 12 months ago

No longer needed, https://github.com/elastic/rally/pull/1806 is the fix.

elastic / rally

Capture 'ForceMerge' task list exceptions #1804