elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Collect and display execution metadata for ES|QL cross cluster searches #112402

Closed quux00 closed 1 month ago

quux00 commented 2 months ago

Description

Enhance query responses to include information about the shards and clusters against which the query was executed.

The goal of this issue is to provide parity between the metadata displayed for cross-cluster searches in _search and ES|QL.

In particular, for a cross-cluster query:

  1. record and display overall took time (in millis)
  2. display total number of clusters involved in the search, how many are still running the search or were successful, skipped, or failed in the search
  3. display metadata details of the search for each cluster, including status, indices searched, took time (per cluster), and shard accounting (total, successful, skipped, failed)

Here is an example of the above metadata in a SearchResponse (using _search or _async_search). Some of the fields or statuses in _search are not (yet?) supported in ES|QL, so they will not be used in ES|QL for the time being.

  "took": 6107,    // overall search took time (millis)
  "_clusters": {
    "total": 3,
    "successful": 2,
    "skipped": 0,
    "running": 0,
    "partial": 1,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "blogs",
        "took": 57,           // took time for search on each cluster
        "timed_out": false,   // not relevant to ESQL
        "_shards": {
          "total": 15,
          "successful": 15,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "successful",
        "indices": "*",
        "took": 53,
        "timed_out": false,
        "_shards": {
          "total": 6,
          "successful": 6,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "partial",   // status currently not supported by ESQL
        "indices": "logs*",
        "took": 6091,
        "timed_out": false,
        "_shards": {
          "total": 15,
          "successful": 14,
          "skipped": 0,
          "failed": 1
        },
        "failures": [     // partial success/failures currently not supported by ESQL
          {
            "shard": 0,
            "index": "remote1:blogs",
            "node": "y4KZhJlFR16Mh07G-gd-dA",
            "reason": {
              "type": "query_shard_exception",
              "reason": "failed to create query: [blogs][0] remote1 exception",
              "index_uuid": "Bbf_tTEJSi2t_b06PaQOBw",
              "index": "remote1:blogs",
              "caused_by": {
                "type": "runtime_exception",
                "reason": "runtime_exception: [blogs][0] remote1 exception"
              }
            }
          }
        ]
      }
    }
  }
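To illustrate how a client might consume the `_clusters` section above, here is a small sketch (not Elasticsearch code; the function name is illustrative) that collects shard failure types from clusters that did not fully succeed:

```python
# Illustrative client-side sketch: walk the "_clusters" section of a
# SearchResponse and collect failure reasons reported by clusters whose
# status is "partial" or "failed".
def collect_cluster_failures(clusters: dict) -> dict:
    """Map cluster alias -> list of failure types for non-successful clusters."""
    failures = {}
    for alias, detail in clusters.get("details", {}).items():
        if detail.get("status") in ("partial", "failed"):
            reasons = [f["reason"]["type"] for f in detail.get("failures", [])]
            failures[alias] = reasons
    return failures

clusters = {
    "total": 2,
    "details": {
        "(local)": {"status": "successful"},
        "remote1": {
            "status": "partial",
            "failures": [{"reason": {"type": "query_shard_exception"}}],
        },
    },
}
print(collect_cluster_failures(clusters))  # {'remote1': ['query_shard_exception']}
```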

The ES|QL response will be modified to look like:

{
  "is_running": false,
  "took": 381,          // new field (overall query took time)
  "all_columns": [
    {
      "name": "authors.first_name",
      "type": "text"
    },
    {
      "name": "publish_date",
      "type": "date"
    }
  ],
  "columns": [   // contents elided 
  ],
  "values": [  // contents elided
  ],
  "_clusters": {  // new cluster details section
    "total": 3,
    "successful": 3,
    "running": 0,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "blogs",
        "took": 373,
        "_shards": {
          "total": 13,
          "successful": 13,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "successful",
        "indices": "remote2:web_traffic,remote2:blogs",
        "took": 371,
        "_shards": {
          "total": 6,
          "successful": 6,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "successful",
        "indices": "remote1:blogs,remote1:web_traffic",
        "took": 377,
        "_shards": {
          "total": 15,
          "successful": 15,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  }
}
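As a sanity check on the proposed shape, the summary counters in `_clusters` should partition the clusters listed under `details`. A small illustrative sketch (not part of the proposal) of that invariant:

```python
# Sketch of a consistency check a client or test might apply to the new
# "_clusters" section: every cluster under "details" is counted in exactly
# one summary bucket, and the buckets sum to "total". Field names follow
# the proposed response shape above.
from collections import Counter

def check_cluster_counters(clusters: dict) -> bool:
    statuses = Counter(d["status"] for d in clusters["details"].values())
    buckets = ("successful", "skipped", "running", "partial", "failed")
    counted = sum(clusters.get(b, 0) for b in buckets)
    return (counted == clusters["total"]
            and all(clusters.get(s, 0) == n for s, n in statuses.items()))
```

Applied to the ES|QL example above (total 3, all successful), this check passes.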

Additional changes

End user docs around cross-cluster search will need to be updated.

Stakeholders

The Kibana team has asked for the cluster metadata in ES|QL to be as similar as possible to that in SearchResponses. They will need to review the proposal here.



Implementation / Detailed Design Proposal

We introduce an EsqlExecutionInfo object that is patterned (largely copied) from the SearchResponse.Clusters class in the _search codebase.

It will track all of the information described in the previous section.

TransportEsqlQueryAction / field-caps / IndexResolver

The cluster map in EsqlExecutionInfo is first populated at this stage, since this is the first point in the ES|QL processing chain where the clusters involved in the query are resolved. For each cluster involved in the search (including the local, querying cluster), a Cluster object will be created holding the cluster alias, a status of RUNNING, and the skip_unavailable setting, and put into the EsqlExecutionInfo created for this particular ES|QL search.
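The initialization step can be sketched as follows. This is a hypothetical Python shape, not the Java implementation; the names Cluster and EsqlExecutionInfo mirror the proposal, everything else is illustrative:

```python
# Hypothetical sketch of cluster-map initialization after index resolution:
# one Cluster entry per cluster alias, each starting with status RUNNING.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    alias: str
    index_expression: str
    skip_unavailable: bool
    status: str = "running"  # every cluster starts as RUNNING

@dataclass
class EsqlExecutionInfo:
    clusters: dict = field(default_factory=dict)

    def init_cluster(self, alias: str, indices: str, skip_unavailable: bool):
        self.clusters[alias] = Cluster(alias, indices, skip_unavailable)

info = EsqlExecutionInfo()
info.init_cluster("(local)", "blogs", False)
info.init_cluster("remote1", "*", True)
```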

If an async ES|QL query is made and the async endpoint is polled while the search is still running on a given cluster, the async search response will include the clusters with running status, like so:

  "_clusters": {
    "total": 2,
    "successful": 0,
    "skipped": 0,
    "running": 2,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "running",
        "indices": "blogs",
      },
      "remote1": {
        "status": "running",
        "indices": "*",
      }
    }
  }

Modifications to ComputeListener

ComputeListener and ComputeResponse are used by both local and remote clusters, both for cross-cluster (minimize_round_trips=true) requests and for within-cluster DataNodeRequests. To properly account for the metadata coming back, it was necessary to provide separate methods within ComputeListener for these two pathways so that it is clear what metadata to expect in the ComputeResponse.
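The split between the two pathways can be sketched roughly as follows. All names here are illustrative Python, not the actual Java API; the point is only that separate acceptor methods make the expected metadata explicit:

```python
# Hypothetical sketch of the two ComputeListener pathways described above.
class ComputeListenerSketch:
    def __init__(self):
        self.took_by_cluster = {}   # per-remote-cluster execution metadata
        self.data_node_profiles = []  # within-cluster, per-node results

    def on_cross_cluster_response(self, cluster_alias: str, took_millis: int):
        # minimize_round_trips=true path: one response per remote cluster,
        # carrying cluster-level metadata such as took time
        self.took_by_cluster[cluster_alias] = took_millis

    def on_data_node_response(self, profile: dict):
        # DataNodeRequest path: responses from data nodes in one cluster
        self.data_node_profiles.append(profile)

listener = ComputeListenerSketch()
listener.on_cross_cluster_response("remote1", 377)
listener.on_data_node_response({"node": "n1"})
```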

Shard Accounting

Shard accounting is done after the Search Shards API is called. Here we get a count of the total number of shards in scope for the search and how many, if any, will be skipped (due to can_match). That information is then recorded for each cluster in the EsqlExecutionInfo.Cluster object.

This will provide accurate counts of the total number of shards and the number of skipped shards. We also fill in failed_shards=0 and successful_shards=total_shards since ES|QL does not currently return partial results when searches on some shards fail. If ES|QL does change later to support partial results in that way, then additional shard accounting will be added during/after the DataNodeRequest phase of ES|QL processing.
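The filling rule described above can be stated compactly. A minimal sketch (function name is illustrative) of how the per-cluster shard counts would be derived from the Search Shards API results:

```python
# Sketch of the shard-accounting rule: total and skipped come from the
# Search Shards API (skipped via can_match); because ES|QL does not yet
# return partial results, failed is fixed at 0 and successful equals total.
def record_shard_counts(total_shards: int, skipped_shards: int) -> dict:
    return {
        "total": total_shards,
        "successful": total_shards,  # no partial results, so all count as successful
        "skipped": skipped_shards,
        "failed": 0,
    }
```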

quux00 commented 2 months ago

What information do we want in the cluster/details when a user specifies a remote cluster where there is no matching index?

Example: query is:

FROM logs-*,cluster-a:no_such_index | STATS sum(v)

Do we want only the local cluster present in the response or do we want the remote "cluster-a" also listed, like so:

  "_clusters": {
    "total": 2,
    "successful": 2,
    "running": 0,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "logs-1",
        "took": 373,
        "_shards": {
          "total": 13,
          "successful": 13,
          "skipped": 0,
          "failed": 0
        }
      },
      "cluster-a": {
        "status": "successful",
        "indices": "cluster-a:no_such_index",
        "took": 13,
        "_shards": {
          "total": 0,   // while it is considered a "successful search" (no failures), the shard count of 0 indicates that no shards were searched
          "successful": 0,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  }

The above model is how this is handled in _search, so I plan to follow the same model for ES|QL unless Product or the ES|QL team wants this to intentionally behave differently in ES|QL.
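Under that model, a client can distinguish the "no matching index" case by looking for a successful cluster with zero total shards. A brief illustrative sketch (not Elasticsearch code):

```python
# Sketch: find clusters that report status "successful" but searched zero
# shards, which under the _search model signals "no matching index".
def clusters_with_no_matches(clusters: dict) -> list:
    return [alias for alias, d in clusters.get("details", {}).items()
            if d.get("status") == "successful"
            and d.get("_shards", {}).get("total", 0) == 0]

details = {
    "details": {
        "(local)": {"status": "successful", "_shards": {"total": 13}},
        "cluster-a": {"status": "successful", "_shards": {"total": 0}},
    }
}
print(clusters_with_no_matches(details))  # ['cluster-a']
```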

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)