elastic / kibana

[Dataset quality] Optimize `degraded_docs` query #185971

Open awahab07 opened 4 months ago

awahab07 commented 4 months ago

Dataset Quality, while fetching the degraded docs percentage for data streams, uses composite aggregations to produce the information shown in the endpoint result below.

Problem

The query that fetches total documents per data stream per space (query 1 below) is significantly slower on large clusters: more than 10 times slower than the ignored-documents query (query 2 below, which adds an `_ignored` exists filter). This is particularly true on clusters that are busy ingesting live logs.

Also, the endpoint issues an extra call to Elasticsearch to fetch a last, empty page of buckets, which can be avoided.
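A minimal sketch of how the extra call can be avoided, assuming the handler pages through the composite aggregation itself (the types and helper below are illustrative, not actual Kibana code): a page that comes back with fewer buckets than the requested `size` is necessarily the last one, so there is no need to issue another request with `after_key`.

```typescript
// Shape of one page of a composite aggregation response (simplified).
interface CompositePage {
  buckets: Array<{ key: Record<string, string>; doc_count: number }>;
  after_key?: Record<string, string>;
}

// Elasticsearch returns an `after_key` even on the final page, so checking
// for its presence alone always triggers one extra, empty request. Comparing
// the page size against the requested size avoids that last round trip.
function hasNextPage(page: CompositePage, requestedSize: number): boolean {
  return page.buckets.length === requestedSize && page.after_key !== undefined;
}
```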

Endpoint

Endpoint: `/internal/dataset_quality/data_streams/degraded_docs`

Result:

[
  {
    "dataset": "logs-apache.access-default",
    "count": 490,
    "docsCount": 9692,
    "percentage": 5.055716054477919
  },
  {
    "dataset": "logs-apache.error-default",
    "count": 33,
    "docsCount": 23900,
    "percentage": 0.13807531380753138
  }
]

Queries used:

  1. To get total docs per data stream per space: POST /logs-*/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "@timestamp": {
                  "gte": "now-3d/d",
                  "lte": "now/d"
                }
              }
            },
            {
              "term": {
                "data_stream.type": "logs"
              }
            }
          ]
        }
      },
      "aggs": {
        "datasets": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dataset": {
                  "terms": {
                    "field": "data_stream.dataset"
                  }
                }
              },
              {
                "namespace": {
                  "terms": {
                    "field": "data_stream.namespace"
                  }
                }
              }
            ]
          }
        }
      }
    }
  2. To get `_ignored` docs per data stream per space: POST /logs-*/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "@timestamp": {
                  "gte": "now-3d/d",
                  "lte": "now/d"
                }
              }
            },
            {
              "term": {
                "data_stream.type": "logs"
              }
            }
          ],
          "must": {
            "exists": {
              "field": "_ignored"
            }
          }
        }
      },
      "aggs": {
        "datasets": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dataset": {
                  "terms": {
                    "field": "data_stream.dataset"
                  }
                }
              },
              {
                "namespace": {
                  "terms": {
                    "field": "data_stream.namespace"
                  }
                }
              }
            ]
          }
        }
      }
    }
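The endpoint's response shown above can be derived by joining the buckets of the two queries on the dataset/namespace pair. A hedged sketch of that join (types and helper names are illustrative, not the actual Kibana implementation):

```typescript
// One bucket from either composite aggregation above (simplified).
interface DatasetBucket {
  key: { dataset: string; namespace: string };
  doc_count: number;
}

// Response item shape, mirroring the endpoint result in this issue.
interface DegradedDocsStat {
  dataset: string;
  count: number; // docs with `_ignored`
  docsCount: number; // total docs
  percentage: number;
}

function mergeDegradedDocs(
  totalBuckets: DatasetBucket[],
  ignoredBuckets: DatasetBucket[]
): DegradedDocsStat[] {
  // Index the ignored-docs buckets by "dataset-namespace" for O(1) lookup.
  const ignoredByKey = new Map(
    ignoredBuckets.map((b) => [`${b.key.dataset}-${b.key.namespace}`, b.doc_count])
  );
  return totalBuckets.map((b) => {
    const key = `${b.key.dataset}-${b.key.namespace}`;
    const count = ignoredByKey.get(key) ?? 0;
    return {
      dataset: `logs-${key}`, // e.g. "logs-apache.access-default"
      count,
      docsCount: b.doc_count,
      percentage: (count / b.doc_count) * 100,
    };
  });
}
```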

Preview:

https://github.com/elastic/kibana/assets/2748376/e645ce7a-f3f2-4b65-8332-15706f09a408

### Tasks
- [ ] Optimize the total documents per data stream per space query.
- [x] Prevent an extra call to ES if there are no more records to fetch. https://github.com/elastic/kibana/pull/185975
elasticmachine commented 4 months ago

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

yngrdyn commented 3 months ago

This was investigated by @flash1293.

Problem

Elasticsearch can optimize a single terms aggregation, but not nested terms or composite aggregations, so it has to visit every document, which is expensive.

Ideas to explore

  1. Do one big filters agg covering all dataset/namespace pairs - this should be heavily optimized and return very quickly. Do the composite grouping in Kibana instead. Concerns:
    • Large responses will consume resources in Kibana to process.
  2. Change the UI: only fetch per dataset, not per namespace. Then, after the user selects a dataset, fetch its namespaces. Concerns:
    • The current UI might not be the ideal UI for users; let's wait for some telemetry to arrive and base decisions on data.
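Idea 1 could look roughly like the following sketch. The helper is hypothetical and assumes the dataset/namespace pairs are already known (e.g. derived from the data stream names); each pair becomes one named filter, so Elasticsearch answers the whole request as a set of independent filter counts rather than a composite aggregation.

```typescript
// Hypothetical sketch of idea 1: turn known dataset/namespace pairs into a
// single `filters` aggregation body. Not actual Kibana code.
function buildFiltersAgg(pairs: Array<{ dataset: string; namespace: string }>) {
  const filters: Record<string, object> = {};
  for (const { dataset, namespace } of pairs) {
    // One named filter per data stream, keyed "dataset-namespace".
    filters[`${dataset}-${namespace}`] = {
      bool: {
        filter: [
          { term: { 'data_stream.dataset': dataset } },
          { term: { 'data_stream.namespace': namespace } },
        ],
      },
    };
  }
  return { datasets: { filters: { filters } } };
}
```

The response would then contain one bucket per named filter, and Kibana would assemble the per-data-stream counts itself, at the cost of processing a potentially large response.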