elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Configurable shard_size default for term aggregations #84744

Open jade-lucas opened 2 years ago

jade-lucas commented 2 years ago

Description

Request: The ability to set the default shard_size for the terms aggregation in index settings and/or in Kibana's advanced settings.

Problem Statement: In our environment, we have user groups that prefer to use Lens to "slice and dice" their data. One common theme we are starting to see is that when these users use the terms aggregation, they often point out data discrepancies in averages, medians, and similar metrics. When these discrepancies are brought to our engineers, we lay out the reasons why, as described in the link below. Often we direct the end user to an aggregation-based visualization in Kibana and provide a recommended shard_size to set in the JSON input section. This resolves the data discrepancy almost all of the time. However, our user groups commonly tell us they don't want to set the shard_size every time they create a visualization: they often forget to specify it, they don't really know what it does and misuse it, and some of the user groups prefer Lens (no shard_size support).
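For reference, this is the kind of per-request override we recommend today: shard_size is an existing parameter of the terms aggregation, set next to size in the search body. The field names below are made up for illustration:

```json
{
  "size": 0,
  "aggs": {
    "top_hosts": {
      "terms": {
        "field": "host.name",
        "size": 10,
        "shard_size": 500
      },
      "aggs": {
        "avg_latency": { "avg": { "field": "latency_ms" } }
      }
    }
  }
}
```

Each shard then returns its top 500 candidate terms instead of the default, so far fewer documents are missed when the top 10 and their averages are reduced on the coordinating node.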

Our developers are responsible for defining index/component templates. It would be ideal if they could define a default shard_size as an index setting in an index or component template. If not, perhaps the advanced settings section in Kibana would suffice? I think that allowing advanced users (developers/engineers/admins) to optionally configure the default shard_size would result in fewer reported data discrepancies, less triage time for the technical teams, and a better experience for all.

Proposed template setting the default shard_size for terms aggregations:

```json
{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "my_ilm_policy"
        },
        "refresh_interval": "15s",
        "number_of_shards": "3",
        "number_of_replicas": "1",
        "term_default_shard_size": "$size * 2.5 + 15"
      }
    }
  },
  "_meta": {
    "description": "My Description"
  }
}
```
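To make the intended effect concrete: with such a setting in place, a query that omits shard_size (which is what Lens generates) would inherit the template default. The term_default_shard_size setting and this behavior are part of the proposal, not an existing Elasticsearch feature:

```json
{
  "size": 0,
  "aggs": {
    "top_hosts": {
      "terms": {
        "field": "host.name",
        "size": 10
      }
    }
  }
}
```

Under the proposed formula, this request would run with shard_size = 10 * 2.5 + 15 = 40 on any index created from the template above.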

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-shard-size
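For context, the linked docs give the built-in default as shard_size = (size * 1.5) + 10 when it is not set explicitly, so a terms aggregation with size: 10 behaves as if it were written:

```json
{
  "aggs": {
    "top_hosts": {
      "terms": {
        "field": "host.name",
        "size": 10,
        "shard_size": 25
      }
    }
  }
}
```

Only 25 candidate terms per shard is often too few on high-cardinality fields, which is the root of the discrepancies described above.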

elasticmachine commented 2 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

wchaparro commented 2 years ago

Hey there @jade-lucas,

Thanks for your request and detailed description of your use case. Currently, there is no straightforward mechanism for calculating shard size. It's more complicated than performing a simple calculation and we’d like to ensure that we do this right and in a general manner. The general solution is to be able to calculate the aggregation more accurately, and specifically for things like terms aggregation on rare terms, even if it means we need to take more time to do so. This also means we will determine the right shard_size for you to increase accuracy. We are considering this for our longer term roadmap.

Doing the simple calc for the number of shards is something that can be done now. We are keeping this issue open and linking to the related meta issue.