elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ES|QL: Support aggregation commands on histogram fields #103060

Open dgieselaar opened 11 months ago

dgieselaar commented 11 months ago

Description

Currently, running an ES|QL aggregation command on a histogram field results in an error. Aggregations should be supported on histogram fields, similar to how _search supports aggregations on histogram fields.

Use cases

In APM we use histograms to store latency distribution data (in transaction.duration.histogram). The aggregations we currently run on this field are: avg, pxx, sum, value_count.

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-ql (Team:QL)

elasticsearchmachine commented 11 months ago

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

nik9000 commented 11 months ago

Do you have any examples of things? I can guess, but it'd be nice to have an example of the kinds of STATS you expect.

One problem with the STATS here is that ESQL allows a lot more slicing than _search does, so it'd be easier to put the query into a state where it wouldn't have the data. I'm kind of imagining something like FROM foo | WHERE hostname = 'blah' | STATS PERCENTILES(bytes_out) where hostname is a field that got removed in a downsampling operation. I suppose that thing's just not supported. I guess we'd get it for free by the field just not being there. Though maybe the error message should be different? I dunno.

dgieselaar commented 11 months ago

@nik9000 AVG, SUM, MIN, MAX, Pxx. I'm not sure if I follow your example?

elasticsearchmachine commented 10 months ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

luigidellaquila commented 10 months ago

I think first of all we'll have to support the histogram field type (read and output at least). Since a histogram field is practically an object containing two arrays, I can imagine it being returned as JSON. Supporting new field types has some cost by itself and is not trivial.

After that, we can start defining the behavior of the individual agg functions, starting with min, max, count and avg. I guess it won't be much different from how _search implements them, e.g. for the following sample data:

PUT my-index-000001
{
  "mappings" : {
    "properties" : {
      "my_histogram" : {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}

PUT my-index-000001/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], 
      "counts" : [8, 17, 8, 7, 6, 2] 
   }
}

I guess the ESQL usage will be something like:

from my-index* | stats max = max(my_histogram), count = count(my_histogram) by my_text;

my_text     | max    | count
histogram_1 | 0.5    | 51
histogram_2 | 0.5    | 48

where max(my_histogram) is calculated on the "values", while count(my_histogram) is the sum of the "counts". We will have to define the behavior of each aggregation function individually, but at first glance it seems pretty natural, at least for the basic aggs, and we can start from this as a guideline.
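To make the proposed semantics concrete, here is a minimal Python sketch of the basic aggregations over the two sample documents. It assumes min and max are taken over "values", count is the sum of "counts", and sum and avg are weighted by "counts". The helper name histogram_stats and the choice to skip zero-count buckets are assumptions of this sketch, not anything ES|QL or _search defines.

```python
def histogram_stats(values, counts):
    """Basic aggregations over one pre-aggregated histogram (hypothetical helper)."""
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    # Assumption of this sketch: buckets with a zero count are ignored,
    # so they cannot influence min or max.
    buckets = [(v, c) for v, c in zip(values, counts) if c > 0]
    if not buckets:
        return {"min": None, "max": None, "count": 0, "sum": 0.0, "avg": None}
    total = sum(c for _, c in buckets)             # count = sum of "counts"
    weighted_sum = sum(v * c for v, c in buckets)  # sum = sum of value * count
    return {
        "min": min(v for v, _ in buckets),         # min/max come from "values"
        "max": max(v for v, _ in buckets),
        "count": total,
        "sum": weighted_sum,
        "avg": weighted_sum / total,               # count-weighted average
    }


# The two sample documents from the comment above, keyed by my_text.
docs = {
    "histogram_1": {
        "values": [0.1, 0.2, 0.3, 0.4, 0.5],
        "counts": [3, 7, 23, 12, 6],
    },
    "histogram_2": {
        "values": [0.1, 0.25, 0.35, 0.4, 0.45, 0.5],
        "counts": [8, 17, 8, 7, 6, 2],
    },
}

# Emulates "... | stats ... by my_text": one result row per group.
results = {
    name: histogram_stats(h["values"], h["counts"]) for name, h in docs.items()
}

for name, row in results.items():
    print(name, row["max"], row["count"])
```

Grouping by my_text here is trivial because each group holds a single document; with several documents per group, the per-document results would still need to be merged (counts and sums added, min and max combined).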

Wondering if it makes sense to allow histogram fields in other commands apart from STATS. Maybe they can be used in EVAL for simple assignment (no manipulation, at least in a first phase) and KEEP/DROP, but it's hard for me to imagine how to use them in commands like SORT, ENRICH and so on.

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

not-napoleon commented 6 months ago

I suggest we wait to implement histogram support until we encode the algorithm in the field (see https://github.com/elastic/elasticsearch/issues/108208). This will let us choose the appropriate sketch for percentiles against the histogram, at a minimum, and may influence the implementation of other aggregations.
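For context on the percentile case, the sketch below shows the simplest possible computation over a histogram's "values" and "counts": a nearest-rank percentile that walks the cumulative counts. This is only an illustration in Python and is not the sketch-based (e.g. t-digest or HDR) computation that the comment above refers to; the function name histogram_percentile is hypothetical.

```python
import math


def histogram_percentile(values, counts, p):
    """Nearest-rank percentile over a pre-aggregated histogram (hypothetical helper)."""
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    total = sum(counts)
    if total == 0:
        return None
    # Nearest-rank: the smallest rank whose cumulative share reaches p percent.
    rank = max(1, math.ceil(p / 100 * total))
    cumulative = 0
    # Walk the buckets in ascending value order, accumulating counts.
    for value, count in sorted(zip(values, counts)):
        cumulative += count
        if cumulative >= rank:
            return value


# Sample data taken from the documents earlier in this thread.
values_1 = [0.1, 0.2, 0.3, 0.4, 0.5]
counts_1 = [3, 7, 23, 12, 6]
values_2 = [0.1, 0.25, 0.35, 0.4, 0.45, 0.5]
counts_2 = [8, 17, 8, 7, 6, 2]

p50_1 = histogram_percentile(values_1, counts_1, 50)
p95_1 = histogram_percentile(values_1, counts_1, 95)
p50_2 = histogram_percentile(values_2, counts_2, 50)
print(p50_1, p95_1, p50_2)
```

A result can only ever be one of the stored "values", so its accuracy depends entirely on how the histogram was bucketed. That is one reason knowing the originating algorithm would matter for choosing how to compute percentiles.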

jindrichpilar-kosik commented 2 months ago

Hi,

I would like to suggest adding a histogram aggregation on histogram fields. There is already an issue for visualizing this, but the implementers decided to wait for ES|QL support. https://github.com/elastic/kibana/issues/112390#issuecomment-2009822801

As a customer, we see use cases for it in the Observability area, as we have histograms via OpenTelemetry.