elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Bucket Histogram should return the center of bin and not the start of bin #46074

Closed gk-patel closed 4 years ago

gk-patel commented 5 years ago

The "key" returned by the bucket histogram aggregations always represents the "start" of the bucket. IMHO, the key should be the center of the bucket (i.e. the mean of start and end), because this is also better for visualization.

For example, for the following request,

"aggs": {
    "2": {
      "histogram": {
        "field": "some_value",
        "interval": 1,
        "min_doc_count": 1
      }
    }
  }

the following response was generated,

"aggregations": {
    "2": {
      "buckets": [
        {
          "key": 0,  "doc_count": 31
        },
        {
          "key": 6,  "doc_count": 1
        },
        {
          "key": 7,  "doc_count": 1
        }

And in Kibana, it is shown as follows,

image

But this is technically wrong, because the values 6 and 7 should be shown at the start of the bucket.

I think it's a good idea to show the "key" at the center of the bar, but it should be set to the correct value. In my example, that would be 6.5 and 7.5.

Thank you all for your time and also for considering my request.

elasticmachine commented 5 years ago

Pinging @elastic/es-search

elasticmachine commented 5 years ago

Pinging @elastic/es-analytics-geo

polyfractal commented 5 years ago

This is one of those changes that makes one person happy but another person sad :)

Buckets are keyed by their inclusive start value, and the width of the bucket is the specified interval. How this data is charted is up to the client, and it would be fairly simple for the client to re-center labels based on the key + interval. FWIW, the current labeling scheme used by Kibana is how most other applications generate their histogram labels (Excel, etc).
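The "key + interval" re-centering described above can be sketched on the client side as follows. This is only an illustration in Python, not something Elasticsearch or Kibana provides: `recenter_buckets` is a hypothetical helper name, and it assumes every bucket has the same fixed width and that `key` is the inclusive start of the bucket. The sample buckets are the ones from the response at the top of this issue.

```python
def recenter_buckets(buckets, interval):
    """Return copies of histogram buckets with a 'center' field added.

    Assumes each bucket's 'key' is the inclusive start of the bucket and
    that every bucket has the same fixed width `interval`.
    """
    return [
        {**bucket, "center": bucket["key"] + interval / 2}
        for bucket in buckets
    ]


# Buckets copied from the example response in this issue (interval = 1).
buckets = [
    {"key": 0, "doc_count": 31},
    {"key": 6, "doc_count": 1},
    {"key": 7, "doc_count": 1},
]

centered = recenter_buckets(buckets, interval=1)
print([bucket["center"] for bucket in centered])  # [0.5, 6.5, 7.5]
```

The original `key` is left untouched, so a client can still use it for filtering and use `center` only for label placement.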

It might also be confusing if you are working with purely discrete numbers (long, integer, etc) and then start to see buckets with fractional keys.

But this is technically wrong, because the values 6 and 7 should be shown at the start of the bucket.

I'm not sure I'm following this part. What's incorrect about the response/chart?

gk-patel commented 5 years ago

@polyfractal Thank you very much for your answer, but I would like to politely disagree with you.

Buckets are keyed by their inclusive start value, and the width of the bucket is the specified interval.

You are right. In most cases the client has the interval, but there are also cases where the interval is decided by Elasticsearch (e.g. auto_date_histogram, and I also know there is a feature request for something like auto_interval_histograms). In such cases, ES gives back the evaluated interval, but it is not easy to perform a simple key + interval calculation, because the returned interval is non-numeric, e.g. interval:'1d' being returned by auto_date_histogram.

FWIW, the current labeling scheme used by Kibana is how most other applications generate their histogram labels (Excel, etc).

You are partially right. In most charting applications/libraries, the key is shown at the center of the bar in a bar chart, but not in a histogram chart. There is a slight difference between the two, but from Kibana's perspective they are the same (because Elasticsearch does all the calculations). For something like plotly.js, however, these are two different things. https://plot.ly/javascript/histograms/ https://plot.ly/javascript/bar-charts/ And if you observe the x-axis in both cases, you will see that plotly.js labels them differently.

It might also be confusing if you are working with purely discrete numbers (long, integer, etc) and then start to see buckets with fractional keys.

Simply because it is confusing (or looks unusual at first glance) does not mean that it should be represented incorrectly. Even in my example, I am working with discrete values, and it took me some time to realize that something was going wrong (especially when applying filters).

I'm not sure I'm following this part. What's incorrect about the response/chart?

I did clarify that previously, but I can elaborate again. The value "6" is the start of the bucket, but it is shown at the center of the bucket, where it should say "6.5" (or, if you want to adopt the plotly.js histogram convention, it should say 6-7).

I see only two options:

  1. Shift "6" to the start of the bin. This is the less preferred option, as it would only solve the problem for Kibana. If someone is using some other charting library, these libraries always put the value at the center of the bin (and they don't have a flag to shift the value to the start).
  2. Return the average/center of the bin (6.5) from Elasticsearch. This is preferable, as it will work out of the box with Kibana and also with other client-side frameworks. Also, this will be visually more appealing (except for the people working with discrete values, I guess, haha), as it will be at the center of the bar, which is how most people are used to seeing both bar and histogram charts.

Thank you very much for your consideration and patience in reading my long explanation. Looking forward to your comments.

polyfractal commented 5 years ago

Sorry for the delay in responding. I think there are two separate concerns here:

  1. The response that Elasticsearch returns
  2. How external clients (Kibana, etc) plot the data

Item 2 is not under control of Elasticsearch, so the goal of ES is to provide enough data for an external client to plot it however they desire. For "fixed" histograms (histogram, date_histogram) the information is known because the client provides the interval, and we provide the start key for each bucket.

As you mentioned, some aggregations pick the interval themselves (auto_date_histogram, and in the future an equivalent for numeric histograms). But these aggs return the chosen interval, which allows the client to work out any labels they want. Most major languages have various tools/libraries to work out times and durations. E.g. in JS you could use MomentJS, a tiny bit of string parsing and some date math to find the center of a bucket.
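The same idea (a bit of string parsing plus date math) can be sketched in Python rather than JS/MomentJS. This is a rough sketch under stated assumptions, not an official client utility: `parse_interval` and `bucket_center` are hypothetical names, the interval is assumed to be a string such as '1d' or '3h', the bucket key is assumed to be epoch milliseconds, and only fixed-length units (seconds, minutes, hours, days) are handled. Month and year intervals vary in length and would need calendar-aware arithmetic.

```python
import re
from datetime import datetime, timedelta, timezone

# Only fixed-length units are supported here; months and years vary in
# length and are deliberately rejected by this sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(interval):
    """Turn an interval string such as '1d' or '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", interval)
    if match is None:
        raise ValueError(f"unsupported interval: {interval!r}")
    count, unit = match.groups()
    return timedelta(seconds=int(count) * _UNIT_SECONDS[unit])


def bucket_center(key_millis, interval):
    """Return the midpoint of a bucket that starts at key_millis (epoch ms)."""
    start = datetime.fromtimestamp(key_millis / 1000, tz=timezone.utc)
    return start + parse_interval(interval) / 2


# Example: a one-day bucket whose key is an epoch-millisecond timestamp.
print(bucket_center(1565049600000, "1d").isoformat())
```

Keeping the parsing separate from the midpoint calculation makes it easy to swap in a library-based duration parser later without touching the labeling code.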

I'll mark this as team-discuss and we'll talk about it at our next team meeting. My feeling is that this should be an enhancement request for Kibana to allow more flexibility with respect to histogram labels, but that there's nothing to be done on the ES side.

gk-patel commented 5 years ago

Hi @polyfractal

I agree with you that this problem can be solved at two levels (either in ES or in the frontend client). But I would appreciate it if it were solved at the ES level.

I would like to suggest how this can be tackled on the ES side. Looking at the official histogram documentation, it can be seen that there are a number of parameters which can be passed to ES to alter the behaviour/format of the output -- some examples are min_doc_count, missing, keyed, etc. Similarly, you could add a parameter such as key_at_center_of_bucket, telling ES to set the key to the mean of the start and end of the bucket. This way the parameter would be optional and would only be used by people who need it.

Thank you for your consideration.

gk-patel commented 4 years ago

Hi @polyfractal,

Any updates on this issue? Thanks in advance.

polyfractal commented 4 years ago

Hi, sorry for the (very long) delay. We talked about this as a team a while ago and decided not to implement it, sorry :( We understand that different users will want to plot things in different ways, but didn't think this was worth the complexity of introducing more options into Elasticsearch itself (parameters, different code paths, tests, maintenance burden, etc). We think this is probably something that the consuming client should deal with, be it Kibana or a custom application.

Thanks for the enhancement request, and apologies for the delay!