elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[TSDB] Counter missing aggregations that are unsupported #161369

Open constanca-m opened 1 year ago

constanca-m commented 1 year ago

Summary

Currently, only `last_value` and `max` are supported for counter metrics when TSDB is enabled. However, this does not seem to be enough for "overview" visualizations. Example: for HTTP metrics, how do we build a visualization that provides this information?

Detailed example:

Let's consider an example data stream that expects documents with four parameters:

  1. timestamp: date
  2. method: keyword
  3. count: integer <- type counter
  4. host: keyword
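As a sketch, the corresponding TSDB setup for this example (template and index names are assumed here, not taken from the issue) would mark `method` and `host` as dimensions and `count` as a counter:

```json
PUT _index_template/example-tsdb
{
  "index_patterns": ["metrics-example-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["method", "host"]
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "method": { "type": "keyword", "time_series_dimension": true },
        "count": { "type": "integer", "time_series_metric": "counter" },
        "host": { "type": "keyword", "time_series_dimension": true }
      }
    }
  }
}
```

The `time_series_metric: counter` annotation is what restricts the aggregations Lens offers for `count`.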

For this example data stream we have these documents:

  1. Timestamp 1:

| Method | Count | Host   |
| ------ | ----- | ------ |
| GET    | 100   | host-1 |
| POST   | 53    | host-1 |
| PUT    | 2     | host-1 |
| DELETE | 23    | host-1 |

  2. Timestamp 2:

| Method | Count | Host   |
| ------ | ----- | ------ |
| GET    | 257   | host-1 |
| POST   | 60    | host-1 |
| PUT    | 13    | host-1 |
| DELETE | 28    | host-1 |

  3. Timestamp 3:

| Method | Count | Host   |
| ------ | ----- | ------ |
| GET    | 346   | host-1 |
| POST   | 83    | host-1 |
| PUT    | 37    | host-1 |
| DELETE | 52    | host-1 |

Given these documents, our example user decided that they wanted to visualize the total sum of requests per timestamp:

[image: total sum of requests per timestamp]

However, the user enabled TSDB and realized that `sum` is no longer supported; they could only choose between `max` and `last_value`. Without much knowledge of the meaning of the visualization, they chose `last_value`:

[image: last_value visualization]

This visualization is obviously not correct. The user needs to break it down by the field `method`, since that is what differs between those documents:

[image: last_value broken down by method]

The problem seems to be "fixed", although the meaning of the visualization has completely changed. But what if the index had two different hosts? If, for every timestamp, we received a document for each method, the customer would have to break down the visualization by yet another field: `host`. And now the visualization looks like this:

[image: last_value broken down by method and host]

And this breakdown could go on: the documents might be split not only by host and method, but also by status (200, 400, 404, etc.), for example. There is also a limit on breakdowns in Kibana (the limit is 4), so at some point it would no longer be possible to even visualize a metric if there are too many labels. A realistic case where this happened: https://github.com/elastic/integrations/pull/6171.

What started as the "Total sum" for every request ended up completely different.

This case is just an example of why:

  1. The dashboard becomes overly complicated and close to impossible to read.
  2. A visualization using `sum` on a counter (for example) loses its meaning, and there is no workaround.
  3. Overview dashboards, i.e., dashboards that use aggregations to visualize a metric without going into much depth by breaking the metric down by its labels, help give a clear overview of what is wrong. How do we do that now?
  4. We cannot automate dashboards, because there is no way to know what the new meaning of a visualization is when an aggregation becomes unsupported.
mlunadia commented 1 year ago

@martijnvg @felixbarny @lalit-satapathy to consider as part of the discussion earlier today.

tetianakravchenko commented 1 year ago

For the visualizations that were deleted from the k8s integration before the TSDB migration (https://github.com/elastic/integrations/pull/6171), this formula was used:

"formula": "(sum(kubernetes.apiserver.request.duration.us.sum) / sum(kubernetes.apiserver.request.duration.us.count)) / 1000"

to represent Average Apiserver Request Latency per Resource. It is no longer possible to use it, since both kubernetes.apiserver.request.duration.us.sum and kubernetes.apiserver.request.duration.us.count are counters.

elasticmachine commented 1 year ago

Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

dej611 commented 1 year ago

Partial overlap with #146733

constanca-m commented 1 year ago

It is happening again with AWS Fargate:

[image: AWS Fargate overview visualizations]

Both of these visualizations use counters to try to get an overview of these metrics.

What should the workaround be? How do we represent this now?

felixbarny commented 1 year ago

We'll probably want to show the rate of the counter, not the value of the counter.
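As a sketch of that direction (index and field names are assumed for illustration, not taken from this issue), Elasticsearch's `rate` aggregation inside a `date_histogram` turns a counter into throughput per unit of time:

```json
POST metrics-http/_search
{
  "size": 0,
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" },
      "aggs": {
        "req_per_min": {
          "rate": { "field": "http_requests_total", "unit": "minute" }
        }
      }
    }
  }
}
```

Note this is only a sketch: for `counter` fields the rate generally has to be computed per time series (e.g. per host) to account for counter resets, and Lens would still need a way to express it.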

lalit-satapathy commented 1 year ago

> Given these documents, our example user decided that they wanted to visualize the total sum of requests per timestamp: However, the user enabled TSDB and realized that sum is no longer supported, and they could only choose between max or last_value. Without much knowledge on the meaning of the visualization, they chose last_value:

Sum(count) does not make sense, because it is a counter. That is also not what is needed here. What we intend here is adding 4 separate values together using a mathematical sum.

Similar to below can be explored:

last_value(counter) for Method GET + last_value(counter) for Method POST + last_value(counter) for Method PUT + last_value(counter) for Method DELETE

An expression like Method:GET can be embedded in a KQL query. @constanca-m Can we explore this?
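A minimal sketch of that idea as a Lens formula, using the `kql` parameter to filter each term (the field name `count` is taken from the example at the top of this issue):

```
last_value(count, kql='method: GET')
  + last_value(count, kql='method: POST')
  + last_value(count, kql='method: PUT')
  + last_value(count, kql='method: DELETE')
```

Each `last_value` picks the latest counter value for one method, and the formula adds them mathematically.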

constanca-m commented 1 year ago

> last_value(counter) for Method GET + last_value(counter) for Method POST + last_value(counter) for Method PUT + last_value(counter) for Method DELETE

Problem 1: we have to specify the value of each label for this to work. What if we have several labels? In this case, host and method. If I have 2 hosts and 4 methods, then I have to explicitly set all of those values. How do I know in advance what the value of a host is?

Problem 2: if I try to filter a metric by a label value that does not exist, the visualization will break. Who's to say that we will have documents for each label at every timestamp?

lalit-satapathy commented 1 year ago

@felixbarny @martijnvg @giladgal

We finally have a simple example below which depicts the gap in counter aggregations that we currently have. Please advise how to solve the issue below.


Let's say there is a counter metric named `http_requests_total` with the dimension `host` (3 values: "host1", "host2", "host3"). This will create 3 counter time series, one for each of "host1", "host2" and "host3".

[screenshot: the three http_requests_total time series]

Sum(http_requests_total) does not make sense within a given time series; that is obvious, as we can't sum a counter.

However, we would like to find the http_requests_total at any given point of time for all the hosts combined. That is a valid usage.

This can be achieved by mathematically adding all the last values of http_requests_total, which is essentially:

`last_value(http_requests_total)(host:"host1") + last_value(http_requests_total)(host:"host2") + last_value(http_requests_total)(host:"host3")`

This is easy, since here we know all the dimension values and can mathematically add them.

But what if `host` can take a large list of values, we don't know in advance what they all are, and we would still like to add all the last values of http_requests_total? We should not have to know the values of `host` in advance to ask the question below:

> We would like to find the http_requests_total at any given point of time for all the hosts combined

What aggregations or Lens formula can we use in Elasticsearch/Kibana to get this data?

CC: @mlunadia @constanca-m @tetianakravchenko

felixbarny commented 1 year ago

> We would like to find the http_requests_total at any given point of time for all the hosts combined.

I agree that this should be possible using the time_series aggregation. @martijnvg please chime in if that's currently possible. But even if it is, there's a good chance that this isn't possible using Lens.

However, I'm challenging that this is what we should show in the dashboard. Instead of showing the total http requests for all hosts, we should rather show the total throughput in requests per minute across all hosts.
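For reference, one sketch of "latest value per host, summed across hosts" in plain query DSL (index and field names assumed; this relies on the counter only ever increasing, so `max` doubles as the latest value, and counter resets would break it):

```json
POST metrics-http/_search
{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host", "size": 1000 },
      "aggs": {
        "latest": { "max": { "field": "http_requests_total" } }
      }
    },
    "combined": {
      "sum_bucket": { "buckets_path": "per_host>latest" }
    }
  }
}
```

This sidesteps knowing the host values in advance, but it is not expressible as a plain Lens aggregation today, which is the gap this issue describes.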

dej611 commented 4 weeks ago

ES|QL will expose TSDB capabilities to solve this issue, so re-assigning to the ES|QL team.

elasticmachine commented 4 weeks ago

Pinging @elastic/kibana-esql (Team:ESQL)

stratoula commented 4 weeks ago

It doesn't make sense to move this to the ES|QL team. When TSDB is supported in ES|QL, this will be solved automatically. We can close it, but we don't add issues to the team's backlog that will be solved by ES|QL anyway.