elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[TSDB] Counter missing aggregations that are unsupported #161369

Open constanca-m opened 1 year ago

constanca-m commented 1 year ago

Summary

Currently, only `last_value` and `max` are supported for counter metrics when TSDB is enabled. However, this does not seem to be enough for "overview" visualizations. Example: for HTTP metrics, how do we build a visualization that provides this information?

Detailed example:

Let's consider an example data stream that expects documents with four parameters:

  1. timestamp: date
  2. method: keyword
  3. count: integer <- type counter
  4. host: keyword
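As a sketch, the corresponding TSDB setup for this example (template and index names are assumed here, not taken from the issue) would mark `method` and `host` as dimensions and `count` as a counter:

```json
PUT _index_template/example-tsdb
{
  "index_patterns": ["metrics-example-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["method", "host"]
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "method": { "type": "keyword", "time_series_dimension": true },
        "count": { "type": "integer", "time_series_metric": "counter" },
        "host": { "type": "keyword", "time_series_dimension": true }
      }
    }
  }
}
```

The `time_series_metric: counter` annotation is what restricts the aggregations Lens offers for `count`.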

For this example data stream we have these documents:

  1. Timestamp 1:

| Method | Count | Host   |
| ------ | ----- | ------ |
| GET    | 100   | host-1 |
| POST   | 53    | host-1 |
| PUT    | 2     | host-1 |
| DELETE | 23    | host-1 |

  2. Timestamp 2:

| Method | Count | Host   |
| ------ | ----- | ------ |
| GET    | 257   | host-1 |
| POST   | 60    | host-1 |
| PUT    | 13    | host-1 |
| DELETE | 28    | host-1 |

  3. Timestamp 3:

| Method | Count | Host   |
| ------ | ----- | ------ |
| GET    | 346   | host-1 |
| POST   | 83    | host-1 |
| PUT    | 37    | host-1 |
| DELETE | 52    | host-1 |

Given these documents, our example user decided that they wanted to visualize the total sum of requests per timestamp:

[image: total sum of requests per timestamp]

However, the user enabled TSDB and realized that `sum` is no longer supported; they could only choose between `max` and `last_value`. Without much knowledge of the meaning of the visualization, they chose `last_value`:

[image: last_value visualization]

This visualization is obviously not correct. The user needs to break it down by the field `method`, since that is what differs between those documents:

[image: last_value broken down by method]

The problem seems to be "fixed", although the meaning of the visualization has completely changed. But what if the index had two different hosts? If, for every timestamp, we received a document for each method, the customer would have to break down the visualization by yet another field: `host`. And now the visualization looks like this:

[image: last_value broken down by method and host]

And this breakdown could go on: the documents might be split not only by host and method, but also by status (200, 400, 404, etc.), for example. There is also a limit on breakdowns in Kibana (the limit is 4), so at some point it would no longer be possible to even visualize a metric if there are too many labels. A realistic case where this happened: https://github.com/elastic/integrations/pull/6171.

What started as the "Total sum" for every request ended up completely different.

This case is just an example of why:

  1. The dashboard becomes overly complicated and close to impossible to read.
  2. A visualization using `sum` on a counter (for example) loses its meaning, and there is no workaround.
  3. Overview dashboards, i.e., dashboards that use aggregations to visualize a metric without going into much depth by breaking the metric down by its labels, help give a clear overview of what is wrong. How do we do that now?
  4. We cannot automate dashboards, because there is no way to know what the new meaning of a visualization is when an aggregation becomes unsupported.
mlunadia commented 1 year ago

@martijnvg @felixbarny @lalit-satapathy to consider as part of the discussion earlier today.

tetianakravchenko commented 1 year ago

For the visualizations that were deleted from the k8s integration before the TSDB migration (https://github.com/elastic/integrations/pull/6171), this formula was used:

"formula": "(sum(kubernetes.apiserver.request.duration.us.sum) / sum(kubernetes.apiserver.request.duration.us.count)) / 1000"

to represent Average Apiserver Request Latency per Resource. It is no longer possible to use it, since both kubernetes.apiserver.request.duration.us.sum and kubernetes.apiserver.request.duration.us.count are counters.

elasticmachine commented 1 year ago

Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

dej611 commented 1 year ago

Partial overlap with #146733

constanca-m commented 1 year ago

It is happening again with AWS Fargate:

[image: AWS Fargate overview visualizations]

Both of these visualizations use counters to try to get an overview of these metrics.

What should the workaround be? How do we represent this now?

felixbarny commented 1 year ago

We'll probably want to show the rate of the counter, not the value of the counter.
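As a sketch of that direction (index and field names are assumed for illustration, not taken from this issue), Elasticsearch's `rate` aggregation inside a `date_histogram` turns a counter into throughput per unit of time:

```json
POST metrics-http/_search
{
  "size": 0,
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" },
      "aggs": {
        "req_per_min": {
          "rate": { "field": "http_requests_total", "unit": "minute" }
        }
      }
    }
  }
}
```

Note this is only a sketch: for `counter` fields the rate generally has to be computed per time series (e.g. per host) to account for counter resets, and Lens would still need a way to express it.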

lalit-satapathy commented 1 year ago

> Given these documents, our example user decided that they wanted to visualize the total sum of requests per timestamp: However, the user enabled TSDB and realized that sum is no longer supported, and they could only choose between max or last_value. Without much knowledge on the meaning of the visualization, they chose last_value:

Sum(count) does not make sense, because it is a counter. That is also not what is needed here. What we intend here is adding 4 separate values together using a mathematical sum.

Similar to below can be explored:

last_value(counter) for Method GET + last_value(counter) for Method POST + last_value(counter) for Method PUT + last_value(counter) for Method DELETE

An expression like Method:GET can be embedded in a KQL query. @constanca-m Can we explore this?
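A minimal sketch of that idea as a Lens formula, using the `kql` parameter to filter each term (the field name `count` is taken from the example at the top of this issue):

```
last_value(count, kql='method: GET')
  + last_value(count, kql='method: POST')
  + last_value(count, kql='method: PUT')
  + last_value(count, kql='method: DELETE')
```

Each `last_value` picks the latest counter value for one method, and the formula adds them mathematically.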

constanca-m commented 1 year ago

> last_value(counter) for Method GET + last_value(counter) for Method POST + last_value(counter) for Method PUT + last_value(counter) for Method DELETE

Problem 1: we have to specify the value of each label for this to work. What if we have several labels? In this case, host and method. If I have 2 hosts and 4 methods, then I have to explicitly set all of those values. How do I know in advance what the value of a host is?

Problem 2: if I try to filter a metric by a label value that does not exist, the visualization will break. Who's to say that we will have documents for each label at every timestamp?

lalit-satapathy commented 1 year ago

@felixbarny @martijnvg @giladgal

We finally have a simple example below which depicts the gap in counter aggregations that we currently have. Please advise how to solve the issue below.


Let's say there is a counter metric named `http_requests_total` with the dimension `host` (3 values: "host1", "host2", "host3"). This will create 3 counter time series, one for each of "host1", "host2" and "host3".

[screenshot: the three http_requests_total time series]

Sum(http_requests_total) does not make sense within a given time series; that is obvious, as we can't sum a counter.

However, we would like to find the http_requests_total at any given point of time for all the hosts combined. That is a valid usage.

This can be achieved by mathematically adding all the last values of http_requests_total, which is essentially:

`last_value(http_requests_total)(host:"host1") + last_value(http_requests_total)(host:"host2") + last_value(http_requests_total)(host:"host3")`

This is easy, since here we know all the dimension values and can mathematically add them.

But what if `host` can take a large list of values, we don't know in advance what they all are, and we would still like to add all the last values of http_requests_total? We should not have to know the values of `host` in advance to ask the question below:

> We would like to find the http_requests_total at any given point of time for all the hosts combined

What aggregations or Lens formula can we use in Elasticsearch/Kibana to get this data?

CC: @mlunadia @constanca-m @tetianakravchenko

felixbarny commented 1 year ago

> We would like to find the http_requests_total at any given point of time for all the hosts combined.

I agree that this should be possible using the time_series aggregation. @martijnvg please chime in if that's currently possible. But even if it is, there's a good chance that this isn't possible using Lens.

However, I'm challenging that this is what we should show in the dashboard. Instead of showing the total http requests for all hosts, we should rather show the total throughput in requests per minute across all hosts.
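For reference, one sketch of "latest value per host, summed across hosts" in plain query DSL (index and field names assumed; this relies on the counter only ever increasing, so `max` doubles as the latest value, and counter resets would break it):

```json
POST metrics-http/_search
{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host", "size": 1000 },
      "aggs": {
        "latest": { "max": { "field": "http_requests_total" } }
      }
    },
    "combined": {
      "sum_bucket": { "buckets_path": "per_host>latest" }
    }
  }
}
```

This sidesteps knowing the host values in advance, but it is not expressible as a plain Lens aggregation today, which is the gap this issue describes.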

dej611 commented 4 weeks ago

ES|QL will expose TSDB capabilities to solve this issue, so re-assigning to the ES|QL team.

elasticmachine commented 4 weeks ago

Pinging @elastic/kibana-esql (Team:ESQL)

stratoula commented 4 weeks ago

It doesn't make sense to move this to the ES|QL team. When TSDB is supported in ES|QL, this will be solved automatically. We can close it, but we don't add issues to the team's backlog that will be solved by ES|QL anyway.