cloud-bulldozer / go-commons

Code repository with all Go common packages and libraries
Apache License 2.0

Use native prometheus aggregations instead of stats package #30

Closed: smalleni closed this issue 10 months ago

smalleni commented 11 months ago

Looking at https://github.com/cloud-bulldozer/go-commons/blob/main/prometheus/prometheus.go#L89, which is used for the kube-burner OCP wrapper reporting mode, it looks like we are extracting all the datapoints of the time series and then feeding them into methods of the stats package according to the aggregation required. Prometheus supports these aggregations natively (https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time), and we should switch over to using those for better consistency and confidence in our aggregations.
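To make the idea concrete, here is a minimal sketch of what a natively aggregated query could look like using the upstream client_golang v1 API directly (not the go-commons wrapper); the address, expression, and timestamp below are placeholders:

```go
// Sketch only: run a single instant query whose expression already embeds the
// aggregation, instead of fetching every datapoint of a range query and
// averaging it client-side with the stats package.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://demo.robustperception.io:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// The subquery range ([10m:]) is the window we would otherwise iterate over in Go.
	query := `avg_over_time(rate(node_cpu_seconds_total{mode="idle"}[2m])[10m:])`
	end := time.Now() // in kube-burner this would be the job end timestamp

	result, warnings, err := promAPI.Query(context.Background(), query, end)
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// The result is an instant vector with one pre-aggregated value per series.
	if vector, ok := result.(model.Vector); ok {
		for _, sample := range vector {
			fmt.Printf("%v => %v\n", sample.Metric, sample.Value)
		}
	}
}
```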

smalleni commented 11 months ago

@vishnuchalla @rsevilla87 @jtaleric FYI

rsevilla87 commented 11 months ago

mm, this is interesting, I've dug a little into the avg_over_time() function and I think we can use it by combining instant queries with a subquery range (the colon notation) in order to get the aggregated values up to the passed timestamp:

Example:

$ curl -s 'http://demo.robustperception.io:9090/api/v1/query?query=avg_over_time%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%5B10m%3A%5D%29&time=1691492504'    | jq .
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "cpu": "0",
          "env": "demo",
          "instance": "demo.do.prometheus.io:9100",
          "job": "node",
          "mode": "idle"
        },
        "value": [
          1691492504,
          "0.8361428571421476"
        ]
      }
    ]
  }
}

The URL-encoded query above is actually avg_over_time(rate(node_cpu_seconds_total{mode="idle"}[2m])[10m:])
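For reference, the encoded form can be reproduced with Go's net/url (purely illustrative; client_golang performs this encoding internally when it sends the query):

```go
// Illustrative only: rebuild the URL-encoded instant query shown above.
package main

import (
	"fmt"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Set("query", `avg_over_time(rate(node_cpu_seconds_total{mode="idle"}[2m])[10m:])`)
	params.Set("time", "1691492504")
	// Encode() percent-escapes the expression, producing the same query string as the curl call above.
	fmt.Println("http://demo.robustperception.io:9090/api/v1/query?" + params.Encode())
}
```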

rsevilla87 commented 11 months ago

I still have to verify whether this value is valid or not

rsevilla87 commented 11 months ago

More tests:

Query: max_over_time(rate(node_forks_total[2m])[5h:])

# Using an instant query
$ curl -s 'http://demo.robustperception.io:9090/api/v1/query?query=max_over_time%28rate%28node_forks_total%5B2m%5D%29%5B5h%3A%5D%29&time=1691492544'    | jq .
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "env": "demo",
          "instance": "demo.do.prometheus.io:9100",
          "job": "node"
        },
        "value": [
          1691492544,
          "2.400068573387811"
        ]
      }
    ]
  }
}

And now, using the Prometheus GUI to perform a query_range and verify that the value is present among the plotted datapoints: [screenshot of the query_range plot omitted]

Both values match, hence I think we can replace the kube-burner report's expressions with these.


Practical example

In a hypothetical workload that lasted 3721 seconds and finished at Tue Aug 8 01:57:42, we could calculate the average node CPU usage over the workload duration with a PromQL query like:

GET /api/v1/query
query=avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]))[3721s:])&time=1691452662
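
A sketch of this practical example with the upstream client_golang v1 API; the Prometheus address and the jobStart/jobEnd variables are placeholders (in kube-burner they would come from the job metadata):

```go
// Sketch of the practical example above: aggregate node CPU usage over the
// exact workload window by deriving the subquery range from the job duration.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	jobEnd := time.Unix(1691452662, 0)          // Tue Aug 8 01:57:42
	jobStart := jobEnd.Add(-3721 * time.Second) // workload lasted 3721 seconds
	rangeSeconds := int(jobEnd.Sub(jobStart).Seconds())

	// The subquery range matches the workload duration.
	query := fmt.Sprintf(
		`avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]))[%ds:])`,
		rangeSeconds,
	)

	client, err := api.NewClient(api.Config{Address: "http://prometheus.example.com:9090"})
	if err != nil {
		panic(err)
	}
	// Evaluating the instant query at jobEnd aggregates exactly over the workload window.
	result, _, err := v1.NewAPI(client).Query(context.Background(), query, jobEnd)
	if err != nil {
		panic(err)
	}
	fmt.Println(result)
}
```
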
vishnuchalla commented 11 months ago

Yes, we can fetch the aggregations over time and also within a given time range:

curl -kH "Authorization: Bearer TOKEN" 'PROM_URL/api/v1/query?query=avg_over_time(sum(irate(node_cpu_seconds_total[2h]))[2h:1h])'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1691553906.635,"103.71599999999302"]}]}}

But the only thing missing from the existing function is the step (sample collection interval), which I don't think we need here while calculating aggregations. Correct me if I am wrong!
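Per the Prometheus subquery documentation, the value after the colon in [<range>:<resolution>] is the subquery resolution and it is optional; when omitted, as in [2h:], it falls back to the global evaluation interval, so the wrapper should not need to pass a step explicitly. A tiny illustrative helper (the buildQuery name is made up):

```go
// Illustrative only: the optional part after the colon in a subquery is its
// resolution (step); leaving it empty falls back to Prometheus' global
// evaluation interval.
package main

import "fmt"

// buildQuery is a hypothetical helper, not part of go-commons.
func buildQuery(window, step string) string {
	return fmt.Sprintf("avg_over_time(sum(irate(node_cpu_seconds_total[2m]))[%s:%s])", window, step)
}

func main() {
	fmt.Println(buildQuery("2h", ""))    // [2h:]    -> default resolution
	fmt.Println(buildQuery("2h", "30s")) // [2h:30s] -> explicit 30s resolution
}
```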

rsevilla87 commented 10 months ago

With the two previous PRs, I think we can close this one.