@vishnuchalla @rsevilla87 @jtaleric FYI
Hmm, this is interesting. I've dug a little into the *_over_time() functions, and I think we can use them by combining instant queries with a subquery time range (the colon notation) in order to get the values aggregated up to the passed timestamp:
Example:
$ curl -s 'http://demo.robustperception.io:9090/api/v1/query?query=avg_over_time%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%5B10m%3A%5D%29&time=1691492504' | jq .
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "cpu": "0",
          "env": "demo",
          "instance": "demo.do.prometheus.io:9100",
          "job": "node",
          "mode": "idle"
        },
        "value": [
          1691492504,
          "0.8361428571421476"
        ]
      }
    ]
  }
}
The above URL-encoded query is actually avg_over_time(rate(node_cpu_seconds_total{mode="idle"}[2m])[10m:])
I still have to verify whether this value is valid or not
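For reference, the same instant query can also be issued programmatically. Here is a minimal sketch using the official client_golang library (the endpoint, expression, and timestamp are the ones from the curl call above; error handling is kept to a minimum):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://demo.robustperception.io:9090"})
	if err != nil {
		panic(err)
	}
	// The subquery [10m:] makes the instant query aggregate the inner
	// rate() over the 10 minutes leading up to the evaluation timestamp.
	query := `avg_over_time(rate(node_cpu_seconds_total{mode="idle"}[2m])[10m:])`
	result, warnings, err := v1.NewAPI(client).Query(context.Background(), query, time.Unix(1691492504, 0))
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // one averaged sample per matching series
}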
More tests:
Query: max_over_time(rate(node_forks_total[2m])[5h:])
# Using an instant query
$ curl -s 'http://demo.robustperception.io:9090/api/v1/query?query=max_over_time%28rate%28node_forks_total%5B2m%5D%29%5B5h%3A%5D%29&time=1691492544' | jq .
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "env": "demo",
          "instance": "demo.do.prometheus.io:9100",
          "job": "node"
        },
        "value": [
          1691492544,
          "2.400068573387811"
        ]
      }
    ]
  }
}
And now, using the Prometheus GUI, perform a query_range and verify that the value is present among the plotted datapoints.
Both values match; hence, I think we can replace the kube-burner report's expressions with these.
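That cross-check can also be scripted instead of eyeballed in the GUI. A sketch (reusing the client from the previous snippet, plus the github.com/prometheus/common/model package) that pulls the query_range datapoints and compares their maximum against the instant max_over_time value; note the two only line up exactly when the subquery resolution matches the range step:

// Verify the instant max_over_time value against the maximum of the
// datapoints returned by a query_range over the same 5h window.
end := time.Unix(1691492544, 0)
r := v1.Range{Start: end.Add(-5 * time.Hour), End: end, Step: 30 * time.Second}
rangeRes, _, err := v1.NewAPI(client).QueryRange(context.Background(), `rate(node_forks_total[2m])`, r)
if err != nil {
	panic(err)
}
var maxSeen model.SampleValue
for _, series := range rangeRes.(model.Matrix) {
	for _, point := range series.Values {
		if point.Value > maxSeen {
			maxSeen = point.Value
		}
	}
}
fmt.Println("max over plotted datapoints:", maxSeen) // expect ~2.400068573387811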
Practical example
In a hypothetical workload that lasted 3721 seconds and finished at Tue Aug 8 01:57:42,
we could calculate the average node CPU usage over the workload duration with a PromQL query like:
GET
query=avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]))[3721s:])&time=1691452662
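If we wired this into code, the subquery window could be derived directly from the workload's elapsed time. A small sketch, again assuming the client_golang setup from above (the start timestamp is hypothetical; everything else comes from the GET above):

// Derive the subquery window from the workload duration and evaluate the
// instant query at the workload end time, so [3721s:] spans exactly the workload.
start := time.Unix(1691448941, 0) // hypothetical workload start
end := time.Unix(1691452662, 0)   // workload end, as in the example above
window := int64(end.Sub(start).Seconds()) // 3721
query := fmt.Sprintf(`avg_over_time(sum(irate(node_cpu_seconds_total{mode!="idle", mode!="steal"}[2m]))[%ds:])`, window)
result, _, err := v1.NewAPI(client).Query(context.Background(), query, end)
if err != nil {
	panic(err)
}
fmt.Println(result)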
Yes, we can fetch the aggregations over time, and also within a given time range:
curl -kH "Authorization: Bearer TOKEN" 'PROM_URL/api/v1/query?query=avg_over_time(sum(irate(node_cpu_seconds_total[2h]))[2h:1h])'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1691553906.635,"103.71599999999302"]}]}}
But the only thing missing from the existing function is the step (sample collection interval), which I don't think we need here while calculating aggregations. Correct me if I am wrong!
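For context, the subquery syntax is [<range>:<resolution>], and the resolution part is optional: when omitted, Prometheus falls back to the global evaluation interval, so an explicit step shouldn't be required for these aggregations. Two illustrative variants of the query above:

# Explicit 1h resolution: the inner expression is re-evaluated every hour.
avg_over_time(sum(irate(node_cpu_seconds_total[2h]))[2h:1h])
# Omitted resolution: falls back to the global evaluation interval.
avg_over_time(sum(irate(node_cpu_seconds_total[2h]))[2h:])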
With the two previous PRs, I think we can close this one.
Looking at https://github.com/cloud-bulldozer/go-commons/blob/main/prometheus/prometheus.go#L89, which is used for the kube-burner ocp wrapper reporting mode, it looks like we are extracting all the datapoints of the timeseries and then feeding them into methods in the stats package according to the required aggregation. Prometheus supports these aggregations natively (https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time), and we should switch over to using them for better consistency and confidence in our aggregations.
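To make the proposal concrete, here is a rough sketch of what the switch could look like (the helper name and signature are hypothetical, not the actual go-commons API): instead of fetching the full matrix and reducing it client-side in the stats package, wrap the expression in the native *_over_time function and let Prometheus do the reduction server-side.

// Hypothetical helper, not the real go-commons signature: evaluate a
// native *_over_time aggregation (agg is "avg", "max", "min", ...) as a
// single instant query at the end of the measurement window, instead of
// pulling every datapoint and aggregating client-side.
func aggregateOverTime(ctx context.Context, papi v1.API, agg, expr string, start, end time.Time) (model.Value, error) {
	window := int64(end.Sub(start).Seconds())
	query := fmt.Sprintf("%s_over_time((%s)[%ds:])", agg, expr, window)
	result, _, err := papi.Query(ctx, query, end)
	return result, err
}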