free / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

Functional difference between increase and xincrease #10

Closed. Thomath closed this issue 4 years ago.

Thomath commented 5 years ago

Hello free,

First of all, I want to thank you very much for all your efforts in the Prometheus community!

As I'm using a Prometheus+Grafana setup myself right now, mostly with counters, I have to rely on the counter functions, especially the increase function, because I want to see the "real differences" of my counters. In that context I came across your fork with the xincrease function and the underlying problem of the original increase function. Yet I'm not sure if I fully understand your implementation of the function. I had a look at that blog post, which illustrates the same problem with the following example:

The rate function uses the rate interval to extract subsets of samples, as we've previously determined. With a rate interval of 60 seconds, and an original scrape interval of 15 seconds, it would look at sets of 4 samples at a time. It has to do an extrapolation, because strictly speaking, 4 samples covers a 45 second interval (3 gaps of 15 seconds each, another manifestation of the 'number of fenceposts' problem). It does this by assuming that the rate of change over the 45 seconds would extend to the full 60 seconds. The trouble comes when our step is also 60 seconds. Consider the set of samples:

t: 4 4 4 4 6 6 6 6 9 10 12 13

The following would happen for increase[60s]:

increase[t1] = (4-4) + (4-4) + (4-4) = 0
increase[t2] = (6-6) + (6-6) + (6-6) = 0
increase[t3] = (10-9) + (12-10) + (13-12) = 4
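To make the arithmetic of the quoted example concrete, here is a little sketch of how I read it (it ignores Prometheus's extrapolation and just sums the deltas inside each non-overlapping 60s window):

```go
package main

import "fmt"

func main() {
	// Counter samples from the example above, scraped every 15s.
	samples := []float64{4, 4, 4, 4, 6, 6, 6, 6, 9, 10, 12, 13}

	// With a 60s range and a 60s step, each evaluation only sees 4
	// consecutive samples and only the deltas inside that window count.
	total := 0.0
	for w := 0; w+4 <= len(samples); w += 4 {
		inc := samples[w+3] - samples[w] // last minus first sample in the window
		total += inc
		fmt.Printf("increase[t%d] = %g\n", w/4+1, inc)
	}
	fmt.Printf("sum of windows = %g, actual increase = %g\n",
		total, samples[len(samples)-1]-samples[0])
}
```

It prints 0, 0 and 4 for the three windows, even though the counter went from 4 to 13 overall.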

I would lose the information between those windows. So what exactly does your implementation do differently? As I understand it, it takes one extra sample at the beginning and one at the end, so the result would be:

increase[t1] = (4-x) + (4-4) + (4-4) + (4-4) + (6-4)
increase[t2] = (6-4) + (6-6) + (6-6) + (6-6) + (9-6)
increase[t3] = (9-6) + (10-9) + (12-10) + (13-12) + (y-13)

Is that what happens? Furthermore, if that is the case, what happens if the Grafana step size is bigger than the window size of the increase function? Wouldn't that lead to the same result as the standard increase function, just with more values taken into consideration?

Sorry for the rather long question for this simple issue. Thank you very much in advance.

free commented 5 years ago

As I understand it, it takes one extra sample at the beginning and one at the end, so the result would be:

increase[t1] = (4-x) + (4-4) + (4-4) + (4-4) + (6-4)
increase[t2] = (6-4) + (6-6) + (6-6) + (6-6) + (9-6)
increase[t3] = (9-6) + (10-9) + (12-10) + (13-12) + (y-13)

Is that what happens?

That is exactly what happens, yes.
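To put concrete numbers on the middle window (just a throwaway sketch, not the actual xincrease implementation): extending the 60s range that covers the 6,6,6,6 run by one sample on each side makes the boundary deltas count again.

```go
package main

import "fmt"

func main() {
	// Same counter samples, scraped every 15s.
	samples := []float64{4, 4, 4, 4, 6, 6, 6, 6, 9, 10, 12, 13}

	// The second 60s window covers samples[4..7] (the 6,6,6,6 run).
	// Extending it by one sample on each side adds samples[3] and
	// samples[8], so the deltas at the window edges are counted too.
	lo, hi := 4, 7
	inc := 0.0
	for i := lo - 1; i <= hi; i++ {
		inc += samples[i+1] - samples[i]
	}
	// Adjacent deltas telescope to "sample after the window minus
	// sample before the window".
	fmt.Printf("increase[t2] = %g (= %g - %g)\n", inc, samples[hi+1], samples[lo-1])
}
```

The adjacent deltas telescope to "sample just after the range minus sample just before it", i.e. 9 - 4 = 5, so nothing that happens between two adjacent windows is dropped.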

Furthermore, if that is the case, what happens if the Grafana step size is bigger than the window size of the increase function? Wouldn't that lead to the same result as the standard increase function, just with more values taken into consideration?

Correct again. But I would like to add a couple of observations.

First, Grafana has a very useful $__interval variable that you can use in your dashboards. For something like a disk_usage gauge, you could display max_over_time(disk_usage[$__interval]) to make sure you include all samples, instead of merely downsampling the time series and possibly missing significant spikes (as you would if you simply displayed disk_usage). This includes every single sample exactly once (unless one happens to fall on the exact second between two adjacent ranges). There is no equivalent for rate() and increase(), hence xrate() and xincrease(), which behave even better than the <aggregation>_over_time functions, as every increase is included only once, even if a sample happens to fall exactly on a range boundary.
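As a toy illustration of the downsampling point (made-up gauge values, not Grafana or Prometheus code):

```go
package main

import "fmt"

func main() {
	// A gauge scraped every 15s, with a short spike at t=75s
	// (made-up numbers, just to illustrate the point).
	times := []int{0, 15, 30, 45, 60, 75, 90, 105, 120}
	gauge := []float64{10, 10, 11, 10, 10, 95, 10, 11, 10}

	// Plotting the instant value at a 60s step only looks at t=0, 60
	// and 120, so the spike never shows up.
	for _, T := range []int{0, 60, 120} {
		for i, t := range times {
			if t == T {
				fmt.Printf("instant value  t=%3ds -> %g\n", T, gauge[i])
			}
		}
	}

	// max_over_time(gauge[60s]) at the same step inspects every sample
	// in each (T-60s, T] range, so the t=120 point carries the spike.
	for _, T := range []int{60, 120} {
		maxVal := 0.0
		for i, t := range times {
			if t > T-60 && t <= T && gauge[i] > maxVal {
				maxVal = gauge[i]
			}
		}
		fmt.Printf("max_over_time  t=%3ds -> %g\n", T, maxVal)
	}
}
```

The instant query shows a flat 10 at every step, while the max_over_time version carries the spike into the t=120 point.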

Second, even if you weren't using Grafana's $__interval, you can still get pretty big aliasing artifacts from using rate() or increase() with poorly chosen ranges (and even with well chosen ranges, as long as they are not many multiples of the smallest useful range). E.g. if you have a counter that only increases once a minute but you're collecting it more often than that, then a rate[1m] may happen to cover the increase (in which case the extrapolation will over-estimate the rate/increase) or it may happen to miss the increase (in which case you'll get a series of zeroes). If the process that produces said counter restarts or gets delayed, you may even get a graph that is e.g. 1.2 (for an actual increase of 1) for half the time and 0 for the other half. Even if you increase the range to, say, 5m, you'd still get 5.04 for the left half and 4.04 for the right half instead of what should be a continuous 4 with maybe a falling knife in the middle.
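Here's a rough simulation of that aliasing (again just a sketch: it computes the plain in-window increase and leaves out the extrapolation, which would further scale the non-zero values up): a counter bumped once a minute, scraped every 15s, with a 1m range evaluated at a 60s step at two different phases.

```go
package main

import "fmt"

// valueAt returns the counter value at scrape time t (in seconds): the
// counter is bumped by 1 once a minute and scraped every 15 seconds.
func valueAt(t int) float64 { return float64(t / 60) }

func main() {
	// Plain in-window increase over the range (T-60s, T]: last sample in
	// the range minus first sample in the range. The real increase()
	// would additionally scale this up by its extrapolation factor.
	inWindowIncrease := func(T int) float64 {
		var first, last float64
		seen := false
		for t := 0; t <= T; t += 15 {
			if t > T-60 {
				if !seen {
					first, seen = valueAt(t), true
				}
				last = valueAt(t)
			}
		}
		return last - first
	}

	// Evaluations aligned with the once-a-minute bump straddle it...
	for T := 60; T <= 240; T += 60 {
		fmt.Printf("aligned T=%3ds increase[1m] = %g\n", T, inWindowIncrease(T))
	}
	// ...while evaluations shifted by 50s never have samples on both
	// sides of the bump inside one window, so they report 0.
	for T := 110; T <= 290; T += 60 {
		fmt.Printf("shifted T=%3ds increase[1m] = %g\n", T, inWindowIncrease(T))
	}
}
```

One phase reports 1 at every point and the other reports 0 at every point; with extrapolation on top, the first phase would typically come out above 1, and a restart or delay of the producing process shifts the phase mid-graph, which is how you end up with the half-and-half picture described above.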

My position on this is that throwing away data that's actually available and replacing it with an estimate (rate's extrapolation) can't possibly produce better results than using said data when available.