fuslwusl / homeassistant-addon-victoriametrics

VictoriaMetrics Add-on for Home Assistant OS is the perfect solution for long term data storage of your smart home sensor data and visualization with Grafana.

Some incorrect assumptions in timeseries processing. #44

Open dbaarda opened 2 weeks ago

dbaarda commented 2 weeks ago

I wish people writing timeseries databases would read and understand all the rrdtool documentation first. That tool contains a huge amount of accumulated knowledge/wisdom about how to handle timeseries efficiently and accurately.

In https://docs.victoriametrics.com/keyconcepts/#instant-query the documentation talks about how queries for timeseries values at a particular time look for the datapoint before that time.

However, in general a timeseries datapoint better represents the time period before it, so it's more accurate to use the datapoint after the requested time. For rate timeseries with points at regular intervals, each point represents the average rate over the period between it and the previous point, so using the rate datapoint after the requested time gives you the best possible instantaneous rate estimate for that time. However, if you are querying a rate1m timeseries (points every minute, each representing the average rate over the previous minute) and want the best rate1m at a particular time (the average rate over the minute before that time), then the most accurate estimate you can get is by linearly interpolating between the points after and before the requested time.

This "datapoints represent the time before the datapoint, not after" behaviour is definitely true for rates, but is also generally true for gauges and counters too. In general linearly interpolating between points gives the most accurate estimate.

When there are missing datapoints it becomes even more crucial to do this right, as you are approximating over longer time intervals, so using the wrong approximation makes the answer even more wrong. Just returning the value of the point before the requested time as if it was at the requested time is time-shifting the value to a later time by an arbitrary amount, up to your step interval. It would be better to provide the actual timestamp in the answer, but I'm not sure frontends would be happy with that.

If the timeseries lookups are time-shifting datapoints like this by arbitrary amounts, then any query that combines different timeseries will be using values taken at different times. This can be the source of many problems, like error rates greater than 100%.
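As a hypothetical illustration of that failure mode (made-up numbers, not real VictoriaMetrics behaviour or data): suppose every request to a service fails, so the error ratio can never legitimately exceed 100%, but the errors counter happens to be scraped a few seconds after the requests counter:

# (timestamp, cumulative count); 10 more failing requests arrive in the 5
# seconds between the two scrapes.
requests = [(60, 50), (120, 100)]
errors   = [(65, 55), (125, 110)]

def last_before(samples, t):
    # The lookup behaviour described above: take the value of the latest
    # sample at or before the requested time.
    value = None
    for ts, v in samples:
        if ts <= t:
            value = v
    return value

t = 126
print(last_before(errors, t) / last_before(requests, t))
# -> 1.1, i.e. a 110% "error rate", because the numerator was sampled at
# t=125 and the denominator at t=120.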

Doing this correctly would solve many of the problems that https://docs.victoriametrics.com/keyconcepts/#query-latency is trying to address.

valyala commented 2 weeks ago

Agreed with the reasoning. Unfortunately, the current behaviour cannot be changed in existing software like VictoriaMetrics, since this would break the expectations of the many users who rely on the current behaviour :(

On the other hand, VictoriaMetrics could accept an option at startup or at query time which would tell it to shift the calculation time by step, in order to return the expected result.

@dbaarda , could you file a feature request at VictoriaMetrics issue tracker, so it could be tracked and prioritized?

As a temporary workaround, it is possible to add the needed time shift manually via the offset modifier. For example, the following query should return the expected average per-second increase rate over the time range (time ... time+step] when the query is executed at timestamp=time:

rate(m[1i] offset -1i)

The 1i is automatically converted to the step between points on the graph, so with a 1m step this query is equivalent to rate(m[1m] offset -1m).

dbaarda commented 2 weeks ago

Thanks for the quick followup.

Using the offset modifier gives you a fixed offset that kind-of compensates for the "point represents the interval before the sample, not after" problem, but because the offset is fixed it doesn't compensate for variations in the sampling interval, or for differences in the sampling time offsets of the different timeseries being combined in operations. That means it cannot fix the >100% error artifacts, which are far more obvious than a 1m timeshift.
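Continuing the made-up errors/requests numbers from the earlier sketch (again hypothetical, not real VictoriaMetrics output): shifting both lookups by the same fixed amount leaves the 5-second phase difference between the two scrapes untouched, so the impossible ratio survives the shift:

requests = [(60, 50), (120, 100)]
errors   = [(65, 55), (125, 110)]

def last_before(samples, t):
    value = None
    for ts, v in samples:
        if ts <= t:
            value = v
    return value

step = 60
t = 126
shifted = t + step   # a fixed shift applied identically to both series
print(last_before(errors, shifted) / last_before(requests, shifted))
# -> still 1.1: the two samples being paired are still 5 seconds apart.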

So I'm not sure just adding a time shift is actually going to make things better.