Bug when combining downsampling and rate calculation with counters

IzakMarais commented 9 years ago

Based on the following recommendation in the documentation, we recently switched from doing our own rate calculation prior to submitting data points to letting OpenTSDB handle rate calculations:

If something is a counter, or is naturally something that is a rate, don't convert it to a rate before sending it to the TSD. ...

However, we now see a bug in OpenTSDB's implementation that is clearly visible in the following Grafana graphs.

We have the following raw counter data:

[image: raw data]

If we use OpenTSDB's rate function, we get the (correct) output like so:

[image: rate calculation only]

However, if we ask OpenTSDB to downsample the data, sometimes the sampling point can intersect the counter rollover interval:

[image: down sample only]

As expected, OpenTSDB's rate calculations do not handle this point well; applying the rate calculation after the downsampling gives an incorrect spike:

[image: down sample followed by rate]

Perhaps the order outlined in the documentation:

Grouping, Down Sampling, Interpolation, Aggregation, Rate Calculation,

should be different? It looks like rate calculation should be earlier in the pipeline, before downsampling.
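To make the ordering problem concrete, here is a minimal Python sketch. It is not OpenTSDB's code: `counter_rate`, `downsample_avg`, the 16-bit rollover value, and the sample data are all assumptions chosen for illustration. It compares downsampling before the rate calculation with the reverse order on a counter that wraps once.

```python
COUNTER_MAX = 65535  # assumed rollover value for this toy counter


def counter_rate(points, counter_max=COUNTER_MAX):
    """Per-second rate between consecutive (timestamp, value) points.

    A decrease between two samples is treated as a single counter rollover.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        delta = v1 - v0
        if delta < 0:  # the counter wrapped between the two samples
            delta += counter_max + 1
        rates.append((t1, delta / (t1 - t0)))
    return rates


def downsample_avg(points, interval):
    """Average all values whose timestamps fall in the same bucket."""
    buckets = {}
    for t, v in points:
        buckets.setdefault(t - t % interval, []).append(v)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# A counter growing by 100 per second, sampled every 10 s, wrapping once
# between t=90 and t=100.
raw = [(t, (56000 + 100 * t) % (COUNTER_MAX + 1)) for t in range(0, 200, 10)]

# Order in question: downsample first, then compute the rate.
downsample_then_rate = counter_rate(downsample_avg(raw, 40))

# Proposed order: compute the rate first, then downsample.
rate_then_downsample = downsample_avg(counter_rate(raw), 40)

print(downsample_then_rate)
print(rate_then_downsample)
```

In this toy data, the bucket that straddles the rollover averages pre-wrap and post-wrap values into a number that never existed on the counter, so the rollover correction turns it into a rate of about 919 per second on either side of that bucket. Computing the rate first yields a steady 100 per second in every bucket.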

kylebrandt commented 9 years ago

Related is #462 which is another issue with downsampling due to the order of operations

kylebrandt commented 9 years ago

@manolama Pretty curious to hear your thoughts on this one. The order of operations here is pretty important to me since most of our data is counters, so I need to get an idea of the direction OpenTSDB is going to go with this.

oozie commented 9 years ago

(I was planning to fix this but not before #465 gets merged.)

manolama commented 9 years ago

Playing around with a spreadsheet, it looks like moving rates before downsampling would get us closer to the truth, though you'll still have those spikes until #465. https://docs.google.com/spreadsheets/d/1Wr0w3XGrQBpkJbgsCBo8yptOdMuQU9tuMGSAC-VHSqs/edit?usp=sharing The biggest drawback is that when aggregating series with a high cardinality, we'll perform many more calculations. I'll keep poking at the spreadsheet with other examples but let me know what y'all think. Thanks.

kylebrandt commented 9 years ago

Being able to get downsampling with counters is pretty high on my checklist. For instance, if I want to get an idea of resource utilization, I might want to look over a few weeks and see what the max was every 10 minutes or so.

So it is high impact for me not to be able to display graphs like that without slurping all the data into something like pandas and doing my downsampling there.

IzakMarais commented 9 years ago

Playing around with a spreadsheet, it looks like moving rates before downsampling would get us closer to the truth, though you'll still have those spikes until

It looks like you are not using the counter feature in those spreadsheets to suppress the spikes? I updated the issue title to reflect the actual bug. When using OpenTSDB's counters feature, I expect the spikes to be suppressed, but this bug causes them not to be.

However, if you expect #465 to also solve this, I would be fine with that. (We have started avoiding rates altogether, since having to specify the counter rollover value adds too much of a burden when creating graphs with grafana.)

kylebrandt commented 9 years ago

@manolama Also, won't this mess up aggregation with Max as well? It is just going to pick whatever counter is highest, which won't reflect the "top" rate, but rather whichever count happens to be highest due to uptime?
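A small Python sketch of that concern, using made-up data and hypothetical helpers (`rate`, `aggregate_max`) rather than anything from OpenTSDB: one host has a large counter that grows slowly, the other a small counter that grows quickly.

```python
def rate(points):
    """Per-second rate between consecutive (timestamp, value) points (no rollover handling)."""
    return [(t1, (v1 - v0) / (t1 - t0)) for (t0, v0), (t1, v1) in zip(points, points[1:])]


def aggregate_max(*series):
    """Point-wise max across series that share identical timestamps."""
    return [(pts[0][0], max(v for _, v in pts)) for pts in zip(*series)]


timestamps = range(0, 60, 10)
# Long uptime: large counter value, but only 10 events per second.
host_a = [(t, 1_000_000 + 10 * t) for t in timestamps]
# Recently restarted: small counter value, but 500 events per second.
host_b = [(t, 500 * t) for t in timestamps]

# Aggregate the raw counters with max, then take the rate.
max_then_rate = aggregate_max(host_a, host_b)
max_then_rate = rate(max_then_rate)

# Take each host's rate first, then aggregate with max.
rate_then_max = aggregate_max(rate(host_a), rate(host_b))

print(max_then_rate)
print(rate_then_max)
```

With these numbers, aggregating first always selects `host_a` because its raw count is larger, so the resulting rate is 10 per second. Taking rates first reports 500 per second, which is the "top" rate the question is after.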

tsuna commented 9 years ago

For reference, see the thread "Counters, Order of Operations, and Results that nobody would expect" on the mailing list.

zeph commented 8 years ago

+1

same issue here! investigating "network counters" I need to be able to see the spikes, and instead on the downsampling they get flattened out...

kylebrandt commented 8 years ago

@manolama Any updates on this or #462 ?

OpenTSDB / opentsdb

Bug when combining downsampling and rate calculation with counters #476