Open IzakMarais opened 9 years ago
Related is #462 which is another issue with downsampling due to the order of operations
@manolama Pretty curious to hear your thoughts on this one. The order of operations here is pretty important to me since most of our data is counters, so I need to get an idea of the direction OpenTSDB is going to go with this.
(I was planning to fix this but not before #465 gets merged.)
Playing around with a spreadsheet, it looks like moving rates before downsampling would get us closer to the truth, though you'll still have those spikes until #465. https://docs.google.com/spreadsheets/d/1Wr0w3XGrQBpkJbgsCBo8yptOdMuQU9tuMGSAC-VHSqs/edit?usp=sharing The biggest drawback is that when aggregating series with high cardinality, we'll perform many more calculations. I'll keep poking at the spreadsheet with other examples, but let me know what y'all think. Thanks.
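To make the order-of-operations difference concrete, here's a minimal, self-contained sketch (invented data, a naive averaging downsampler, and no counter-rollover handling) comparing downsample-then-rate against rate-then-downsample around a counter reset:

```python
# Minimal sketch: why the order of rate vs. downsample matters for counters.
# All names, data, and the averaging downsampler are invented for illustration.

def rate(points):
    """Per-second rate between consecutive (timestamp, value) points."""
    return [(t2, (v2 - v1) / (t2 - t1))
            for (t1, v1), (t2, v2) in zip(points, points[1:])]

def downsample_avg(points, interval):
    """Average the values that fall into each fixed-size time bucket."""
    buckets = {}
    for t, v in points:
        buckets.setdefault(t - t % interval, []).append(v)
    return [(t, sum(vs) / len(vs)) for t, vs in sorted(buckets.items())]

# A counter climbing at 10/s that resets to 0 at t=30.
raw = [(0, 0), (10, 100), (20, 200), (30, 0), (40, 100), (50, 200)]

ds_then_rate = rate(downsample_avg(raw, 20))
rate_then_ds = downsample_avg(rate(raw), 20)

# Downsampling first smears the reset across buckets: every rate is wrong.
print(ds_then_rate)   # [(20, 2.5), (40, 2.5)]
# Rating first confines the damage to the single interval containing the reset.
print(rate_then_ds)   # [(0, 10.0), (20, -5.0), (40, 10.0)]
```

With rate first, only the one interval containing the reset is off (and #465-style spike suppression could fix that point); with downsample first, the reset contaminates every bucket it touches.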
Being able to get downsampling with counters is pretty high on my checklist. For instance, if I want to get an idea of resource utilization, I might want to look over a few weeks and see what the max was every 10 minutes or so.
So it is high impact for me not to be able to display graphs like that without slurping all the data into something like pandas and doing my downsampling there.
Playing around with a spreadsheet it looks like moving rates before downsampling would get us closer to the truth though you'll still have those spikes until #465.
It looks like you are not using the counter feature in those spreadsheets to suppress the spikes? I updated the issue title to reflect the actual bug. When using OpenTSDB's counter feature, I expect the spikes to be suppressed, but this bug causes them not to be.
However, if you expect #465 to also solve this, I would be fine with that. (We have started avoiding rates altogether, since having to specify the counter rollover value adds too much of a burden when creating graphs with grafana.)
@manolama Also, won't this mess up aggregation with Max as well? The result will just be whichever counter is highest, which won't reflect the "top" rate, but rather whichever count happens to be highest due to uptime.
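A small sketch of that concern (hypothetical hosts and values): taking max over raw counters before computing the rate just selects whichever counter is numerically largest, which tracks uptime rather than throughput:

```python
# Sketch of the max-aggregation concern; host names and data are invented.

def rate(points):
    """Per-second rate between consecutive (timestamp, value) points."""
    return [(t2, (v2 - v1) / (t2 - t1))
            for (t1, v1), (t2, v2) in zip(points, points[1:])]

# Host A: up for ages, so its counter is huge, but its rate is only 1/s.
host_a = [(0, 10000), (10, 10010), (20, 10020)]
# Host B: recently restarted, small counter, but a 50/s rate.
host_b = [(0, 100), (10, 600), (20, 1100)]

# max-then-rate: host A's large counter wins every point, hiding B entirely.
max_agg = [(ta, max(va, vb)) for (ta, va), (_, vb) in zip(host_a, host_b)]
max_then_rate = rate(max_agg)
print(max_then_rate)   # [(10, 1.0), (20, 1.0)] -- just host A's uptime artifact

# rate-then-max: compares actual rates, so host B's 50/s shows through.
rate_then_max = [max(ra, rb)
                 for (_, ra), (_, rb) in zip(rate(host_a), rate(host_b))]
print(rate_then_max)   # [50.0, 50.0]
```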
For reference, see the thread Counters, Order of Operations, and Results that nobody would expect on the mailing list.
+1
Same issue here! Investigating network counters, I need to be able to see the spikes, but instead, with downsampling, they get flattened out...
@manolama Any updates on this or #462 ?
Based on the following recommendation in the documentation, we recently switched from doing our own rate calculation prior to submitting data points to letting Opentsdb handle rate calculations:
However, now we see a bug in OpenTSDB's implementation that is clearly visible in the following Grafana graphs.
We have the following raw counter data. If we use OpenTSDB's rate function, we get the (correct) output like so:
However, if we ask OpenTSDB to downsample the data, sometimes a sampling interval can intersect the counter rollover. As expected, OpenTSDB's rate calculation does not handle this point well; applying the rate calculation after the downsampling gives an incorrect spike:
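The spike can be reproduced numerically. The sketch below (invented values, with a hypothetical 16-bit counter_max standing in for OpenTSDB's counter-max option) shows how averaging across the rollover makes the rollover correction fire on a wildly wrong delta:

```python
# Reproduction sketch of the spike. counter_max=65535 is an invented 16-bit
# stand-in for OpenTSDB's counter-max option; timestamps/values are made up.

COUNTER_MAX = 65535

def counter_rate(points, counter_max=COUNTER_MAX):
    """Rate that treats a negative delta as a counter rollover."""
    out = []
    for (t1, v1), (t2, v2) in zip(points, points[1:]):
        delta = v2 - v1
        if delta < 0:                       # assume the counter wrapped
            delta += counter_max + 1
        out.append((t2, delta / (t2 - t1)))
    return out

def downsample_avg(points, interval):
    """Average the values that fall into each fixed-size time bucket."""
    buckets = {}
    for t, v in points:
        buckets.setdefault(t - t % interval, []).append(v)
    return [(t, sum(vs) / len(vs)) for t, vs in sorted(buckets.items())]

# A counter climbing at a steady 100/s that wraps between t=0 and t=10.
raw = [(0, 65000), (10, 464), (20, 1464), (30, 2464)]

# Rate first: the rollover correction sees the true delta -> flat 100/s.
rate_first = downsample_avg(counter_rate(raw), 20)
print(rate_first)         # [(0, 100.0), (20, 100.0)]

# Downsample first: the averaged buckets (32732, 1964) straddle the wrap, so
# the "rollover" correction fires on a bogus delta and produces a huge spike.
downsample_first = counter_rate(downsample_avg(raw, 20))
print(downsample_first)   # [(20, 1738.4)] -- ~17x the real rate
```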
Perhaps the order outlined in the documentation:
should be different? It looks like the rate calculation should come earlier in the pipeline, before downsampling.