OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

OpenTSDB /api/query bug when returning rate counted values #947

Open davidgomes opened 7 years ago

davidgomes commented 7 years ago

We have encountered a bug in OpenTSDB where the /api/query endpoint returns a very different value (on the order of 10^15, 10^17, etc.) from the one shown in the TSDB GUI for the exact same query. This happens when using Rate + Rate Counter on a monotonically increasing metric.

The TSDB UI shows the following chart for a given metric and a given time range.

[Screenshot: tsdb1]

The metric in question is CPU Utilization (os.cpu from scollector). It is a monotonically increasing value (collected from /proc/stat), so we need the Rate option in TSDB to show deltas. When a machine is shut down, the counter goes back to 0, which produces a negative delta when the machine comes back up. This is fine, because we can use OpenTSDB's Rate Counter option to avoid it:

[Screenshot: tsdb2]

In this image we can see that the TSDB Rate Counter option seems to solve this problem quite nicely. However, when querying the /api/query endpoint, we don't get the same behavior:

curl -v https://URL/api/query --data '{"start": 1486430873, "queries": [{ "metric": "os.cpu", "rate": true, "rateOptions": { "counter": true }, "tags": { "stack": "stack", "host": "host-name" }, "aggregator": "avg" }] }' -H "Content-Type: application/json"

For the time range in question, this endpoint returns:

      "1489612925": 0.6333333333333333,
      "1489612940": 0.5333333333333333,
      "1489612955": 1.8333333333333333,
      "1489612970": 0.5,
      "1489616968": 2306996507467345.5,
      "1489616983": 11.866666666666667,
      "1489616998": 0.4,
      "1489617013": 0.35,
      "1489617028": 0.36666666666666664,
      "1489617043": 0.38333333333333336,
      "1489617058": 0.36666666666666664,

The most important value in the /api/query "dps" output is the following:

      "1489616968": 2306996507467345.5,

This Unix timestamp corresponds to the 22:30 hour in the second image, where the TSDB UI shows a value of around 12. The exact same query (same timestamp, same aggregator, same tags, same rate and rate options, no downsampling) sent to /api/query returns a very different value from the one shown in the UI.

We are not sure what is going on, but we suspect either an integer overflow or a bug in /api/query (returning a value that is too large) or in the TSDB UI (returning a value that is too small).
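One way to test the overflow suspicion is to check the arithmetic. The sketch below is only a hypothesis, not confirmed OpenTSDB behavior: it assumes that a counter-aware rate treats a decrease as a rollover and adds a counter maximum equal to Java's Long.MAX_VALUE (2^63 - 1). The function name and the two raw counter values are made up for illustration; only the two timestamps and the reported rate come from the output above.

```python
import math

# Assumption: the counter maximum defaults to Java's Long.MAX_VALUE.
JAVA_LONG_MAX = 2**63 - 1


def counter_rate(prev_value, cur_value, dt_seconds, counter_max=JAVA_LONG_MAX):
    """Per-second rate that treats a decrease as a counter rollover.

    Hypothetical helper, not OpenTSDB code.
    """
    delta = cur_value - prev_value
    if delta < 0:
        # Rollover assumption: the counter wrapped at counter_max.
        delta = counter_max - prev_value + cur_value
    return delta / dt_seconds


# Gap between the last point before the spike and the spike itself.
dt = 1489616968 - 1489612970

# Hypothetical raw counter values before and after the reboot. The result
# is insensitive to them because they are tiny compared with 2^63.
prev_value = 1_000_000
cur_value = 50_000

rate = counter_rate(prev_value, cur_value, dt)
reported = 2306996507467345.5

print(rate)
print(math.isclose(rate, reported, rel_tol=1e-9))
```

Under these assumptions the computed rate agrees with the reported 2306996507467345.5 to within a relative tolerance of 1e-9, which would be consistent with the reset being handled as a rollover of a 64-bit counter rather than as a restart from 0.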

If we don't use Rate Counter, the values do match (but they are negative, and we need values >= 0 for the CPU Utilization metric).
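Something that may be worth trying is setting the other rateOptions fields explicitly. The sketch below only builds the request body for the same query. It assumes that rateOptions accepts counterMax and resetValue, as described in the OpenTSDB 2.x HTTP API documentation as far as we recall; the field names should be verified against the running version. The build_query helper and the threshold of 1000 are hypothetical, and whether this removes the spike is untested.

```python
import json


def build_query(start, metric, tags, counter_max=None, reset_value=None):
    """Build an /api/query body with counter-aware rate options.

    Hypothetical helper; counterMax and resetValue are assumed field names.
    """
    rate_options = {"counter": True}
    if counter_max is not None:
        rate_options["counterMax"] = counter_max
    if reset_value is not None:
        # Assumed meaning: rates above this value are reported as 0.
        rate_options["resetValue"] = reset_value
    return {
        "start": start,
        "queries": [
            {
                "metric": metric,
                "rate": True,
                "rateOptions": rate_options,
                "tags": tags,
                "aggregator": "avg",
            }
        ],
    }


body = build_query(
    1486430873,
    "os.cpu",
    {"stack": "stack", "host": "host-name"},
    reset_value=1000,  # hypothetical threshold, well above any real CPU rate
)
print(json.dumps(body))
```

The printed JSON can be passed to the same curl command via --data.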

manolama commented 7 years ago

Sorry for the long wait. It could be that Gnuplot is kicking out the errant value. If possible could you capture the gnuplot file (defaults to the /tmp directory) so we can see what value is written there? Thanks.