blue-yonder / azure-cost-mon

Prometheus exporter for the Azure billing API
MIT License

Constant costs might decrease due to numerical instabilities #12

Open StephanErb opened 7 years ago

StephanErb commented 7 years ago

Exported data contains long floating point numbers such as 0.0000032120027740023963. Due to the nature of floating point numbers, aggregating those can lead to unstable results, as the addition is not associative and therefore order-dependent. This is problematic as it prevents us from defining meaningful aggregation rules over the exported cost data.

Please see https://github.com/prometheus/prometheus/issues/2951 for details.

We don't need full precision here, so we should round the results to 2 or 3 decimal places before emitting.
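As a sketch of the proposed fix (assuming the exporter uses prometheus_client with one Gauge per cost series; the label set and helper below are illustrative, not the actual exporter code):

from prometheus_client import Gauge

# Floating point addition is not associative, so the summation order matters:
#   (0.1 + 0.2) + 0.3  ->  0.6000000000000001
#   0.1 + (0.2 + 0.3)  ->  0.6

azure_costs_eur = Gauge('azure_costs_eur', 'Costs reported by the Azure billing API',
                        ['subscription', 'resource_group'])  # hypothetical label set

def emit(subscription, resource_group, cost):
    # Two decimal places are plenty for EUR amounts and keep the emitted value stable.
    azure_costs_eur.labels(subscription, resource_group).set(round(cost, 2))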

StephanErb commented 7 years ago

Adopted workaround:

# FIXME: round is a workaround for https://github.com/blue-yonder/azure-cost-mon/issues/12
job:azure_costs_eur:sum =
    round(sum(azure_costs_eur))
ManuelBahr commented 7 years ago

The problem is even worse. Due to these instabilities, a counter can take on a smaller value even though it should be constant. This results in a counter reset and totally invalidates the results of the increase() or rate() functions within Prometheus.
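To illustrate the effect, here is a sketch that roughly mirrors Prometheus' counter reset detection (ignoring extrapolation); the sample values are made up:

def increase(samples):
    # A sample lower than its predecessor is treated as a counter reset,
    # so the full new value is added instead of the (negative) delta.
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

steady   = [100.0, 100.0, 100.0, 100.0]
flapping = [100.0, 100.0000001, 99.9999999, 100.0]  # tiny dip from the API

print(increase(steady))    # 0.0 -- a constant cost shows no increase
print(increase(flapping))  # ~100.0 -- the tiny dip is misread as a counter reset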

ManuelBahr commented 7 years ago

Released v0.4.1 to fix the issue.

ManuelBahr commented 7 years ago

Going to integers reduced the probability of the problem occurring, but we might still see flaps and counter resets as a result. Essentially, these are items that are constant in reality (resources that have been decommissioned), but the reported value flaps around an integer boundary.
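A small illustration of the remaining failure mode (the reported values are hypothetical, hovering just above and below an integer boundary):

# A decommissioned resource whose true cost is constant, but the API reports
# values that hover around 100.5 EUR:
reports = [100.4999997, 100.5000002, 100.4999998, 100.5000001]

# Rounding to whole integers (as done since going to integers) makes the series flap:
print([round(v) for v in reports])  # [100, 101, 100, 101] -- every drop back to 100
                                    # looks like a counter reset to Prometheus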

StephanErb commented 7 years ago

We are using the following aggregation rules as a workaround right now:

#
# Azure
#
# We deviate from the Prometheus recording rule naming convention here by not specifying the
# aggregation labels before the first colon in the rule names. We simply don't know which labels
# are present. In any case, we still need the recording rules as increase/changes over 2 days
# are costly to compute for plots.
#

# Cost increase over the last 2 days
# We cannot use the normal increase() function here as the Azure API provides slightly
# fluctuating costs. Those would be interpreted as counter resets, leading to wrong results.
azure_costs_eur:increase2d =
  (azure_costs_eur - azure_costs_eur offset 2d)

# Number of updates from the Azure API over the last 2 days. The Azure API provides new data about
# once a day, but not at a fixed time. So we expect this value to be either 1 or 2.
azure_costs_eur:changes2d =
  changes(azure_costs_eur[2d])

# This metric shows our total daily costs. Due to the slow-moving counters provided by the Azure API,
# the value is computed as the average over the last 2 days. In Prometheus terms, we emit
# the average increase per observed change over a 2 day window. As we only see ~1 change per day,
# this corresponds to our daily costs.
job:azure_costs_eur:mean2d =
  sum(
      (azure_costs_eur:increase2d > 0)
    /
      (azure_costs_eur:changes2d  > 0) # We need the > 0 filter to prevent the propagation of NaN.
  )
ManuelBahr commented 7 years ago

I think I ruled out that the non-associativity of floats is the problem. I could track the flaps down to occurring only during updates of the API. Also, I wasn't able to reproduce the flapping with Python floats. So I suspect the issue is on the server side and not in the code that aggregates. However, rounding or truncating in some way might still fix it.
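A sketch of such a reproduction attempt (the magnitudes are hypothetical, not the actual exported data):

import random

# hypothetical per-resource cost items, roughly in the magnitude range of the exported data
items = [random.uniform(1e-7, 1e-3) for _ in range(10_000)]

sums = []
for _ in range(1_000):
    random.shuffle(items)
    sums.append(sum(items))

# The order-dependent spread is on the order of 1e-12 EUR -- far too small to move
# a total across a cent boundary, let alone an integer one.
print(max(sums) - min(sums))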