Netflix-Skunkworks / spectatord

A high performance metrics daemon
Apache License 2.0

Disable Large Value Reporting for Unexpected Overflow Conditions in MonotonicCounterUint #98

Closed copperlight closed 4 weeks ago

copperlight commented 4 weeks ago

We have a Go process that uses the MonotonicCounterUint type to report values to Atlas. The process restarts periodically, and each restart looks like an overflow, so very large (petabyte-class) values are recorded for a single minute until the counter re-stabilizes. We want to report zeros instead of these very large values, because that will be less disruptive to the graph.

We propose the following verification check:
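The exact check is not reproduced here. As a minimal sketch of what such a check could look like, assuming the delta is computed with wrapping unsigned subtraction: the names `kMaxDelta` and `checked_delta` are hypothetical and are not taken from the spectatord source.

```cpp
#include <cstdint>

// Hypothetical threshold: a delta above 2^63 is treated as a counter reset
// rather than a genuine overflow.
constexpr std::uint64_t kMaxDelta = std::uint64_t{1} << 63;

// Returns the amount to record for a new sample of a monotonic uint64 source.
// Unsigned subtraction wraps modulo 2^64, so a genuine overflow still yields
// the correct small delta, while a restart (curr far below prev) yields a
// delta close to 2^64, which the threshold replaces with zero.
inline std::uint64_t checked_delta(std::uint64_t prev, std::uint64_t curr) {
  const std::uint64_t delta = curr - prev;
  return delta > kMaxDelta ? 0 : delta;
}
```

With this sketch, `checked_delta(100, 150)` is 50, a real wraparound such as `checked_delta(UINT64_MAX - 4, 5)` is 10, and a restart such as `checked_delta(1500, 0)` is 0.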

The value 2^63 is a convenient threshold (half of the unsigned 64-bit range) that is absurdly high for most single increments, so it is unlikely to accidentally catch a legitimate increment. If something is giving us deltas that big, then the monotonic counter is going to be pretty useless anyway, as it will be constantly overflowing.

We think this check will catch most issues. The threshold, 2^63, is about 9.22e18. If you had some process that starts at 0, increments by 100 GB/s for 90 days (about 7.8e17 in total), and was then restarted and sent a 0, the wrapped delta would be about 1.77e19, which is still over 2^63. You could still have problems if something accumulated more than 2^63 and then reset, but we do not think that will be very common.
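The arithmetic behind the 100 GB/s example can be checked at compile time. This is a small sketch that assumes decimal gigabytes (1 GB = 1e9 bytes); all constant names are illustrative.

```cpp
#include <cstdint>

constexpr std::uint64_t kBytesPerSecond = 100'000'000'000ULL;  // 100 GB/s
constexpr std::uint64_t kSeconds = 90ULL * 24 * 60 * 60;       // 90 days
constexpr std::uint64_t kTotal = kBytesPerSecond * kSeconds;   // 7.776e17

// Delta seen when the next sample is 0: (0 - total) mod 2^64, about 1.767e19.
constexpr std::uint64_t kWrappedDelta = std::uint64_t{0} - kTotal;
constexpr std::uint64_t kThreshold = std::uint64_t{1} << 63;   // about 9.22e18

static_assert(kTotal == 777'600'000'000'000'000ULL, "90 days at 100 GB/s");
static_assert(kWrappedDelta > kThreshold, "the restart is caught by the check");
```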

A common way that the large updates get sent is as follows:
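The original example is not reproduced here. One plausible sequence, assumed purely for illustration, is a client that reports its running total each interval and starts again near zero after a restart. The sketch below computes wrapping deltas with no reset check applied; `naive_deltas` is a hypothetical helper, not a spectatord function.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Wrapping deltas between consecutive samples, with no reset check applied.
inline std::vector<std::uint64_t> naive_deltas(
    const std::vector<std::uint64_t>& samples) {
  std::vector<std::uint64_t> out;
  for (std::size_t i = 1; i < samples.size(); ++i) {
    // Unsigned subtraction wraps modulo 2^64 when the value goes backwards.
    out.push_back(samples[i] - samples[i - 1]);
  }
  return out;
}
```

For the samples 1000, 1500, 20, 70 (a restart between the second and third sample), the deltas are 500, 2^64 - 1480, and 50. The middle value is the single-interval spike described above.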

This only needs to be done for the MonotonicCounterUint type, because it is the only one that handles overflow conditions. The MonotonicCounter does not, because overflow conditions there are expected to be rare: it should be used mostly for conversions back to base units (e.g. nanos -> seconds).