We have a Go process that uses the MonotonicCounterUint type to report values to Atlas. The process restarts periodically, and each restart looks like a counter overflow, so very large (petabyte-class) values are recorded for a single minute until the counter re-stabilizes. We want to report zeros instead of these very large values, because that is less disruptive to the graph.
We propose the following verification check:
If the delta is greater than 2^63, assume that an unexpected overflow has occurred and report 0.
The value 2^63 is a convenient number (half of the 64-bit range) that is absurdly high for a single increment, so it is unlikely to catch a legitimate increment by accident. If something really is producing deltas that big, the monotonic counter is of little use anyway, because it will be overflowing constantly.
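A minimal sketch of the proposed check, assuming the delta is computed with Go's wrapping unsigned subtraction. The names guardedDelta and overflowThreshold are hypothetical and are not taken from the library, whose actual delta computation may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// overflowThreshold is 2^63, half of the uint64 range.
const overflowThreshold uint64 = 1 << 63

// guardedDelta is a hypothetical helper, not the library's code.
// It computes the increment between two samples and reports 0
// when that increment is larger than 2^63.
func guardedDelta(prev, curr uint64) uint64 {
	// Unsigned subtraction wraps modulo 2^64, so when curr < prev
	// this yields 2^64 - prev + curr, i.e. the "overflow" delta.
	delta := curr - prev
	if delta > overflowThreshold {
		return 0 // treat as a restart, not a real increment
	}
	return delta
}

func main() {
	fmt.Println(guardedDelta(100, 250))             // ordinary increment: 150
	fmt.Println(guardedDelta(100000, 0))            // restart: 0
	fmt.Println(guardedDelta(math.MaxUint64-5, 10)) // genuine wraparound: 16
}
```

The comparison is strict, so a delta of exactly 2^63 is still reported. A genuine wraparound near the top of the range produces a small delta and passes through unchanged.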
The threshold 2^63 is about 9.22e18, and we think it will catch most issues. If a process started at 0, incremented by 100 GB/s for 90 days, and was then restarted and sent a 0, the counter would have reached only about 7.8e17, so the resulting delta would still be over 2^63. You could still have problems if something accumulated more than 2^63 and then reset, because the delta would then fall below the threshold, but we do not think that will be very common.
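The arithmetic behind the 90-day example can be checked with a short sketch. It assumes decimal gigabytes (1 GB = 1e9 bytes) and a wrapping unsigned delta; the helper names are hypothetical.

```go
package main

import "fmt"

const (
	threshold     uint64 = 1 << 63      // 2^63, about 9.22e18
	bytesPerSec   uint64 = 100000000000 // 100 GB/s, assuming decimal gigabytes
	secondsPerDay uint64 = 24 * 60 * 60
)

// accumulated returns the counter value after the given number of
// days at bytesPerSec, starting from 0.
func accumulated(days uint64) uint64 {
	return days * secondsPerDay * bytesPerSec
}

// resetDelta is the wrapped delta observed when a counter at prev
// is followed by a sample of 0.
func resetDelta(prev uint64) uint64 {
	return 0 - prev // wraps modulo 2^64
}

func main() {
	total := accumulated(90)
	fmt.Println("counter after 90 days:", total)
	fmt.Println("reset delta exceeds 2^63:", resetDelta(total) > threshold)

	// Whole days at this rate before the counter itself passes 2^63,
	// after which a reset would no longer be caught by the check.
	fmt.Println("days until blind spot:", threshold/bytesPerSec/secondsPerDay)
}
```

Under these assumptions the counter reaches 777600000000000000 after 90 days, and it would take 1067 whole days at the same rate before a reset could slip under the threshold.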
A common way that the large updates get sent is as follows:
Values are recorded up to 100,000. The process then restarts, and the next value is 0.
Now curr = 0 and prev = 100k, so curr < prev (src), and you would get 2^64 - 100k (prev) + 0 (curr) + 1, which is approximately 2^64.
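To make the size of that value concrete, the sketch below evaluates the expression exactly as written above, using math/big because 2^64 itself does not fit in a uint64. The function names are hypothetical, and this is an illustration of the arithmetic rather than the code behind the (src) link.

```go
package main

import (
	"fmt"
	"math/big"
)

// restartDelta evaluates 2^64 - prev + curr + 1 with arbitrary
// precision, following the expression quoted in the text.
func restartDelta(prev, curr uint64) *big.Int {
	d := new(big.Int).Lsh(big.NewInt(1), 64) // 2^64
	d.Sub(d, new(big.Int).SetUint64(prev))
	d.Add(d, new(big.Int).SetUint64(curr))
	d.Add(d, big.NewInt(1))
	return d
}

// exceedsHalfRange reports whether d is greater than 2^63.
func exceedsHalfRange(d *big.Int) bool {
	threshold := new(big.Int).Lsh(big.NewInt(1), 63) // 2^63
	return d.Cmp(threshold) > 0
}

func main() {
	d := restartDelta(100000, 0)
	fmt.Println("delta:", d)                           // 18446744073709451617
	fmt.Println("exceeds 2^63:", exceedsHalfRange(d)) // true
}
```

The result is within 100,000 of 2^64, far above the 2^63 threshold, so the proposed check would report 0 for this restart.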
This check is only needed for the MonotonicCounterUint type, because it is the only one that handles overflow conditions. The MonotonicCounter type does not, because overflow is expected to be rare there: it should be used mostly for conversions back to base units (e.g. nanos -> seconds).