Netflix-Skunkworks / atlas-system-agent

Agent that reports system metrics through SpectatorD.
Apache License 2.0

Re-work gather_slow_system_metrics to Operate as a Fixed Rate instead of a Fixed Delay #135

Closed: copperlight closed this issue 2 months ago

copperlight commented 2 months ago

In the environment, we have noticed that some of the slow metrics, particularly the CPU gauges, can drift just enough that they appear to have no increments for one minute and double increments for the next. This behavior may persist for an hour or two before returning to normal. We think this is due to a slight amount of drift in the fixed 60-second delay, which causes the gather to cross minute boundaries.

When we move to a fixed rate, we want the increments to land roughly in the middle of the minute, factoring in the delay associated with gathering slow metrics, as well as the 5-second publishing interval for spectatord.

See the spectator-java Scheduler for an example implementation:

https://github.com/Netflix/spectator/blob/main/spectator-api/src/main/java/com/netflix/spectator/impl/Scheduler.java#L406
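
For illustration, a minimal sketch (using std::chrono, not the agent's actual scheduling code) of the difference between the two policies. With a fixed delay, the sleep is anchored to the end of the previous gather, so the gather time accumulates as drift; with a fixed rate, the next run is anchored to the schedule itself, so the offset within the minute stays constant:

#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch, not the agent's implementation.
//
// Fixed delay: the next run is anchored to the end of the previous gather,
// so the gather time accumulates as drift over the day:
//
//   while (true) { gather(); std::this_thread::sleep_for(60s); }
//
// Fixed rate: the next run is anchored to the original schedule, so the
// offset within the minute stays constant regardless of gather time.
void fixed_rate_loop(const std::function<void()>& gather) {
  using namespace std::chrono_literals;
  auto next_run = std::chrono::steady_clock::now();
  while (true) {
    gather();
    next_run += 60s;  // advance the schedule, not "now + 60s"
    std::this_thread::sleep_until(next_run);
  }
}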

copperlight commented 2 months ago

I produced a debug build which logs timings for the gathers and scheduling.

On my test system (bionic), which does little work, the first gather takes ~40ms, and subsequent gathers take ~35ms. The next slow run time stays consistent from minute to minute. We will need to take a look at a loaded system to see what kinds of delays are experienced there.

[2024-08-28 17:43:23.547] [main_logger] [info] Start gathering EC2 system metrics
[2024-08-28 17:43:23.548] [main_logger] [info] Gather EC2 system metrics (start=2024-08-28 17:43:23.548037780)
[2024-08-28 17:43:23.590] [main_logger] [info] Gather EC2 system metrics (  end=2024-08-28 17:43:23.590100824 elapsed=42ms)
[2024-08-28 17:44:23.590] [main_logger] [info] Gather EC2 system metrics (start=2024-08-28 17:44:23.590423030)
[2024-08-28 17:44:23.621] [main_logger] [info] Gather EC2 system metrics (  end=2024-08-28 17:44:23.621470758 elapsed=31ms)
[2024-08-28 17:44:23.621] [main_logger] [info] next_slow_run=2024-08-28 17:45:23.590143117
[2024-08-28 17:45:23.590] [main_logger] [info] Gather EC2 system metrics (start=2024-08-28 17:45:23.590347409)
[2024-08-28 17:45:23.619] [main_logger] [info] Gather EC2 system metrics (  end=2024-08-28 17:45:23.619033899 elapsed=28ms)
[2024-08-28 17:45:23.619] [main_logger] [info] next_slow_run=2024-08-28 17:46:23.590143117
[2024-08-28 17:46:23.590] [main_logger] [info] Gather EC2 system metrics (start=2024-08-28 17:46:23.590403662)
[2024-08-28 17:46:23.620] [main_logger] [info] Gather EC2 system metrics (  end=2024-08-28 17:46:23.620208861 elapsed=29ms)
[2024-08-28 17:46:23.620] [main_logger] [info] next_slow_run=2024-08-28 17:47:23.590143117
[2024-08-28 17:47:23.590] [main_logger] [info] Gather EC2 system metrics (start=2024-08-28 17:47:23.590433890)
[2024-08-28 17:47:23.622] [main_logger] [info] Gather EC2 system metrics (  end=2024-08-28 17:47:23.621979803 elapsed=31ms)
[2024-08-28 17:47:23.622] [main_logger] [info] next_slow_run=2024-08-28 17:48:23.590143117
copperlight commented 2 months ago

On one of the systems where we were seeing CPU metrics issues, the gathers were taking ~100ms, and the next slow run scheduling appeared to be stable. We should only see drift in cases where gathers take longer than one second.
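
If a gather ever does overrun the period, a fixed-rate scheduler typically skips ahead to the next slot rather than letting the offset drift, similar in spirit to the skip-if-long handling in the spectator-java Scheduler linked above. A rough sketch, with hypothetical names that are not the agent's code:

#include <chrono>

// Hypothetical helper, not the agent's code: given the time a run was
// scheduled for, compute the next scheduled run under a fixed-rate,
// skip-if-long policy. If the gather overran one or more periods, skip
// the missed slots instead of letting the offset within the minute drift.
std::chrono::steady_clock::time_point next_fixed_rate_run(
    std::chrono::steady_clock::time_point scheduled,
    std::chrono::steady_clock::duration period,
    std::chrono::steady_clock::time_point now) {
  auto next = scheduled + period;
  while (next <= now) {
    next += period;  // slot already passed: skip it, keep the same offset
  }
  return next;
}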

On some systems, the main metric demonstrating the issue is sys.core.cpuUtilization, but it can also be observed in other metrics, such as net.tcp.opens, across both the active and passive opens.

It does not appear that spectatord is contributing to the issue - we are not seeing publishing drops that correlate with the periods of odd metric values.

copperlight commented 2 months ago

Brian confirmed that the issue here is due to atlas-system-agent publishing values very close to the minute boundary (purely by random chance). While atlas-system-agent and spectatord have behaved this way for years, recent changes in atlas-aggregator sharding have added the possibility of a little more delay, which can cause the metrics to cross over into the next minute, leading to spikes and gaps.

==== Brian Harrington ====

Checking the logs of the instance for a particular value:

2024-08-29T06:37:00Z    8.733334    15.650000
2024-08-29T06:38:00Z    9.100000    31.966667
2024-08-29T06:39:00Z    8.700000    0.000000
2024-08-29T06:40:00Z    9.666667    16.250000

The system agent is reporting each minute very close to the minute boundary:

$ sudo journalctl -u atlas-system-agent | grep -E 'Aug 29 06:3[8-9]'
Aug 29 06:38:59 start-atlas-system-agent[3479]: [2024-08-29 06:38:59.126] [main_logger] [info] Gathering system metrics
Aug 29 06:39:59 start-atlas-system-agent[3479]: [2024-08-29 06:39:59.126] [main_logger] [info] Gathering system metrics

SpectatorD is sending every 5s with a bit of fluctuation. I trimmed the log to the first send after the system metrics were pushed:


$ sudo journalctl -u spectatord | grep -E 'Aug 29 06:3[8-9]'
...
Aug 29 06:38:59 start-spectatord[3474]: [2024-08-29 06:38:59.901] [spectator] [info] Sent: 370 Dropped: 0 Total: 370. Elapsed 0.216s
...
Aug 29 06:39:59 start-spectatord[3474]: [2024-08-29 06:39:59.894] [spectator] [info] Sent: 376 Dropped: 0 Total: 376. Elapsed 0.214s

These sends land ~100ms from the minute boundary. So essentially, what I think is happening here is that the problematic instances just happen to have their system collection start time close to the minute boundary.

During some windows, spectatord will do the 5s send in between the system collection time and the minute boundary. In those cases, the aggregator probably takes long enough to process the data that it is a bit of a toss-up which interval it is attributed to.

While this was always possible, the recent aggregator sharding changes likely increase the probability of it happening, because the data now takes an additional hop, which adds a small amount of delay for things like network time, parsing, GC pauses, etc.

copperlight commented 2 months ago

It was possible to find this timing discrepancy in one particular application because it was running old versions of atlas-system-agent and spectatord, which still published status messages to their log files; these helped correlate the timing of updates. Newer versions moved those log messages to the debug level to save on log volume, given how repetitive they can be. Config overrides would have been necessary to expose those messages again; we should keep this in mind if we have to revisit something like this in the future.

However, the fix now becomes having the atlas-system-agent process wait to start collecting data until 20 seconds past the minute. The collection delay remains small enough that the schedule will stay stable at that time on the instance providing the source data, and the 40 seconds remaining in the minute provide plenty of time for pipeline processing.

Note that we will want to randomize the gathering time throughout the minute, with a buffer on either side, to avoid thundering-herd issues. The randomization is provided by the "now" calculation for each system that runs this code.

https://github.com/Netflix/spectator/blob/main/spectator-reg-atlas/src/main/java/com/netflix/spectator/atlas/AtlasConfig.java#L223
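
A rough sketch of how such an initial delay could be computed, using the wall-clock offset within the minute as the source of per-instance spread. The function name and buffer width here are assumptions for illustration, not the agent's implementation:

#include <chrono>

// Hypothetical sketch, not the agent's implementation: wait long enough
// before the first gather that collection lands inside the minute with a
// buffer on either side. The offset within the minute differs per instance,
// which provides the natural spread that avoids a thundering herd.
std::chrono::milliseconds initial_gather_delay(
    std::chrono::system_clock::time_point now,
    std::chrono::milliseconds step = std::chrono::minutes{1},
    std::chrono::milliseconds buffer = std::chrono::seconds{20}) {
  using std::chrono::milliseconds;
  auto since_epoch =
      std::chrono::duration_cast<milliseconds>(now.time_since_epoch());
  auto offset = since_epoch % step;   // where "now" falls within the minute
  if (offset < buffer) {
    return buffer - offset;           // too early: wait for the buffer to pass
  }
  if (offset > step - buffer) {
    return (step - offset) + buffer;  // too late: wait for the next minute's buffer
  }
  return milliseconds{0};             // already inside the safe window
}

With a 20-second buffer, the first gather lands somewhere in the 20-40 second window of the minute; after that, the fixed-rate schedule keeps it there.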