DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[cpu] higher check runner counts may produce CPU spikes #919

Open truthbk opened 6 years ago

truthbk commented 6 years ago

Describe what happened:

Due to the way we process and schedule checks, when the number of check runner goroutines is high there is a chance of experiencing CPU spikes.

As discussed, we believe this is because we schedule checks to run at fixed intervals: when the number of runners is high, and in particular when the checks are likely to wait on system calls (check instances - i.e. check runs - release the GIL while waiting for the OS to return), the number of Python checks running concurrently goes up, and with it the CPU utilization. A lower number of check runners reduces the concurrency and lowers the CPU utilization.

A single Python check runner (excluding long-running checks) replicates the Agent 5 behavior, where instances ran serially - resulting in the lowest possible CPU load.
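A rough sketch of the effect (illustrative only, not the Agent's actual runner code): with an I/O-bound check body, a single worker runs instances serially, while a high worker count keeps many Python checks in flight at once.

import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_check(instance_id):
    # Stand-in for a check instance that mostly waits on the OS
    # (and therefore releases the GIL while it waits).
    time.sleep(0.5)
    return instance_id

def run_cycle(num_runners, instances):
    # num_runners=1 runs instances serially (Agent 5-like, lowest CPU load);
    # a high num_runners keeps many check instances running concurrently.
    with ThreadPoolExecutor(max_workers=num_runners) as pool:
        return list(pool.map(io_bound_check, instances))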

The overall CPU usage also showed an increase even after being averaged out, so it's not just a spike but a sustained increase in overall CPU load (scheduling and context-switching overhead?).

Describe what you expected:

A flatter and lower CPU profile/footprint would be preferable.

Steps to reproduce the issue:

Possible fixes

olivielpeau commented 6 years ago

On a machine with 2 cores, running about 15 Python checks total (most of them process checks), here's a typical CPU graph depending on the number of check runners that are running "real" checks (i.e. excluding the runners that run long-running checks):

[Screenshot: CPU usage graph for different numbers of check runners, 2017-12-08]

The spikes happen every 2 minutes because that's the frequency at which process checks refresh their caches (when they do, they call psutil.process_iter and iterate over all processes to find matching processes).
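A minimal sketch of that refresh pattern, assuming a 120-second cache interval (names are illustrative, not the actual process check implementation):

import time
import psutil

CACHE_TTL = 120  # seconds between cache refreshes; illustrative value
_cache = {"ts": 0, "procs": []}

def matching_processes(name):
    # Re-scan all processes only when the cache is stale; this full
    # psutil.process_iter() pass is what produces the periodic CPU spike.
    now = time.time()
    if now - _cache["ts"] > CACHE_TTL:
        _cache["procs"] = list(psutil.process_iter())
        _cache["ts"] = now
    return [p for p in _cache["procs"] if p.name() == name]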

olivielpeau commented 6 years ago

Did some more testing: this behavior of the Python runtime can be reproduced outside of Agent 6, with 2 simple Python scripts that attempt to mimic the behavior of the process check:

Given:

import psutil
def list_psutil_processes():
    for p in psutil.process_iter():
        print(p.name())

Every n seconds (and accounting for the total execution time of each sequence to compute the next run) we run this directly with python:
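Roughly, the driver loop looks like this (a sketch assuming the list_psutil_processes function above and a fixed interval n; the sleep subtracts the time the iteration itself took so runs don't drift):

import time

INTERVAL = 10  # "n" seconds between runs; illustrative value

def run_forever():
    while True:
        start = time.time()
        list_psutil_processes()
        elapsed = time.time() - start
        # Account for the execution time of this sequence when
        # computing the next run, so the schedule stays fixed.
        time.sleep(max(0, INTERVAL - elapsed))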

On Dev account: https://app.datadoghq.com/dash/417599?live=false&page=0&is_auto=false&from_ts=1512757351829&to_ts=1512758196361&tile_size=m

On an 8-core Linux VM, the scripts use on average:

masci commented 6 years ago

In general, I'm more interested in collecting numbers that describe how much overhead a highly concurrent scheduler adds to the metrics collection cycle - the use case here (IO-intensive checks with multiple instances) looks a lot like a corner case, and I'd like to collect more info and feedback before claiming we have a fire to put out.

With this in mind, and regarding the possible fixes:

Modify instance scheduling

I strongly advise against this: the implementation would add significant complexity, especially considering that Autodiscovery can change the number of check instances running at any given time.

Reduce the number of workers.

This can be done to some extent to provide a more reasonable default, but IMO we should still prefer concurrency - users should be able to reduce it, even drastically, but only when spikes happen and actually represent a problem.

xvello commented 6 years ago

Spiking CPU/memory usage will cause issues with the Docker Agent if limits are set:

This means that if we want reliable behaviour in containers, we must aim for the flattest resource usage profile possible. Could we "autoscale" the runner count depending on the total collection time?
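One possible shape for that heuristic (just a sketch, not an implemented Agent feature): drop a runner when a collection cycle finishes well within the interval, add one when it overruns.

def suggested_runner_count(current, total_collection_time, interval,
                           min_runners=1, max_runners=4):
    # Sketch of a possible autoscaling heuristic (names and thresholds
    # are illustrative): grow when cycles overrun the check interval,
    # shrink when they finish comfortably early to flatten CPU usage.
    if total_collection_time > interval and current < max_runners:
        return current + 1
    if total_collection_time < interval * 0.5 and current > min_runners:
        return current - 1
    return current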