DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[cpu] higher check runner counts may produce CPU spikes #919

Open truthbk opened 6 years ago

truthbk commented 6 years ago

Describe what happened:

Due to the way we process and schedule checks, when the number of check runner goroutines is high there is a chance of experiencing CPU spikes.

As discussed, we believe this is because we schedule checks to run at fixed intervals: when the number of runners is high, and in particular when the checks are likely to wait on system calls (check instances - i.e. check runs - release the GIL while waiting for the OS to return), the number of Python checks running concurrently goes up, and with it the CPU utilization. A lower number of check runners reduces the concurrency and lowers the CPU utilization.

A single Python check runner (excluding long-running checks) replicates the Agent 5 behavior, where instances ran serially - resulting in the lowest possible CPU load.
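A rough sketch of the effect (illustrative only, not the Agent's actual runner code): with an I/O-bound check body, a single worker runs instances serially, while a high worker count keeps many Python checks in flight at once.

import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_check(instance_id):
    # Stand-in for a check instance that mostly waits on the OS
    # (and therefore releases the GIL while it waits).
    time.sleep(0.5)
    return instance_id

def run_cycle(num_runners, instances):
    # num_runners=1 runs instances serially (Agent 5-like, lowest CPU load);
    # a high num_runners keeps many check instances running concurrently.
    with ThreadPoolExecutor(max_workers=num_runners) as pool:
        return list(pool.map(io_bound_check, instances))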

The overall CPU usage also showed an increase even after being averaged out, so it's not just a spike but a sustained increase in overall CPU load (scheduling and context-switching overhead?).

Describe what you expected:

A flatter and lower CPU profile/footprint would be preferable.

Steps to reproduce the issue:

Possible fixes

olivielpeau commented 6 years ago

On a machine with 2 cores, running about 15 Python checks total (most of them process checks), here's a typical CPU graph depending on the number of check runners that are running "real" checks (i.e. excluding the runners that run long-running checks):

[Screenshot: CPU usage graph for different numbers of check runners, 2017-12-08]

The spikes happen every 2 minutes because that's the frequency at which process checks refresh their caches (when they do, they call psutil.process_iter and iterate over all processes to find matching processes).
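A minimal sketch of that refresh pattern, assuming a 120-second cache interval (names are illustrative, not the actual process check implementation):

import time
import psutil

CACHE_TTL = 120  # seconds between cache refreshes; illustrative value
_cache = {"ts": 0, "procs": []}

def matching_processes(name):
    # Re-scan all processes only when the cache is stale; this full
    # psutil.process_iter() pass is what produces the periodic CPU spike.
    now = time.time()
    if now - _cache["ts"] > CACHE_TTL:
        _cache["procs"] = list(psutil.process_iter())
        _cache["ts"] = now
    return [p for p in _cache["procs"] if p.name() == name]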

olivielpeau commented 6 years ago

Did some more testing: this behavior of the Python runtime can be reproduced outside of Agent 6, with 2 simple Python scripts that attempt to mimic the behavior of the process check:

Given:

import psutil
def list_psutil_processes():
    for p in psutil.process_iter():
        print(p.name())

Every n seconds (and accounting for the total execution time of each sequence to compute the next run) we run this directly with python:
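Roughly, the driver loop looks like this (a sketch assuming the list_psutil_processes function above and a fixed interval n; the sleep subtracts the time the iteration itself took so runs don't drift):

import time

INTERVAL = 10  # "n" seconds between runs; illustrative value

def run_forever():
    while True:
        start = time.time()
        list_psutil_processes()
        elapsed = time.time() - start
        # Account for the execution time of this sequence when
        # computing the next run, so the schedule stays fixed.
        time.sleep(max(0, INTERVAL - elapsed))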

On Dev account: https://app.datadoghq.com/dash/417599?live=false&page=0&is_auto=false&from_ts=1512757351829&to_ts=1512758196361&tile_size=m

On an 8-core Linux VM, the scripts use on average:

masci commented 6 years ago

In general, I'm more interested in collecting numbers that describe how much overhead a highly concurrent scheduler adds to the metrics collection cycle - the use case here (IO-intensive checks with multiple instances) looks a lot like a corner case, and I'd like to collect more info and feedback before claiming we have a fire to put out.

With this in mind, and regarding the possible fixes:

Modify instance scheduling

I strongly advise against this: the implementation would add significant complexity, especially considering that Autodiscovery can change the number of check instances running at any given time.

Reduce the number of workers.

This can be done to some extent to provide a more reasonable default, but IMO we should still prefer concurrency - users should be able to reduce it, even drastically, but only when spikes happen and actually represent a problem.

xvello commented 6 years ago

Spiking CPU/memory usage will cause issues with the Docker Agent if limits are set:

This means that if we want reliable behaviour in containers, we must aim for the flattest resource usage profile possible. Could we "autoscale" the runner count depending on the total collection time?
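One possible shape for that heuristic (just a sketch, not an implemented Agent feature): drop a runner when a collection cycle finishes well within the interval, add one when it overruns.

def suggested_runner_count(current, total_collection_time, interval,
                           min_runners=1, max_runners=4):
    # Sketch of a possible autoscaling heuristic (names and thresholds
    # are illustrative): grow when cycles overrun the check interval,
    # shrink when they finish comfortably early to flatten CPU usage.
    if total_collection_time > interval and current < max_runners:
        return current + 1
    if total_collection_time < interval * 0.5 and current > min_runners:
        return current - 1
    return current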