DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

High CPU Usage Diagnostic Tools #1852

Open volfco opened 6 years ago

volfco commented 6 years ago

I'm running the datadog agent (v6.1.4) on a box where applications emit a lot of metrics to dogstatsd. Below is the CPU usage graph.

[screenshot: CPU usage graph]

Yep. Yikes (that's 500 to 800% CPU). I'm confident this is caused by the volume of dogstatsd metrics emitted by an application. The problem is that I don't know what's producing the high volume of emissions, and I can't see an easy way of getting this information from the agent.

=========
DogStatsD
=========

  Checks Metric Sample: 108984
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 26
  Series Flushed: 586240
  Service Check: 18255
  Service Checks Flushed: 17924
  Dogstatsd Metric Sample: 6.5249337e+07

Is there a way where I can see the top metrics emitted for the past minute or two? I'd love to figure out what's causing the massive consumption of resources.

Additional environment details (Operating System, Cloud provider, etc): CentOS 7. Physical Hardware

xvello commented 6 years ago

Hi @volfco

We have identified this high cpu/memory usage when dogstatsd has to process/serialize a lot of contexts, and are working on it. Are you using high cardinality tags in your custom metrics?

In order to see metrics coming in, you can run the agent with log_level: trace; each packet will be printed like this:

2018-06-19 08:22:35 UTC | TRACE | (server.go:193 in worker) | Dogstatsd receive: custom.metric.name:1|c

To investigate which subsystem is actually using that much CPU, you can dump a pprof profile with curl http://localhost:5000/debug/pprof/profile > cpu.prof. I recommend you submit it via a support ticket, as it might contain sensitive information.

Regards

volfco commented 6 years ago

After I posted yesterday, I ran a netcat to dump the metrics. It's a combination of some high cardinality metrics, plus a common library that was emitting a few metrics at very high volume.

volfco commented 6 years ago

@xvello What's your definition of high cardinality tags? I think for us, the major source is a large number of very similar metrics. I've also opened a support case (#152005) with a profile from a box that's running at around 600% CPU.

xvello commented 6 years ago

If the metric name is the same but one tag has many values (one usual trap is adding a user_id, for example), every value of that tag will create a different context. This is what we call high cardinality. Adding that many contexts can make the payload size balloon.
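A rough way to picture this (a sketch, not agent code): the aggregator keys time series on a "context" that is approximately the metric name plus its sorted tag set, so a per-user tag turns one metric into one context per user:

```python
def context_key(name, tags):
    """Approximate identity the aggregator keys time series on."""
    return (name, tuple(sorted(tags)))

# One metric with a low-cardinality tag: one context, no matter the volume.
low = {context_key("app.requests", ["env:prod"]) for _ in range(10_000)}

# The same metric tagged with user_id: one context per distinct user.
high = {context_key("app.requests", ["env:prod", f"user_id:{i}"])
        for i in range(10_000)}

print(len(low))   # 1
print(len(high))  # 10000
```

Every one of those 10,000 contexts has to be aggregated and serialized on each flush, which is where the CPU and payload cost comes from.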

xvello commented 6 years ago

Hey @volfco ,

I had a quick look at your dump, and the biggest part of the CPU usage is not in the serialization subsystem (as we observe when cardinality is the issue), but in the syscall portion that reads the UDP packets.

Someone from support will investigate this with you. My first guess is that one of the clients is not using client-side buffering. Client-side buffering sends several metrics per packet, which greatly reduces the load on both the client and dsd. Please refer to each library's documentation on how to enable buffering.
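To illustrate what buffering buys you (a minimal hypothetical client, not the official library's API): the dogstatsd wire format accepts several newline-separated metrics in a single UDP datagram, so a buffered client turns N sends into one packet the agent has to read off the socket:

```python
import socket

class BufferedStatsd:
    """Hypothetical sketch of a buffering dogstatsd client."""

    def __init__(self, host="127.0.0.1", port=8125, max_buffer=25):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.max_buffer = max_buffer
        self.buffer = []

    def increment(self, name, value=1):
        self.buffer.append(f"{name}:{value}|c")
        if len(self.buffer) >= self.max_buffer:
            self.flush()

    def flush(self):
        if self.buffer:
            # 25 metrics -> 1 datagram instead of 25 separate packets
            self.sock.sendto("\n".join(self.buffer).encode(), self.addr)
            self.buffer = []
```

An unbuffered client doing 250k points/s forces 250k recvfrom syscalls per second on the agent side; buffering 25 metrics per datagram cuts that by 25x.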

volfco commented 6 years ago

@xvello Thanks for the info. I've been digging into the kernel side and I do think we're just dumping way too much data into the kernel, overflowing the buffers. That's a different team.

I'll run with support, but I'll assume that the high CPU usage is expected given the volume of metrics, and tell the team that maintains our dogstatsd wrapper that they need to start buffering.

xvello commented 6 years ago

@volfco Indeed, this DSD is ingesting 250k points per second, which explains the high CPU usage. Buffering will help by reducing the number of reads it has to do.

For counters and histograms, dsd also supports sampling to reduce the number of points to transmit.
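Sampling works by having the client drop points probabilistically and annotate each surviving line with the sample rate, so the server can scale counters back up by 1/rate. A minimal sketch (hypothetical helper, not the library API):

```python
import random

def sampled_counter_line(name, value, sample_rate, rng=random.random):
    """Return the wire line for a sampled counter, or None if dropped.

    With sample_rate=0.1 only ~10% of points hit the wire; the |@0.1
    suffix tells the server to multiply the count by 10 on aggregation.
    """
    if sample_rate < 1 and rng() > sample_rate:
        return None  # dropped client-side: nothing is sent
    line = f"{name}:{value}|c"
    if sample_rate < 1:
        line += f"|@{sample_rate}"
    return line
```

At sample_rate=0.1 this cuts both the packet count and the agent's ingestion load by roughly 10x, at the cost of some statistical noise on the counter.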

omerh commented 5 years ago

We have the same issue: agent memory reaches 1GiB per agent per host. We're running agent 6.8.3 as a DaemonSet, and our statsd metrics are shipped via the service FQDN (e.g. dd-agent.default.svc.cluster.local).