volfco opened this issue 6 years ago
Hi @volfco
We have identified this high CPU/memory usage as occurring when DogStatsD has to process/serialize a lot of contexts, and we are working on it. Are you using high-cardinality tags in your custom metrics?
To see the metrics coming in, you can run the agent with log_level: trace; each packet will then be printed like this:
2018-06-19 08:22:35 UTC | TRACE | (server.go:193 in worker) | Dogstatsd receive: custom.metric.name:1|c
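For reference, the payload in that trace line is just the plain-text DogStatsD datagram format. A minimal sketch of emitting one by hand, assuming the agent is listening on the default UDP port 8125 (the metric name is taken from the trace output above):

```python
# Minimal sketch: hand-craft one DogStatsD datagram matching the trace line.
# Assumes the agent listens on the default UDP port 8125.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# "name:value|type" -- here a counter increment of 1.
sock.sendto(b"custom.metric.name:1|c", ("127.0.0.1", 8125))
sock.close()
```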
To investigate which subsystem is actually using that much CPU, we can dump a pprof profile with curl http://localhost:5000/debug/pprof/profile > cpu.prof. I recommend you submit it via a support ticket, as it might contain sensitive information.
Regards
After I posted yesterday, I ran netcat to dump the metrics. It's a combination of some high-cardinality metrics, plus a common library that was emitting a handful of metrics at a very high volume.
@xvello What's your definition of high-cardinality tags? I think for us, the major source here is a large number of very similar metrics. I've also opened a support case, #152005, with a profile of a box that's running at around 600% CPU.
If the metric name is the same but one tag has many values (a usual trap is adding a user_id, for example), every value of that tag will create a different context. This is what we call high cardinality. Adding that many contexts can make the payload size balloon.
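A small illustration of the effect, with a made-up metric name and tag values: the metric name stays constant, but every distinct user_id value yields a distinct name-plus-tags combination, i.e. a new context.

```python
# Illustration: one metric name, but a high-cardinality tag (user_id) creates
# one context per distinct tag value. Names and values here are made up.
datagrams = [
    f"checkout.completed:1|c|#env:prod,user_id:{uid}".encode()
    for uid in range(10_000)
]

# Each unique (name, tag set) pair is a separate context the agent must track.
contexts = {d.split(b":", 1)[0] + b"|" + d.split(b"#", 1)[1] for d in datagrams}
print(len(contexts))  # 10000 contexts for a single metric name
```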
Hey @volfco,
I had a quick look at your dump, and the biggest part of the CPU usage is not in the serialization subsystem (which is what we observe when cardinality is the issue), but in the syscall portion that reads the UDP packets.
Someone from support will investigate this with you. My first guess is that one of the clients is not using client-side buffering. Client-side buffering makes it possible to send several metrics per packet, which greatly reduces the load on both the client and DSD. Please refer to each library's documentation on how to enable buffering.
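For illustration only, here is a rough sketch of what buffering means at the datagram level, using a raw socket rather than any specific client library: several newline-separated metrics are packed into a single UDP packet instead of one packet per metric. The 1432-byte limit and the metric names are assumptions for the example, not values mandated by the protocol.

```python
# Sketch of client-side buffering at the wire level (not a specific client
# library): pack several newline-separated metrics into one UDP datagram
# instead of sending one datagram per metric.
import socket

MAX_PAYLOAD = 1432  # conservative size that fits a typical Ethernet MTU

def send_buffered(metrics, host="127.0.0.1", port=8125):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    buf = []
    size = 0
    for m in metrics:
        line = m.encode()
        # Flush when adding this line (plus a newline) would exceed the limit.
        if buf and size + len(line) + 1 > MAX_PAYLOAD:
            sock.sendto(b"\n".join(buf), (host, port))
            buf, size = [], 0
        buf.append(line)
        size += len(line) + 1
    if buf:
        sock.sendto(b"\n".join(buf), (host, port))
    sock.close()

# Ten counter increments end up in one packet instead of ten.
send_buffered([f"app.requests:1|c|#handler:h{i}" for i in range(10)])
```

The official client libraries already implement this; the point of the sketch is only to show why fewer, larger packets translate into fewer reads on the agent side.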
@xvello Thanks for the info. I've been digging into the kernel side and I do think we're just dumping way too much data into the kernel, overflowing the buffers. That's a different team.
I'll run with support, but I'll assume that the high CPU usage is expected given the volume of metrics, and tell the team that maintains our dogstats wrapper that they need to start buffering.
@volfco Indeed, this DSD instance is ingesting 250k points per second, which explains the high CPU usage. Buffering will help by reducing the number of reads it has to do.
For counters and histograms, DSD also supports sampling to reduce the number of points to transmit.
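A hedged sketch of what client-side sampling looks like on the wire: the client sends only a fraction of the calls and annotates the datagram with the sample rate so the agent can scale the count back up. The metric name, rate, and helper name below are illustrative, not part of any particular client library.

```python
# Sketch of client-side sampling: send only a fraction of increments, and
# include the @rate field so the agent scales the value back up (1/rate).
import random
import socket

def increment_sampled(name, rate=0.1, host="127.0.0.1", port=8125):
    if random.random() >= rate:
        return  # drop roughly (1 - rate) of the calls client-side
    payload = f"{name}:1|c|@{rate}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()

# Roughly one packet per ten calls, each reported with @0.1.
for _ in range(1000):
    increment_sampled("app.cache.hit", rate=0.1)
```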
We have the same issue. Agent memory reaches 1 GiB per agent per host. We are running agent 6.8.3 as a DaemonSet, and our statsd metrics are being shipped via the service FQDN (e.g. dd-agent.default.svc.cluster.local).
I'm running the datadog agent (v6.1.4) on a box where applications emit a lot of metrics to dogstatsd. Below is the CPU usage graph.
Yep. Yikes (that's 500 to 800%). I'm confident this is caused by the volume of dogstats metrics emitted by an application. The problem is that I don't know what is producing that high volume of emissions, and I can't see an easy way of getting this information from the agent.
Is there a way to see the top metrics emitted over the past minute or two? I'd love to figure out what's causing the massive consumption of resources (a rough sketch of one workaround follows at the end of this report).
Additional environment details (Operating System, Cloud provider, etc.): CentOS 7, physical hardware
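For anyone landing here with the same question: one quick-and-dirty way to eyeball the top emitters is a throwaway UDP listener that tallies metric names for a minute, run either on a spare port that a test client points at, or on 8125 while the agent is briefly stopped. This is only a sketch, not an agent feature (the agent-side option is the log_level: trace approach mentioned above).

```python
# Throwaway sketch: listen on a UDP port and tally metric names for ~60s to
# find the top emitters. Run on a spare port, or on 8125 with the agent
# stopped, so it does not fight the agent for the socket.
import collections
import socket
import time

PORT = 8125  # assumes the default DogStatsD port
counts = collections.Counter()

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PORT))
sock.settimeout(1.0)

deadline = time.time() + 60
while time.time() < deadline:
    try:
        data, _ = sock.recvfrom(65535)
    except socket.timeout:
        continue
    # A datagram may hold several newline-separated metrics; the name is
    # everything before the first ':'.
    for line in data.splitlines():
        counts[line.split(b":", 1)[0]] += 1

for name, n in counts.most_common(20):
    print(n, name.decode(errors="replace"))
sock.close()
```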