Open · SleepyBrett opened this issue 6 years ago
Hi @SleepyBrett Thanks for reaching out. I'll try to address all of your points, let me know if I missed some.
We find that rightsizing the dd agent daemonset is impossible for any cluster with any significant workload if you turn on ksm autodiscovery. This is because one agent (or multiple, if you shard your ksm by collector) has much more work to do (scraping ksm) than the rest, which only collect container/node metrics.
We're aware of this issue, and we recommend sharding ksm per namespace and keeping namespaces small. This isn't just to make the agent happy; we've also observed that it improves performance in large k8s clusters.
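For illustration, a minimal sketch of one per-namespace shard (the namespace `team-a`, image version, and flag spelling are only examples; check them against your kube-state-metrics version):

```yaml
# One kube-state-metrics Deployment per namespace (or small group of namespaces).
# --namespaces limits what this instance watches, so the agent scraping it
# only carries that slice of the load.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics-team-a
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
      shard: team-a
  template:
    metadata:
      labels:
        app: kube-state-metrics
        shard: team-a
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: quay.io/coreos/kube-state-metrics:v1.5.0  # illustrative version
          args:
            - --namespaces=team-a   # hypothetical namespace; repeat the Deployment per shard
          ports:
            - containerPort: 8080
              name: http-metrics
```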
I imagine this might also be a problem if I turned on event collection; the "leader" would also have much higher cpu/memory usage than the other nodes.
You're right, although we're moving event collection to the cluster agent (not GA yet, but soon to be) so the issue will go away.
Create two ksm pods: one set up to run only the pod collector, and the other to run all the other collectors
This sharding is also the one we started with internally, but it doesn't solve the problem for two reasons:
If splitting by namespace is not an option for you, splitting by collector is still your best bet for now. In our case we experimented with also splitting out configmaps, endpoints, and services to smooth the load some more, but that depends on your workload, YMMV. You could also disable the collectors whose metrics the agent doesn't collect. Unfortunately pods remain the main issue.
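As a rough illustration of that split (flag names assume kube-state-metrics 1.x, and the collector list is only an example to trim to your own needs):

```yaml
# Shard 1: only the pod collector, by far the heaviest on busy clusters.
args:
  - --collectors=pods
---
# Shard 2: everything else you actually use; drop collectors whose metrics
# the agent never forwards anyway.
args:
  - --collectors=nodes,deployments,replicasets,daemonsets,statefulsets,services,endpoints,configmaps
```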
Can I just step back a moment and ask: "true/false" or "yes/no"? Maybe they are interchangeable; it's not at all clear...
They are interchangeable in the config, but yes/no don't work well in env variables, so we're consolidating to true/false. See: https://github.com/DataDog/datadog-agent/pull/2171
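For example (illustrative only, using the usual DD_&lt;option&gt; mapping for env vars):

```yaml
# In the container spec: stick to true/false for DD_* boolean env vars.
env:
  - name: DD_LEADER_ELECTION
    value: "false"
  - name: DD_COLLECT_KUBERNETES_EVENTS
    value: "false"
```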
and then mounting [a datadog.yaml with empty config_providers]
This is a good idea, but the file config provider is initialized anyway, because the agent is supposed to run some checks by default. You can disable them by mounting an empty volume in place of /etc/datadog-agent/conf.d/ to remove the default check configs.
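A minimal sketch of that trick in a pod spec (paths as in the Agent 6 image; adapt to your manifest):

```yaml
containers:
  - name: datadog-agent
    image: datadog/agent:6
    volumeMounts:
      - name: empty-confd
        mountPath: /etc/datadog-agent/conf.d   # hides the default check configs shipped in the image
volumes:
  - name: empty-confd
    emptyDir: {}
```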
Again, we don't recommend going the sidecar route; the agent is not designed for this and things might break. Sharding ksm by namespace is more scalable in the long run. But if you're doing it anyway, you may want to disable host metadata collection to avoid weird host duplication issues in the app: https://github.com/DataDog/datadog-agent/blob/aa3fd27e8c7351b19f243f3e2cca7498d96aa690/cmd/agent/app/start.go#L220-L228 (set DD_ENABLE_METADATA_COLLECTION to false).
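In a pod spec that would look something like this (sketch only):

```yaml
env:
  - name: DD_ENABLE_METADATA_COLLECTION
    value: "false"   # skip host metadata so sidecar agents don't show up as duplicate hosts
```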
One last point that will help some more soon: we're working on a revamp of OpenMetrics parsing which shows promising results, performance-wise. You can expect the load of ksm parsing to drop in an upcoming release.
Hope that helps.
We're aware of this issue, and we recommend sharding ksm per namespace and keeping namespaces small. This isn't just to make the agent happy; we've also observed that it improves performance in large k8s clusters.
As the manager of several multi-tenant clusters I can't see this as a realistic strategy. I have dozens of namespaces across 6-10 clusters, with more added every day. I'd like to see a critique of why my solution of sidecaring a special agent w/ ksm is not a valid strategy. As a little side quest, I managed to get this working fairly cleanly with Veneur; however, it does not do the transformation of the KSM statistics like your agent does.
I'll be trying this Veneur config today w/ my largest cluster to see if it can hold up under the load caused by the ksms on that cluster (which is still very modestly sized by kube deployment standards).
I'd suggest your engineering team go back to the drawing board with the 'shard ksm per namespace' strategy unless they plan to write an operator to handle that work. Even if they do, I imagine that pretty quickly we'd be looking at problems with the amount of load all those KSMs will put on the kube-apiserver.
Again, we don't recommend going the sidecar route; the agent is not designed for this and things might break.
Guess what? It's already broken in your "preferred configuration" based on both your documentation and your helm charts even on modest clusters ( ~75 nodes, ~7500 pods ).
It is very disappointing that there isn't a way to strip the magic out of your agent and provide the ability to essentially opt into any collector I'd like to use without 1) stamping out a directory or 2) modifying your agent source code (?!?).
Veneur holds up against the ksm load on the ~75 node / 7.5k+ pod cluster without issue (we see occasional request canceled (client timeout) errors from your api endpoint, but retries succeed, and no metric continuity errors in our test sub-org). It's not doing the transforms, of course; we are now evaluating the transforms in depth.
Output of the info page (if this is a bug)
Describe what happened: I would like to configure a number of agents on k8s to ONLY scrape ksm.
We find that rightsizing the dd agent daemonset is impossible for any cluster with any significant workload if you turn on ksm autodiscovery. This is because one agent (or multiple, if you shard your ksm by collector) has much more work to do (scraping ksm) than the rest, which only collect container/node metrics.
I imagine this might also be a problem if I turned on event collection; the "leader" would also have much higher cpu/memory usage than the other nodes.
To that end I am attempting the following:
To that end I'm passing the following env variables to those dd containers:
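Roughly along these lines (a hypothetical sketch following the standard DD_&lt;option&gt; mapping, not the exact list used here):

```yaml
env:
  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secret   # hypothetical secret name
        key: api-key
  - name: DD_LEADER_ELECTION
    value: "false"
  - name: DD_COLLECT_KUBERNETES_EVENTS
    value: "false"
  - name: DD_ENABLE_METADATA_COLLECTION
    value: "false"
  - name: DD_LOGS_ENABLED
    value: "false"
  - name: DD_APM_ENABLED
    value: "false"
  - name: DD_PROCESS_AGENT_ENABLED
    value: "false"
```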
Can I just step back a moment and ask: "true/false" or "yes/no"? Maybe they are interchangeable; it's not at all clear...
and then mounting the following datadog.yaml into /etc/datadog-agent/
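Something along these lines (a sketch, not the exact file):

```yaml
# datadog.yaml sketch: empty out the autodiscovery config providers so the
# agent doesn't schedule checks on its own.
config_providers: []
```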
and then mounting the following auto_conf.yaml into /conf.d
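And for the check itself, the usual shape is a static kubernetes_state config pointing at the local ksm (again a sketch, not the exact file):

```yaml
# e.g. mounted as /etc/datadog-agent/conf.d/kubernetes_state.d/conf.yaml
init_config:

instances:
  - kube_state_url: http://127.0.0.1:8080/metrics
```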
At this point I expect that I have told the agent to DO NOTHING except scrape 127.0.0.1:8080/metrics and ship the results. However, when I jump into that sidecar and run:
So it looks like I still have several collectors running and some crashing... and it's not at all clear if the kubernetes_state "job" is even running.
Because the documentation isn't super clear (and is often telling me how to configure things in agent 5.x) I started digging into the agent code.
The way it's configured is very confusing. It seems to me that the following things are happening:
1) s6 is used to start the agent and may do some things re: config before the agent even starts; this is not at all clear to me and I've chosen to mostly ignore it, though I'm not even sure why you would use s6 in a containerized env, philosophically.
1b) At some point s6 starts running things in /etc/cont-init.d. These files start shuffling things around in your config dirs based on env variables/files on the filesystem/voodoo magic.
2) Now the agent starts and it does yet more "magic"; I think most of this magic is constrained to ./pkg/config/ but I can't be sure. Again you seem to be starting things based on some combination of env variables, files on the filesystem, etc. There seems to be some backwards compatibility built in (`/etc/dd-agent/`).
...
All this is to say that, in an effort to be magical, the agent has become very hard for someone who doesn't happen to be a Datadog engineer to configure by hand when that is the appropriate thing to do.
Describe what you expected: I expect ONLY the ksm metrics to be shipped from this sidecar container
Steps to reproduce the issue:
Additional environment details (Operating System, Cloud provider, etc):