DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

Provide a way to enable some agents at the nodegroup level #673

Open mrzor opened 2 years ago

mrzor commented 2 years ago

Describe what happened:

Different agents bundled in the Daemonset are billed independently and at varying prices. It is sometimes desirable to turn them on for only a subset of nodes rather than for all nodes.

Describe what you expected:

I expected to be able to have finer grained controls over which agents run where.

In this specific instance, possibly having several Daemonsets on which tolerations could be set, instead of one big Daemonset that contains all the different agents which can either be all-on or all-off.

Alternative solutions considered:

  1. Solutions similar in spirit to https://github.com/danfromtitan/envars-from-node-labels would provide a way to have node-specific configuration for the agents, which would in theory make it possible to turn off, e.g., APM agents on some nodegroups. The usual problems related to mutating admission controllers apply here.
  2. Deploying the full datadog chart N times with mutually exclusive tolerations set-up is also something we're considering, but it smells like there are dragons that way.

Both of the above solutions require some effort to implement and more effort to maintain.
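
To make alternative 2 more concrete, here is a rough sketch of generating one values override per nodegroup. It is only an illustration: the `nodegroup` node label, the release names, and the choice of which release owns the cluster-agent are all hypothetical, and the keys used (`agents.nodeSelector`, `datadog.apm.portEnabled`, `clusterAgent.enabled`) are assumptions that should be checked against the chart's `values.yaml`. It uses a node selector rather than tolerations, since tolerations alone only permit scheduling on tainted nodes and do not exclude the others. How the other releases would connect to the single cluster-agent is not covered here.

```python
import json


def values_for_nodegroup(label_value, apm_enabled, owns_cluster_agent):
    """Build a values override for one release pinned to one nodegroup."""
    return {
        "agents": {
            # Hypothetical node label; restricts this release's DaemonSet
            # to the nodes of a single nodegroup.
            "nodeSelector": {"nodegroup": label_value},
        },
        # Assumed chart key for toggling APM on this release only.
        "datadog": {"apm": {"portEnabled": apm_enabled}},
        # Only one release should deploy the cluster-agent.
        "clusterAgent": {"enabled": owns_cluster_agent},
    }


def build_all(nodegroups):
    """Map release name -> values; the first nodegroup (sorted) owns the cluster-agent."""
    names = sorted(nodegroups)
    return {
        f"datadog-{name}": values_for_nodegroup(
            name, nodegroups[name], owns_cluster_agent=(index == 0)
        )
        for index, name in enumerate(names)
    }


if __name__ == "__main__":
    # Hypothetical nodegroups: APM on for "apm" nodes, off for "batch" nodes.
    for release, values in build_all({"apm": True, "batch": False}).items():
        print(release)
        print(json.dumps(values, indent=2))
```

Each resulting dictionary could be written out as a values file and passed to a separate release of the chart.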

vboulineau commented 2 years ago

Hello @mrzor,

It's something we want to add to our Operator (not there yet), but I'm not sure we'll be able to provide something really useful with our Helm chart.

Would you consider using an Operator once the feature is available? https://github.com/DataDog/datadog-operator

cc @CharlyF

mrzor commented 2 years ago

I find the operator project quite interesting after just skimming the README file. I don't see ourselves deploying it anytime soon, even if the desired feature were provided by it. Such an endeavor would probably require more effort than patching the chart to achieve our desired outcome, or vetting the envars-from-node-labels project. Notably, migrating an existing Helm-based install to an operator-based one doesn't seem covered by the docs, and the team would probably vote to wait for someone else to walk down that road and document it first.

If you were to consider a Helm-chart upgrade to solve this, which way looks the most promising?

ian-axelrod commented 2 years ago

@mrzor I also desire a solution to this. My thought is that your alternative solution #2 is the way to go, but you do mention there could be issues. Is this just a feeling, or do you have a reason for your concern?

mrzor commented 2 years ago

@ian-axelrod Let's say I'm 75% confident in the following:

  1. if for any reason you end up with more than one datadog pod per node, you will have duplicated telemetry and probably some very weird metrics and/or autodiscovered checks
  2. there should only be one set of cluster-agents - that's just a landmine to circumvent using proper values
  3. more surprises like the above?
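
The first two concerns can be checked mechanically before deploying. The following is a minimal sketch, not part of the chart or of any Datadog tooling: the release descriptions (`node_selector`, `cluster_agent`) and the node label data are hypothetical inputs you would have to collect yourself. It flags any node that would be selected by more than one release, and any setup where the number of releases deploying cluster-agents is not exactly one.

```python
def selector_matches(selector, node_labels):
    """A selector matches when every key/value pair is present on the node.

    An empty selector matches every node, as with a Kubernetes nodeSelector.
    """
    return all(node_labels.get(key) == value for key, value in selector.items())


def check_releases(releases, nodes):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Concern 1: more than one agent pod on the same node.
    for node_name, labels in nodes.items():
        owners = sorted(
            name
            for name, config in releases.items()
            if selector_matches(config["node_selector"], labels)
        )
        if len(owners) > 1:
            problems.append(f"node {node_name} matched by multiple releases: {owners}")

    # Concern 2: exactly one release should deploy the cluster-agents.
    cluster_agent_owners = sorted(
        name for name, config in releases.items() if config["cluster_agent"]
    )
    if len(cluster_agent_owners) != 1:
        problems.append(
            f"expected exactly one cluster-agent owner, found {cluster_agent_owners}"
        )

    return problems
```

Nodes matched by no release are not reported, since leaving some nodes without an agent may be intentional.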

levan-m commented 2 months ago

Operator v1.5.0 introduced Datadog Agent Profiles (in beta), which would address this use case: https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md

Robust profile management requires runtime changes, which aren't feasible when managing Agents using Helm, so we aren't considering this path. Let us know if there are any reservations about migrating to the Operator.