DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

[charts/datadog] Can't easily override cluster-agent-confd #657

Closed ivankatliarchuk closed 2 months ago

ivankatliarchuk commented 2 years ago

Describe what happened: We currently use the datadog helm chart as a dependency, e.g. in Chart.yaml:

dependencies:
- name: datadog
  version: 2.35.3
  repository: https://helm.datadoghq.com

and in our values.yaml the confd configuration looks similar to:

datadog:
  clusterAgent:
    confd:
      postgres.yaml: |-
        cluster_check: true
        instances:
          - server: RDS_ENDPOINT.rds.amazonaws.com
            port: 5432
            user: datadog
            pass: ENC[DATADOG_RDS_PWD]
      kube_apiserver_metrics.yaml: |-
        cluster_check: true
        init_config:
        instances:
          - prometheus_url: KUBE_MASTER_ENDPOINT/metrics
      kafka_consumer_1.yaml: |-
        cluster_check: true
        init_config:
          max_partition_contexts: 2000
        instances:
          - kafka_connect_str: MSK_SHARED_BOOTSTRAP_BROKERS_1
            security_protocol: SSL
            monitor_unlisted_consumer_groups: true
            monitor_all_broker_highwatermarks: true
            tags:
              - cluster_name:CLUSTER_ENV-some-other-name
      kafka_consumer_2.yaml: |-
        cluster_check: true
        init_config:
          max_partition_contexts: 2000
        instances:
          - kafka_connect_str: MSK_SHARED_BOOTSTRAP_BROKERS_2
            security_protocol: SSL
            monitor_unlisted_consumer_groups: true
            monitor_all_broker_highwatermarks: true
            tags:
              - cluster_name:CLUSTER_ENV-some-other-name

This is just an example; we have many more services configured. At first glance, everything can be substituted. However, the devil is in the details. In our case, every environment has different services available, and all the values are derived from the live environment before helm install/upgrade. More importantly, one environment may have 5 different services while another has 3. For example, sandbox has 0 kafka_consumers, dev has 2, and prod has 6.

We do have a default values.yaml as well as dev.yaml, sandbox.yaml, etc. But these static files do not work, as we need to add and remove this configuration from confd dynamically.

We tried kustomize, but it did not work well. There are probably other hacks. In the meantime, the current process looks like this:

  1. values.yaml contains all the confd configuration for all the environments
  2. The CI system pulls all the dynamic values from the environment
  3. The CI system uses sed to substitute some of the values in values.yaml
  4. The CI system uses yq to remove entries from clusterAgent.confd, e.g. yq -yi "del(.datadog.clusterAgent.confd[\"kafka_consumer_1.yaml\"])" values.yaml
  5. The CI system deploys the chart
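
As a sketch only, step 4 might be expressed without yq. Helm's documented values merging deletes a key when a later values file sets it to null, so the CI system could generate a small per-environment override file instead of editing values.yaml in place. The file name sandbox-overrides.yaml is hypothetical, and whether null removal reaches a subchart's values as expected is worth confirming with helm template on the Helm version in use.

```yaml
# sandbox-overrides.yaml (hypothetical file, generated by CI)
# Passed to helm with -f, after any other values files.
datadog:
  clusterAgent:
    confd:
      # null asks Helm to delete the key inherited from values.yaml,
      # so sandbox ends up with no kafka_consumer checks.
      kafka_consumer_1.yaml: null
      kafka_consumer_2.yaml: null
```

This keeps the removal inside Helm's own merge logic, but the list of entries to null out still has to be computed per environment by the CI system.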

This approach is not ideal because it layers extra tooling (sed and yq) on top of helm. More importantly, it is very easy to introduce a bug: sometimes new services are added but never added to the datadog agent config, or, more often, a service is no longer running but its entry is not removed from the config. As a result, we get a fair number of error logs saying that the service was not found, or similar. Of course, there are multiple ways to approach the problem. In the past we used an in-house helm chart, but the upstream official chart seems the way forward for us.

Describe what you expected: To be able to override .Values.clusterAgent.confd at helm deployment time. How this could be done:

Currently the confd is set directly here.

data:
{{- if .Values.clusterAgent.confd }}
{{ tpl (toYaml .Values.clusterAgent.confd) . | indent 2 }}
{{- end }}

The proposal is to move all references to .Values.clusterAgent.confd into helpers.tpl, so that a parent chart only needs to override one function instead of overriding the whole template or finding a way to turn config entries on and off. For example,

a define in the original charts/datadog/helpers.tpl

{{- define "datadog.clusterAgentConfD" -}}

would let us override it in the top-level chart, in mychart/helpers.tpl

{{- define "datadog.clusterAgentConfD" -}}

The override function can have more verbose, templated logic than the original chart. It should also let us validate confd at deploy time within the helm chart, e.g. fail if a value is not set or has an incorrect format.
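
To make the idea concrete, here is a minimal sketch of what such an override might look like. It rests on several assumptions: the helper name datadog.clusterAgentConfD is taken from the proposal above and is not confirmed to exist in the chart, the chart's configmap would have to render its data through that helper, and the parent chart's define would have to take precedence over the subchart's, which should be verified before relying on it. The validation rule is purely illustrative.

```yaml
{{/*
mychart helpers file (hypothetical override of the proposed helper).
Assumes the datadog chart would render its ConfigMap data with something like:
  {{ include "datadog.clusterAgentConfD" . | indent 2 }}
*/}}
{{- define "datadog.clusterAgentConfD" -}}
{{- range $name, $body := .Values.clusterAgent.confd }}
{{- /* Illustrative validation: abort the render if an entry is not a string. */}}
{{- if not (kindIs "string" $body) }}
{{- fail (printf "clusterAgent.confd[%q] must be a string" $name) }}
{{- end }}
{{ $name }}: |-
{{ tpl $body $ | indent 2 }}
{{- end }}
{{- end -}}
```

Because the helper would be included from the datadog chart's own template, .Values here refers to the subchart's values, i.e. what the parent chart sets under the datadog: key.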

Additional environment details (Operating System, Cloud provider, etc): AWS

Is this a feature you are interested in implementing yourself?

I can

ivankatliarchuk commented 2 months ago

Closing, as this is currently supported with tpl here: https://github.com/DataDog/helm-charts/blob/3097984ae336c57ada9a41b2e7a9032c58b99440/charts/datadog/templates/cluster-agent-confd-configmap.yaml#L13
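
As a rough illustration of what the tpl support allows: since the configmap passes the confd values through tpl, a confd entry can carry template expressions that are resolved at render time, which could stand in for the sed substitution step. The global.rdsEndpoint key below is a hypothetical name chosen for this sketch; global is used because Helm makes global values visible to subcharts as .Values.global.

```yaml
# mychart/values.yaml (sketch; global.rdsEndpoint is a hypothetical key)
global:
  rdsEndpoint: ""   # supplied per environment, e.g. with --set global.rdsEndpoint=...

datadog:
  clusterAgent:
    confd:
      postgres.yaml: |-
        cluster_check: true
        instances:
          # Resolved by tpl when the datadog chart renders the ConfigMap.
          - server: {{ .Values.global.rdsEndpoint }}
            port: 5432
            user: datadog
            pass: ENC[DATADOG_RDS_PWD]
```

This covers value substitution only; adding or removing whole entries per environment still has to happen in the values passed to helm.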

Having a function in helpers is a great addition do