DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

Datadog-Agent service isn't being created and thus APM isn't working... #527

Closed: jurschel closed this issue 2 years ago

jurschel commented 2 years ago

Describe what happened: Installed the latest Helm chart and APM will not function. I only have the Helm-deployed datadog-agent, with no host-level datadog-agent installed.

Describe what you expected: I install the helm chart with the appropriate APM settings and then APM works.

Steps to reproduce the issue:
1. Ensure no host datadog-agent is installed on the node.
2. Install the current Helm chart with APM settings enabled.
3. Run kubectl get service -A and see that no datadog-agent service exists for traceport 8126 traffic to flow to.

Additional environment details (Operating System, Cloud provider, etc.): Bare-metal Ubuntu 20.04, Datadog Agent 7.33.0, Helm chart 2.30.0

jurschel commented 2 years ago

I would point you to this spot in the chart, where the service port is not named correctly: https://github.com/DataDog/helm-charts/blob/24a79128ffb67a5fd48299ad153ed1f879db4167/charts/datadog/templates/agent-services.yaml#L93 This should be traceport if you want it to actually work. Some documentation about this setting would also be nice, since it is why the service wasn't being created for me automatically the way the daemonset is: https://github.com/DataDog/helm-charts/blob/24a79128ffb67a5fd48299ad153ed1f879db4167/charts/datadog/values.yaml#L1228
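
For reference, a minimal sketch of the values keys that govern this local service; the key names are taken from the chart's own values.yaml (they also appear further down in this thread), but whether this is the right fix for a given cluster is an assumption:

agents:
  localService:
    # Force creation of the internal traffic policy Service even on Kubernetes 1.21,
    # where internalTrafficPolicy is still alpha and requires its feature gate to be enabled.
    forceLocalServiceEnabled: true
    # Optionally override the Service name that tracers will target.
    overrideName: ""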

jurschel commented 2 years ago

Until you get the chart fixed: if anyone else is having this problem, you can pull the agent-services.yaml out of the rendered daemonset YAML, change the name of the APM service port to traceport, and apply it, and things will start working:

# Source: datadog/templates/agent-services.yaml
apiVersion: v1
kind: Service
metadata:
  name: datadog-agent
  namespace: default
  labels:
    app: "datadog-agent"
    chart: "datadog-2.28.13"
    release: "datadog-agent"
    heritage: "Helm"
spec:
  selector:
    app: datadog-agent
  ports:
    - protocol: UDP
      port: 8125
      targetPort: 8125
      name: dogstatsd
    - protocol: TCP
      port: 8126
      targetPort: 8126
      name: traceport
  internalTrafficPolicy: Local
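
With that Service applied, an application pod can point its tracer at it by name. A minimal, illustrative sketch (the pod name and image are placeholders; the Service name and namespace are taken from the manifest above). Note that internalTrafficPolicy: Local is alpha on Kubernetes 1.21 and needs the ServiceInternalTrafficPolicy feature gate enabled:

apiVersion: v1
kind: Pod
metadata:
  name: traced-app          # hypothetical example pod
  namespace: default
spec:
  containers:
    - name: app
      image: my-app:latest  # placeholder image
      env:
        # Point the tracer at the workaround Service created above.
        - name: DD_AGENT_HOST
          value: datadog-agent
        - name: DD_TRACE_AGENT_PORT
          value: "8126"
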
jurschel commented 2 years ago

@clamoriniere

clamoriniere commented 2 years ago

Hi @jurschel, on which Kubernetes version are you trying to install the Agent? The minimum Kubernetes version for the service to be created is 1.22.0+. If you are using an older Kubernetes version, you can configure the Agent to expose the APM service using a hostPort; see: https://github.com/DataDog/helm-charts/tree/24a79128ffb67a5fd48299ad153ed1f879db4167/charts/datadog#enabling-apm-and-tracing

Could you share the values.yaml used with the helm install command?

Thanks
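
For the hostPort route on Kubernetes < 1.22, the client-side pattern described in the linked README is to resolve the node IP at runtime. A minimal, illustrative sketch, assuming datadog.apm.portEnabled: true (hostPort 8126); the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: traced-app          # hypothetical example pod
spec:
  containers:
    - name: app
      image: my-app:latest  # placeholder image
      env:
        # Send traces to the node-local Agent listening on hostPort 8126.
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: DD_TRACE_AGENT_PORT
          value: "8126"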

jurschel commented 2 years ago

helm install datadog-agent -f values.yaml --set datadog.clusterName='non-prod' --set datadog.site='datadoghq.com' --set datadog.apiKey='xxx' datadog/datadog

Kubernetes Version 1.21.3

## Default values for Datadog Agent
## See Datadog helm documentation to learn more:
## https://docs.datadoghq.com/agent/kubernetes/helm/

# nameOverride -- Override name of app
nameOverride:  # ""

# fullnameOverride -- Override the full qualified app name
fullnameOverride:  # ""

# targetSystem -- Target OS for this deployment (possible values: linux, windows)
targetSystem: "linux"

# registry -- Registry to use for all Agent images (default gcr.io)
## Currently we offer Datadog Agent images on:
## GCR - use gcr.io/datadoghq (default)
## DockerHub - use docker.io/datadog
## AWS - use public.ecr.aws/datadog
registry: gcr.io/datadoghq

datadog:
  # datadog.apiKey -- Your Datadog API key
  # ref: https://app.datadoghq.com/account/settings#agent/kubernetes
  apiKey: <DATADOG_API_KEY>

  # datadog.apiKeyExistingSecret -- Use existing Secret which stores API key instead of creating a new one. The value should be set with the `api-key` key inside the secret.
  ## If set, this parameter takes precedence over "apiKey".
  apiKeyExistingSecret:  # <DATADOG_API_KEY_SECRET>

  # datadog.appKey -- Datadog APP key required to use metricsProvider
  ## If you are using clusterAgent.metricsProvider.enabled = true, you must set
  ## a Datadog application key for read access to your metrics.
  appKey:  # <DATADOG_APP_KEY>

  # datadog.appKeyExistingSecret -- Use existing Secret which stores APP key instead of creating a new one. The value should be set with the `app-key` key inside the secret.
  ## If set, this parameter takes precedence over "appKey".
  appKeyExistingSecret:  # <DATADOG_APP_KEY_SECRET>

  ## Configure the secret backend feature https://docs.datadoghq.com/agent/guide/secrets-management
  ## Examples: https://docs.datadoghq.com/agent/guide/secrets-management/#setup-examples-1
  secretBackend:
    # datadog.secretBackend.command -- Configure the secret backend command, path to the secret backend binary.
    ## Note: If the command value is "/readsecret_multiple_providers.sh" the agents will have permissions to get secret objects.
    ## Read more about "/readsecret_multiple_providers.sh": https://docs.datadoghq.com/agent/guide/secrets-management/#script-for-reading-from-multiple-secret-providers-readsecret_multiple_providerssh
    command:  # "/readsecret.sh" or "/readsecret_multiple_providers.sh" or any custom binary path

    # datadog.secretBackend.arguments -- Configure the secret backend command arguments (space-separated strings).
    arguments:  # "/etc/secret-volume" or any other custom arguments

    # datadog.secretBackend.timeout -- Configure the secret backend command timeout in seconds.
    timeout:  # 30

  # datadog.securityContext -- Allows you to overwrite the default PodSecurityContext on the Daemonset or Deployment
  securityContext: {}
  #  seLinuxOptions:
  #    user: "system_u"
  #    role: "system_r"
  #    type: "spc_t"
  #    level: "s0"

  # datadog.hostVolumeMountPropagation -- Allow to specify the `mountPropagation` value on all volumeMounts using HostPath
  ## ref: https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation
  hostVolumeMountPropagation: None

  # datadog.clusterName -- Set a unique cluster name to allow scoping hosts and Cluster Checks easily
  ## The name must be unique and must be dot-separated tokens with the following restrictions:
  ## * Lowercase letters, numbers, and hyphens only.
  ## * Must start with a letter.
  ## * Must end with a number or a letter.
  ## * Overall length should not be higher than 80 characters.
  ## Unlike GKE cluster names, dots are allowed here; they are not allowed on GKE:
  ## https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters#Cluster.FIELDS.name
  clusterName:  # <CLUSTER_NAME>

  # datadog.site -- The site of the Datadog intake to send Agent data to
  ## Set to 'datadoghq.eu' to send data to the EU site.
  site:  # datadoghq.com

  # datadog.dd_url -- The host of the Datadog intake server to send Agent data to, only set this option if you need the Agent to send data to a custom URL
  ## Overrides the site setting defined in "site".
  dd_url:  # https://app.datadoghq.com

  # datadog.logLevel -- Set logging verbosity, valid log levels are: trace, debug, info, warn, error, critical, off
  logLevel: INFO

  # datadog.kubeStateMetricsEnabled -- If true, deploys the kube-state-metrics deployment
  ## ref: https://github.com/kubernetes/kube-state-metrics/tree/kube-state-metrics-helm-chart-2.13.2/charts/kube-state-metrics
  kubeStateMetricsEnabled: true

  kubeStateMetricsNetworkPolicy:
    # datadog.kubeStateMetricsNetworkPolicy.create -- If true, create a NetworkPolicy for kube state metrics
    create: false

  kubeStateMetricsCore:
    # datadog.kubeStateMetricsCore.enabled -- Enable the kubernetes_state_core check in the Cluster Agent (Requires Cluster Agent 1.12.0+)
    ## ref: https://docs.datadoghq.com/integrations/kubernetes_state_core
    enabled: true 

    # datadog.kubeStateMetricsCore.ignoreLegacyKSMCheck -- Disable the auto-configuration of legacy kubernetes_state check (taken into account only when datadog.kubeStateMetricsCore.enabled is true)
    ## Disabling this field is not recommended as it results in enabling both checks; it can, however, be useful during the migration phase.
    ## Migration guide: https://docs.datadoghq.com/integrations/kubernetes_state_core/?tab=helm#migration-from-kubernetes_state-to-kubernetes_state_core
    ignoreLegacyKSMCheck: true

    # datadog.kubeStateMetricsCore.collectSecretMetrics -- Enable watching secret objects and collecting their corresponding metrics kubernetes_state.secret.*
    ## Configuring this field will change the default kubernetes_state_core check configuration and the RBACs granted to Datadog Cluster Agent to run the kubernetes_state_core check.
    collectSecretMetrics: true

    # datadog.kubeStateMetricsCore.useClusterCheckRunners -- For large clusters where the Kubernetes State Metrics Check Core needs to be distributed on dedicated workers.
    ## Configuring this field will create a separate deployment which will run Cluster Checks, including Kubernetes State Metrics Core.
    ## ref: https://docs.datadoghq.com/agent/cluster_agent/clusterchecksrunner?tab=helm
    useClusterCheckRunners: false

    # datadog.kubeStateMetricsCore.labelsAsTags -- Extra labels to collect from resources and to turn into Datadog tags.
    ## It has the following structure:
    ## labelsAsTags:
    ##   <resource1>:        # can be pod, deployment, node, etc.
    ##     <label1>: <tag1>  # where <label1> is the kubernetes label and <tag1> is the datadog tag
    ##     <label2>: <tag2>
    ##   <resource2>:
    ##     <label3>: <tag3>
    ##
    ## Warning: the label must match the transformation done by kube-state-metrics,
    ## for example tags.datadoghq.com/version becomes label_tags_datadoghq_com_version.
    labelsAsTags:
      pod:
        environment: env 
    #  node:
    #    zone: zone
    #    team: team

  ## Manage Cluster checks feature
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/clusterchecks/
  ## Autodiscovery via Kube Service annotations is automatically enabled
  clusterChecks:
    # datadog.clusterChecks.enabled -- Enable the Cluster Checks feature on both the cluster-agents and the daemonset
    enabled: true

  # datadog.nodeLabelsAsTags -- Provide a mapping of Kubernetes Node Labels to Datadog Tags
  nodeLabelsAsTags: {}
  #   beta.kubernetes.io/instance-type: aws-instance-type
  #   kubernetes.io/role: kube_role
  #   <KUBERNETES_NODE_LABEL>: <DATADOG_TAG_KEY>

  # datadog.podLabelsAsTags -- Provide a mapping of Kubernetes Labels to Datadog Tags
  podLabelsAsTags:
     environment: env
  #   release: helm_release
  #   <KUBERNETES_LABEL>: <DATADOG_TAG_KEY>

  # datadog.podAnnotationsAsTags -- Provide a mapping of Kubernetes Annotations to Datadog Tags
  podAnnotationsAsTags: {}
  #   iam.amazonaws.com/role: kube_iamrole
  #   <KUBERNETES_ANNOTATIONS>: <DATADOG_TAG_KEY>

  # datadog.namespaceLabelsAsTags -- Provide a mapping of Kubernetes Namespace Labels to Datadog Tags
  namespaceLabelsAsTags: {}
  #   env: environment
  #   <KUBERNETES_NAMESPACE_LABEL>: <DATADOG_TAG_KEY>

  # datadog.tags -- List of static tags to attach to every metric, event and service check collected by this Agent.
  ## Learn more about tagging: https://docs.datadoghq.com/tagging/
  tags: []
  #   - "<KEY_1>:<VALUE_1>"
  #   - "<KEY_2>:<VALUE_2>"

  # datadog.checksCardinality -- Sets the tag cardinality for the checks run by the Agent.
  ## https://docs.datadoghq.com/getting_started/tagging/assigning_tags/?tab=containerizedenvironments#environment-variables
  checksCardinality:  # low, orchestrator or high (not set by default to avoid overriding existing DD_CHECKS_TAG_CARDINALITY configurations, the default value in the Agent is low)

  # kubelet configuration
  kubelet:
    # datadog.kubelet.host -- Override kubelet IP
    host:
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    # datadog.kubelet.tlsVerify -- Toggle kubelet TLS verification
    # @default -- true
    tlsVerify:  # false
    # datadog.kubelet.hostCAPath -- Path (on host) where the Kubelet CA certificate is stored
    # @default -- None (no mount from host)
    hostCAPath:
    # datadog.kubelet.agentCAPath -- Path (inside Agent containers) where the Kubelet CA certificate is stored
    # @default -- /var/run/host-kubelet-ca.crt if hostCAPath else /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    agentCAPath:

  # datadog.expvarPort -- Specify the port to expose pprof and expvar on, to not interfere with the agentmetrics port from the cluster-agent, which defaults to 5000
  expvarPort: 6000

  ## dogstatsd configuration
  ## ref: https://docs.datadoghq.com/agent/kubernetes/dogstatsd/
  ## To emit custom metrics from your Kubernetes application, use DogStatsD.
  dogstatsd:
    # datadog.dogstatsd.port -- Override the Agent DogStatsD port
    ## Note: Make sure your client is sending to the same UDP port.
    port: 8125

    # datadog.dogstatsd.originDetection -- Enable origin detection for container tagging
    ## https://docs.datadoghq.com/developers/dogstatsd/unix_socket/#using-origin-detection-for-container-tagging
    originDetection: false

    # datadog.dogstatsd.tags -- List of static tags to attach to every custom metric, event and service check collected by Dogstatsd.
    ## Learn more about tagging: https://docs.datadoghq.com/tagging/
    tags: []
    #   - "<KEY_1>:<VALUE_1>"
    #   - "<KEY_2>:<VALUE_2>"

    # datadog.dogstatsd.tagCardinality -- Sets the tag cardinality relative to the origin detection
    ## https://docs.datadoghq.com/developers/dogstatsd/unix_socket/#using-origin-detection-for-container-tagging
    tagCardinality: low

    # datadog.dogstatsd.useSocketVolume -- Enable dogstatsd over Unix Domain Socket with an HostVolume
    ## ref: https://docs.datadoghq.com/developers/dogstatsd/unix_socket/
    useSocketVolume: true

    # datadog.dogstatsd.socketPath -- Path to the DogStatsD socket
    socketPath: /var/run/datadog/dsd.socket

    # datadog.dogstatsd.hostSocketPath -- Host path to the DogStatsD socket
    hostSocketPath: /var/run/datadog/

    # datadog.dogstatsd.useHostPort -- Sets the hostPort to the same value of the container port
    ## Needs to be used for sending custom metrics.
    ## The ports need to be available on all hosts.
    ##
    ## WARNING: Make sure that hosts using this are properly firewalled otherwise
    ## metrics and traces are accepted from any host able to connect to this host.
    useHostPort: true

    # datadog.dogstatsd.useHostPID -- Run the agent in the host's PID namespace
    ## This is required for Dogstatsd origin detection to work.
    ## See https://docs.datadoghq.com/developers/dogstatsd/unix_socket/
    useHostPID: false

    # datadog.dogstatsd.nonLocalTraffic -- Enable this to make each node accept non-local statsd traffic (from outside of the pod)
    ## ref: https://github.com/DataDog/docker-dd-agent#environment-variables
    nonLocalTraffic: true
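    ## Illustrative only (not part of the original values): with useHostPort enabled above,
    ## a client pod typically resolves the node IP at runtime and sends DogStatsD metrics there,
    ## e.g. (env var names as used by Datadog client libraries, assumed here):
    ##   env:
    ##     - name: DD_AGENT_HOST
    ##       valueFrom:
    ##         fieldRef:
    ##           fieldPath: status.hostIP
    ##     - name: DD_DOGSTATSD_PORT
    ##       value: "8125"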

  # datadog.collectEvents -- Enable this to start event collection from the Kubernetes API
  ## ref: https://docs.datadoghq.com/agent/kubernetes/#event-collection
  collectEvents: true

  # datadog.leaderElection -- Enables leader election mechanism for event collection
  leaderElection: true

  # datadog.leaderLeaseDuration -- Set the lease time for leader election in seconds
  leaderLeaseDuration:  # 60

  ## Enable logs agent and provide custom configs
  logs:
    # datadog.logs.enabled -- Enables this to activate Datadog Agent log collection
    ## ref: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup
    enabled: true

    # datadog.logs.containerCollectAll -- Enable this to allow log collection for all containers
    ## ref: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup
    containerCollectAll: true

    # datadog.logs.containerCollectUsingFiles -- Collect logs from files in /var/log/pods instead of using container runtime API
    ## It's usually the most efficient way of collecting logs.
    ## ref: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup
    containerCollectUsingFiles: true

  ## Enable apm agent and provide custom configs
  apm:
    # datadog.apm.socketEnabled -- Enable APM over Socket (Unix Socket or windows named pipe)
    ## ref: https://docs.datadoghq.com/agent/kubernetes/apm/
    socketEnabled: false

    # datadog.apm.portEnabled -- Enable APM over TCP communication (port 8126 by default)
    ## ref: https://docs.datadoghq.com/agent/kubernetes/apm/
    portEnabled: true 

    # datadog.apm.enabled -- Enable this to enable APM and tracing, on port 8126
    # DEPRECATED. Use datadog.apm.portEnabled instead
    ## ref: https://github.com/DataDog/docker-dd-agent#tracing-from-the-host
    enabled: false

    # datadog.apm.port -- Override the trace Agent port
    ## Note: Make sure your client is sending to the same TCP port.
    port: 8126

    # datadog.apm.useSocketVolume -- Enable APM over Unix Domain Socket
    # DEPRECATED. Use datadog.apm.socketEnabled instead
    ## ref: https://docs.datadoghq.com/agent/kubernetes/apm/
    useSocketVolume: false

    # datadog.apm.socketPath -- Path to the trace-agent socket
    socketPath: /var/run/datadog/apm.socket

    # datadog.apm.hostSocketPath -- Host path to the trace-agent socket
    hostSocketPath: /var/run/datadog/

  # datadog.envFrom -- Set environment variables for all Agents directly from configMaps and/or secrets
  ## envFrom to pass configmaps or secrets as environment
  envFrom: []
  #   - configMapRef:
  #       name: <CONFIGMAP_NAME>
  #   - secretRef:
  #       name: <SECRET_NAME>

  # datadog.env -- Set environment variables for all Agents
  ## The Datadog Agent supports many environment variables.
  ## ref: https://docs.datadoghq.com/agent/docker/?tab=standard#environment-variables
  env: []
  #   - name: DD_LOGS_CONFIG_OPEN_FILES_LIMIT
  #     value: "200"
  #   - name: DD_CRI_SOCKET_PATH
  #     value: /var/run/containerd/containerd.sock
  #   - name: DD_APM_ENABLED
  #     value: "true"
  # datadog.confd -- Provide additional check configurations (static and Autodiscovery)
  ## Each key becomes a file in /conf.d
  ## ref: https://github.com/DataDog/datadog-agent/tree/main/Dockerfiles/agent#optional-volumes
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/
  confd: 
    containerd.yaml: |-
      ad_identifiers:
        - _containerd
      init_config:
      instances:
        - filters:
            - topic=="/tasks/oom"
            - topic=="/tasks/delete"
            - topic=="/tasks/exit"
          collect_events: true
    disk.yaml: |-
      init_config:
      instances:
        - use_mount: false
          file_system_exclude:
            - autofs$
          mount_point_exclude:
            - /proc/sys/fs/binfmt_misc
            - /host/proc/sys/fs/binfmt_misc
  #   kubernetes_state.yaml: |-
  #     ad_identifiers:
  #       - kube-state-metrics
  #     init_config:
  #     instances:
  #       - kube_state_url: http://%%host%%:8080/metrics

  # datadog.checksd -- Provide additional custom checks as python code
  ## Each key becomes a file in /checks.d
  ## ref: https://github.com/DataDog/datadog-agent/tree/main/Dockerfiles/agent#optional-volumes
  checksd:
  #   service.py: |-

  # datadog.dockerSocketPath -- Path to the docker socket
  dockerSocketPath:  # /var/run/docker.sock

  # datadog.criSocketPath -- Path to the container runtime socket (if different from Docker)
  criSocketPath: /var/run/containerd/containerd.sock

  ## Enable process agent and provide custom configs
  processAgent:
    # datadog.processAgent.enabled -- Set this to true to enable live process monitoring agent
    ## Note: /etc/passwd is automatically mounted to allow username resolution.
    ## ref: https://docs.datadoghq.com/graphing/infrastructure/process/#kubernetes-daemonset
    enabled: true

    # datadog.processAgent.processCollection -- Set this to true to enable process collection in process monitoring agent
    ## Requires processAgent.enabled to be set to true to have any effect
    processCollection: true

    # datadog.processAgent.stripProcessArguments -- Set this to scrub all arguments from collected processes
    ## Requires processAgent.enabled and processAgent.processCollection to be set to true to have any effect
    ## ref: https://docs.datadoghq.com/infrastructure/process/?tab=linuxwindows#process-arguments-scrubbing
    stripProcessArguments: false

    # datadog.processAgent.processDiscovery -- Enables or disables autodiscovery of integrations
    processDiscovery: false

  ## Enable systemProbe agent and provide custom configs
  systemProbe:

    # datadog.systemProbe.debugPort -- Specify the port to expose pprof and expvar for system-probe agent
    debugPort: 0

    # datadog.systemProbe.enableConntrack -- Enable the system-probe agent to connect to the netlink/conntrack subsystem to add NAT information to connection data
    ## Ref: http://conntrack-tools.netfilter.org/
    enableConntrack: true

    # datadog.systemProbe.seccomp -- Apply an ad-hoc seccomp profile to the system-probe agent to restrict its privileges
    ## Note that this will break `kubectl exec … -c system-probe -- /bin/bash`
    seccomp: localhost/system-probe

    # datadog.systemProbe.seccompRoot -- Specify the seccomp profile root directory
    seccompRoot: /var/lib/kubelet/seccomp

    # datadog.systemProbe.bpfDebug -- Enable logging for kernel debug
    bpfDebug: false

    # datadog.systemProbe.apparmor -- Specify an apparmor profile for system-probe
    apparmor: unconfined

    # datadog.systemProbe.enableTCPQueueLength -- Enable the TCP queue length eBPF-based check
    enableTCPQueueLength: false

    # datadog.systemProbe.enableOOMKill -- Enable the OOM kill eBPF-based check
    enableOOMKill: false

    # datadog.systemProbe.enableRuntimeCompiler -- Enable the runtime compiler for eBPF probes
    enableRuntimeCompiler: false

    # datadog.systemProbe.mountPackageManagementDirs -- Enables mounting of specific package management directories when runtime compilation is enabled
    mountPackageManagementDirs: []
    ## For runtime compilation to be able to download kernel headers, the host's package management folders
    ## must be mounted to the /host directory. For example, for Ubuntu & Debian the following mount would be necessary:
    # - name: "apt-config-dir"
    #   hostPath: /etc/apt
    #   mountPath: /host/etc/apt
    ## If this list is empty, then all necessary package management directories (for all supported OSs) will be mounted.

    # datadog.systemProbe.osReleasePath -- Specify the path to your os-release file if you don't want to attempt mounting all `/etc/*-release` files by default
    osReleasePath:

    # datadog.systemProbe.runtimeCompilationAssetDir -- Specify a directory for runtime compilation assets to live in
    runtimeCompilationAssetDir: /var/tmp/datadog-agent/system-probe

    # datadog.systemProbe.collectDNSStats -- Enable DNS stat collection
    collectDNSStats: true

    # datadog.systemProbe.maxTrackedConnections -- the maximum number of tracked connections
    maxTrackedConnections: 131072

    # datadog.systemProbe.conntrackMaxStateSize -- the maximum size of the userspace conntrack cache
    conntrackMaxStateSize: 131072  # 2 * maxTrackedConnections by default, per  https://github.com/DataDog/datadog-agent/blob/d1c5de31e1bba72dfac459aed5ff9562c3fdcc20/pkg/process/config/config.go#L229

    # datadog.systemProbe.conntrackInitTimeout -- the time to wait for conntrack to initialize before failing
    conntrackInitTimeout: 10s

  orchestratorExplorer:
    # datadog.orchestratorExplorer.enabled -- Set this to false to disable the orchestrator explorer
    ## This requires processAgent.enabled and clusterAgent.enabled to be set to true
    ## ref: TODO - add doc link
    enabled: true

    # datadog.orchestratorExplorer.container_scrubbing -- Enable the scrubbing of containers in the kubernetes resource YAML for sensitive information
    ## Container scrubbing takes significant resources during data collection.
    ## If you notice that the cluster-agent uses too much CPU in larger clusters,
    ## turning this option off will improve the situation.
    container_scrubbing:
      enabled: true

  networkMonitoring:
    # datadog.networkMonitoring.enabled -- Enable network performance monitoring
    enabled: false

  ## Universal Service Monitoring is currently in private beta.
  ## See https://www.datadoghq.com/blog/universal-service-monitoring-datadog/ for more details and private beta signup.
  serviceMonitoring:
    # datadog.serviceMonitoring.enabled -- Enable Universal Service Monitoring
    enabled: false

  ## Enable security agent and provide custom configs
  securityAgent:
    compliance:
      # datadog.securityAgent.compliance.enabled -- Set to true to enable Cloud Security Posture Management (CSPM)
      enabled: false

      # datadog.securityAgent.compliance.configMap -- Contains CSPM compliance benchmarks that will be used
      configMap:

      # datadog.securityAgent.compliance.checkInterval -- Compliance check run interval
      checkInterval: 20m

    runtime:
      # datadog.securityAgent.runtime.enabled -- Set to true to enable Cloud Workload Security (CWS)
      enabled: false

      policies:
        # datadog.securityAgent.runtime.policies.configMap -- Contains CWS policies that will be used
        configMap:

      syscallMonitor:
        # datadog.securityAgent.runtime.syscallMonitor.enabled -- Set to true to enable the Syscall monitoring (recommended for troubleshooting only)
        enabled: false

  ## Manage NetworkPolicy
  networkPolicy:
    # datadog.networkPolicy.create -- If true, create NetworkPolicy for all the components
    create: false

    # datadog.networkPolicy.flavor -- Flavor of the network policy to use.
    # Can be:
    # * kubernetes for networking.k8s.io/v1/NetworkPolicy
    # * cilium     for cilium.io/v2/CiliumNetworkPolicy
    flavor: kubernetes

    cilium:
      # datadog.networkPolicy.cilium.dnsSelector -- Cilium selector of the DNS server entity
      # @default -- kube-dns in namespace kube-system
      dnsSelector:
        toEndpoints:
          - matchLabels:
              "k8s:io.kubernetes.pod.namespace": kube-system
              "k8s:k8s-app": kube-dns

  ## Configure prometheus scraping autodiscovery
  ## ref: https://docs.datadoghq.com/agent/kubernetes/prometheus/
  prometheusScrape:
    # datadog.prometheusScrape.enabled -- Enable autodiscovering pods and services exposing prometheus metrics.
    enabled: false
    # datadog.prometheusScrape.serviceEndpoints -- Enable generating dedicated checks for service endpoints.
    serviceEndpoints: false
    # datadog.prometheusScrape.additionalConfigs -- Allows adding advanced openmetrics check configurations with custom discovery rules. (Requires Agent version 7.27+)
    additionalConfigs: []
      # -
      #   autodiscovery:
      #     kubernetes_annotations:
      #       include:
      #         custom_include_label: 'true'
      #       exclude:
      #         custom_exclude_label: 'true'
      #     kubernetes_container_names:
      #     - my-app
      #   configurations:
      #   - send_distribution_buckets: true
      #     timeout: 5

  # datadog.ignoreAutoConfig -- List of integrations for which auto_conf.yaml should be ignored.
  ## ref: https://docs.datadoghq.com/agent/faq/auto_conf/
  ignoreAutoConfig: []
  #  - redisdb
  #  - kubernetes_state

  # datadog.containerExclude -- Exclude containers from the Agent
  # Autodiscovery, as a space-separated list
  ## ref: https://docs.datadoghq.com/agent/guide/autodiscovery-management/?tab=containerizedagent#exclude-containers
  containerExclude: "image:docker.io/platform9/pf9-profile-agent image:platform9/pf9-sentry"

  # datadog.containerInclude -- Include containers in the Agent Autodiscovery,
  # as a space-separated list.  If a container matches an include rule, it’s
  # always included in the Autodiscovery
  ## ref: https://docs.datadoghq.com/agent/guide/autodiscovery-management/?tab=containerizedagent#include-containers
  containerInclude:

  # datadog.containerExcludeLogs -- Exclude logs from the Agent Autodiscovery,
  # as a space-separated list
  containerExcludeLogs:

  # datadog.containerIncludeLogs -- Include logs in the Agent Autodiscovery, as
  # a space-separated list
  containerIncludeLogs:

  # datadog.containerExcludeMetrics -- Exclude metrics from the Agent
  # Autodiscovery, as a space-separated list
  containerExcludeMetrics:

  # datadog.containerIncludeMetrics -- Include metrics in the Agent
  # Autodiscovery, as a space-separated list
  containerIncludeMetrics:

  # datadog.excludePauseContainer -- Exclude pause containers from the Agent
  # Autodiscovery.
  ## ref: https://docs.datadoghq.com/agent/guide/autodiscovery-management/?tab=containerizedagent#pause-containers
  excludePauseContainer: true

## This is the Datadog Cluster Agent implementation that handles cluster-wide
## metrics more cleanly, separates concerns for better rbac, and implements
## the external metrics API so you can autoscale HPAs based on datadog metrics
## ref: https://docs.datadoghq.com/agent/kubernetes/cluster/
clusterAgent:
  # clusterAgent.enabled -- Set this to false to disable Datadog Cluster Agent
  enabled: true

  ## Define the Datadog Cluster-Agent image to work with
  image:
    # clusterAgent.image.name -- Cluster Agent image name to use (relative to `registry`)
    name: cluster-agent

    # clusterAgent.image.tag -- Cluster Agent image tag to use
    tag: 1.17.0

    # clusterAgent.image.repository -- Override default registry + image.name for Cluster Agent
    repository:

    # clusterAgent.image.pullPolicy -- Cluster Agent image pullPolicy
    pullPolicy: IfNotPresent

    # clusterAgent.image.pullSecrets -- Cluster Agent repository pullSecret (ex: specify docker registry credentials)
    ## See https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
    pullSecrets: []
    #   - name: "<REG_SECRET>"

  # clusterAgent.securityContext -- Allows you to overwrite the default PodSecurityContext on the cluster-agent pods.
  securityContext: {}

  containers:
    clusterAgent:
      # clusterAgent.containers.clusterAgent.securityContext -- Specify securityContext on the cluster-agent container.
      securityContext: {}

  # clusterAgent.command -- Command to run in the Cluster Agent container as entrypoint
  command: []

  # clusterAgent.token -- Cluster Agent token is a preshared key between node agents and cluster agent (autogenerated if empty; needs to be at least 32 characters, a-zA-Z)
  token: ""

  # clusterAgent.tokenExistingSecret -- Existing secret name to use for Cluster Agent token
  tokenExistingSecret: ""

  # clusterAgent.replicas -- Specify the number of Cluster Agent replicas; if > 1, it allows the cluster agent to work in HA mode.
  replicas: 1

  ## Provide Cluster Agent Deployment pod(s) RBAC configuration
  rbac:
    # clusterAgent.rbac.create -- If true, create & use RBAC resources
    create: true

    # clusterAgent.rbac.serviceAccountName -- Specify a preexisting ServiceAccount to use if clusterAgent.rbac.create is false
    serviceAccountName: default

    # clusterAgent.rbac.serviceAccountAnnotations -- Annotations to add to the ServiceAccount if clusterAgent.rbac.create is true
    serviceAccountAnnotations: {}

  ## Provide Cluster Agent pod security configuration
  podSecurity:
    podSecurityPolicy:
      # clusterAgent.podSecurity.podSecurityPolicy.create -- If true, create a PodSecurityPolicy resource for Cluster Agent pods
      create: false
    securityContextConstraints:
      # clusterAgent.podSecurity.securityContextConstraints.create -- If true, create a SCC resource for Cluster Agent pods
      create: false

  # Enable the metricsProvider to be able to scale based on metrics in Datadog
  metricsProvider:
    # clusterAgent.metricsProvider.enabled -- Set this to true to enable Metrics Provider
    enabled: false

    # clusterAgent.metricsProvider.wpaController -- Enable informer and controller of the watermark pod autoscaler
    ## NOTE: You need to install the `WatermarkPodAutoscaler` CRD before
    wpaController: false

    # clusterAgent.metricsProvider.useDatadogMetrics -- Enable usage of DatadogMetric CRD to autoscale on arbitrary Datadog queries
    ## NOTE: It will install DatadogMetrics CRD automatically (it may conflict with previous installations)
    useDatadogMetrics: false

    # clusterAgent.metricsProvider.createReaderRbac -- Create `external-metrics-reader` RBAC automatically (to allow HPA to read data from Cluster Agent)
    createReaderRbac: true

    # clusterAgent.metricsProvider.aggregator -- Define the aggregator the cluster agent will use to process the metrics. The options are (avg, min, max, sum)
    aggregator: avg

    ## Configuration for the service for the cluster-agent metrics server
    service:
      # clusterAgent.metricsProvider.service.type -- Set type of cluster-agent metrics server service
      type: ClusterIP

      # clusterAgent.metricsProvider.service.port -- Set port of cluster-agent metrics server service (Kubernetes >= 1.15)
      port: 8443

    # clusterAgent.metricsProvider.endpoint -- Override the external metrics provider endpoint. If not set, the cluster-agent defaults to `datadog.site`
    endpoint:  # https://api.datadoghq.com

  # clusterAgent.env -- Set environment variables specific to Cluster Agent
  ## The Cluster-Agent supports many additional environment variables
  ## ref: https://docs.datadoghq.com/agent/cluster_agent/commands/#cluster-agent-options
  env: 
     - name: DD_LOGS_CONFIG_OPEN_FILES_LIMIT
       value: "200"
  # clusterAgent.envFrom --  Set environment variables specific to Cluster Agent from configMaps and/or secrets
  ## The Cluster-Agent supports many additional environment variables
  ## ref: https://docs.datadoghq.com/agent/cluster_agent/commands/#cluster-agent-options
  envFrom: []
  #   - configMapRef:
  #       name: <CONFIGMAP_NAME>
  #   - secretRef:
  #       name: <SECRET_NAME>

  admissionController:
    # clusterAgent.admissionController.enabled -- Enable the admissionController to be able to inject APM/Dogstatsd config and standard tags (env, service, version) automatically into your pods
    enabled: true

    # clusterAgent.admissionController.mutateUnlabelled -- Enable injecting config without having the pod label 'admission.datadoghq.com/enabled="true"'
    mutateUnlabelled: true

  # clusterAgent.confd -- Provide additional cluster check configurations. Each key will become a file in /conf.d.
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/
  confd: {}
  #   mysql.yaml: |-
  #     cluster_check: true
  #     instances:
  #       - host: <EXTERNAL_IP>
  #         port: 3306
  #         username: datadog
  #         password: <YOUR_CHOSEN_PASSWORD>

  # clusterAgent.advancedConfd -- Provide additional cluster check configurations. Each key is an integration containing several config files.
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/
  advancedConfd: {}
  #  mysql.d:
  #    1.yaml: |-
  #      cluster_check: true
  #      instances:
  #        - host: <EXTERNAL_IP>
  #          port: 3306
  #          username: datadog
  #          password: <YOUR_CHOSEN_PASSWORD>
  #    2.yaml:  |-
  #      cluster_check: true
  #      instances:
  #        - host: <EXTERNAL_IP>
  #          port: 3306
  #          username: datadog
  #          password: <YOUR_CHOSEN_PASSWORD>

  # clusterAgent.resources -- Datadog cluster-agent resource requests and limits.
  resources: 
   requests:
     cpu: 200m
     memory: 256Mi
   limits:
     cpu: 200m
     memory: 256Mi

  # clusterAgent.priorityClassName -- Name of the priorityClass to apply to the Cluster Agent
  priorityClassName:  # system-cluster-critical

  # clusterAgent.nodeSelector -- Allow the Cluster Agent Deployment to be scheduled on selected nodes
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  nodeSelector: {}

  # clusterAgent.affinity -- Allow the Cluster Agent Deployment to schedule using affinity rules
  ## By default, Cluster Agent Deployment Pods are forced to run on different Nodes.
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}

  # clusterAgent.healthPort -- Port number to use in the Cluster Agent for the healthz endpoint
  healthPort: 5556

  # clusterAgent.livenessProbe -- Override default Cluster Agent liveness probe settings
  # @default -- Every 15s / 6 KO / 1 OK
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 15
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6

  # clusterAgent.readinessProbe -- Override default Cluster Agent readiness probe settings
  # @default -- Every 15s / 6 KO / 1 OK
  readinessProbe:
    initialDelaySeconds: 15
    periodSeconds: 15
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6

  # clusterAgent.strategy -- Allow the Cluster Agent deployment to perform a rolling update on helm update
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

  # clusterAgent.deploymentAnnotations -- Annotations to add to the cluster-agent's deployment
  deploymentAnnotations: {}
  #   key: "value"

  # clusterAgent.podAnnotations -- Annotations to add to the cluster-agent's pod(s)
  podAnnotations: {}
  #   key: "value"

  # clusterAgent.useHostNetwork -- Bind ports on the hostNetwork
  ## Useful for CNI networking where hostPort might
  ## not be supported. The ports need to be available on all hosts. It can be
  ## used for custom metrics instead of a service endpoint.
  ##
  ## WARNING: Make sure that hosts using this are properly firewalled otherwise
  ## metrics and traces are accepted from any host able to connect to this host.
  #
  useHostNetwork: false

  # clusterAgent.dnsConfig -- Specify dns configuration options for datadog cluster agent containers e.g ndots
  ## ref: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config
  dnsConfig: {}
  #  options:
  #  - name: ndots
  #    value: "1"

  # clusterAgent.volumes -- Specify additional volumes to mount in the cluster-agent container
  volumes: 
    - hostPath:
        path: /var/run/containerd/containerd.sock
      name: containerdsocket
    - hostPath:
        path: /var/run
      name: var-run

  # clusterAgent.volumeMounts -- Specify additional volumes to mount in the cluster-agent container
  volumeMounts:
    - name: containerdsocket
      mountPath: /var/run/containerd/containerd.sock
    - mountPath: /host/var/run
      name: var-run
      readOnly: true

  # clusterAgent.datadog_cluster_yaml -- Specify custom contents for the datadog cluster agent config (datadog-cluster.yaml)
  datadog_cluster_yaml: {}

  # clusterAgent.createPodDisruptionBudget -- Create pod disruption budget for Cluster Agent deployments
  createPodDisruptionBudget: false

  networkPolicy:
    # clusterAgent.networkPolicy.create -- If true, create a NetworkPolicy for the cluster agent.
    # DEPRECATED. Use datadog.networkPolicy.create instead
    create: false

  # clusterAgent.additionalLabels -- Adds labels to the Cluster Agent deployment and pods
  additionalLabels: {}
    # key: "value"

## This section lets you configure the agents deployed by this chart to connect to a Cluster Agent
## deployed independently
existingClusterAgent:
  # existingClusterAgent.join -- set this to true if you want the agents deployed by this chart to
  # connect to a Cluster Agent deployed independently
  join: false

  # existingClusterAgent.tokenSecretName -- Existing secret name to use for external Cluster Agent token
  tokenSecretName:  # <EXISTING_DCA_SECRET_NAME>

  # existingClusterAgent.serviceName -- Existing service name to use for reaching the external Cluster Agent
  serviceName:  # <EXISTING_DCA_SERVICE_NAME>

  # existingClusterAgent.clusterchecksEnabled -- set this to false if you don’t want the agents to run the cluster checks of the joined external cluster agent
  clusterchecksEnabled: true

agents:
  # agents.enabled -- You should keep Datadog DaemonSet enabled!
  ## The exceptional case could be a situation when you need to run
  ## single Datadog pod per every namespace, but you do not need to
  ## re-create a DaemonSet for every non-default namespace install.
  ## Note: StatsD and DogStatsD work over UDP, so you may not
  ## get guaranteed delivery of the metrics in Datadog-per-namespace setup!
  #
  enabled: true

  ## Define the Datadog image to work with
  image:
    # agents.image.name -- Datadog Agent image name to use (relative to `registry`)
    ## use "dogstatsd" for Standalone Datadog Agent DogStatsD 7
    name: agent

    # agents.image.tag -- Define the Agent version to use
    tag: 7.33.0

    # agents.image.tagSuffix -- Suffix to append to Agent tag
    ## Ex:
    ##  jmx        to enable jmx fetch collection
    ##  servercore to get Windows images based on servercore
    tagSuffix: ""

    # agents.image.repository -- Override default registry + image.name for Agent
    repository:

    # agents.image.doNotCheckTag -- Skip the version<>chart compatibility check
    ## By default, the version passed in agents.image.tag is checked
    ## for compatibility with the version of the chart.
    ## This boolean permits completely skipping this check.
    ## This is useful, for example, for custom tags that do not
    ## respect semantic versioning.
    doNotCheckTag:  # false

    # agents.image.pullPolicy -- Datadog Agent image pull policy
    pullPolicy: IfNotPresent

    # agents.image.pullSecrets -- Datadog Agent repository pullSecret (ex: specify docker registry credentials)
    ## See https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
    pullSecrets: []
    #   - name: "<REG_SECRET>"

  ## Provide Daemonset RBAC configuration
  rbac:
    # agents.rbac.create -- If true, create & use RBAC resources
    create: true

    # agents.rbac.serviceAccountName -- Specify a preexisting ServiceAccount to use if agents.rbac.create is false
    serviceAccountName: default

    # agents.rbac.serviceAccountAnnotations -- Annotations to add to the ServiceAccount if agents.rbac.create is true
    serviceAccountAnnotations: {}

  ## Provide Daemonset PodSecurityPolicy configuration
  podSecurity:
    podSecurityPolicy:
      # agents.podSecurity.podSecurityPolicy.create -- If true, create a PodSecurityPolicy resource for Agent pods
      create: false

    securityContextConstraints:
      # agents.podSecurity.securityContextConstraints.create -- If true, create a SecurityContextConstraints resource for Agent pods
      create: false

    # agents.podSecurity.seLinuxContext -- Provide seLinuxContext configuration for PSP/SCC
    # @default -- Must run as spc_t
    seLinuxContext:
      rule: MustRunAs
      seLinuxOptions:
        user: system_u
        role: system_r
        type: spc_t
        level: s0

    # agents.podSecurity.privileged -- If true, allow running privileged containers
    privileged: false

    # agents.podSecurity.capabilities -- Allowed capabilities
    ## capabilities must contain all agents.containers.*.securityContext.capabilities.
    capabilities:
      - SYS_ADMIN
      - SYS_RESOURCE
      - SYS_PTRACE
      - NET_ADMIN
      - NET_BROADCAST
      - NET_RAW
      - IPC_LOCK
      - CHOWN
      - AUDIT_CONTROL
      - AUDIT_READ

    # agents.podSecurity.volumes -- Allowed volumes types
    volumes:
      - configMap
      - downwardAPI
      - emptyDir
      - hostPath
      - secret

    # agents.podSecurity.seccompProfiles -- Allowed seccomp profiles
    seccompProfiles:
      - "runtime/default"
      - "localhost/system-probe"

    apparmor:
      # agents.podSecurity.apparmor.enabled -- If true, enable apparmor enforcement
      ## see: https://kubernetes.io/docs/tutorials/clusters/apparmor/
      enabled: true

    # agents.podSecurity.apparmorProfiles -- Allowed apparmor profiles
    apparmorProfiles:
      - "runtime/default"
      - "unconfined"

    # agents.podSecurity.defaultApparmor -- Default AppArmor profile for all containers but system-probe
    defaultApparmor: runtime/default

  containers:
    agent:
      # agents.containers.agent.env -- Additional environment variables for the agent container
      env: []
      #   - name: DD_APM_ENABLED
      #     value: "true"
      # agents.containers.agent.envFrom -- Set environment variables specific to agent container from configMaps and/or secrets
      envFrom: []
      #   - configMapRef:
      #       name: <CONFIGMAP_NAME>
      #   - secretRef:
      #       name: <SECRET_NAME>

      # agents.containers.agent.logLevel -- Set logging verbosity, valid log levels are: trace, debug, info, warn, error, critical, and off
      ## If not set, fall back to the value of datadog.logLevel.
      logLevel:  # INFO

      # agents.containers.agent.resources -- Resource requests and limits for the agent container.
      resources: 
       requests:
         cpu: 200m
         memory: 256Mi
       limits:
         cpu: 200m
         memory: 256Mi

      # agents.containers.agent.healthPort -- Port number to use in the node agent for the healthz endpoint
      healthPort: 5555

      # agents.containers.agent.livenessProbe -- Override default agent liveness probe settings
      # @default -- Every 15s / 6 KO / 1 OK
      livenessProbe:
        initialDelaySeconds: 15
        periodSeconds: 15
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 6

      # agents.containers.agent.readinessProbe -- Override default agent readiness probe settings
      # @default -- Every 15s / 6 KO / 1 OK
      readinessProbe:
        initialDelaySeconds: 15
        periodSeconds: 15
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 6

      # agents.containers.agent.securityContext -- Allows you to overwrite the default container SecurityContext for the agent container.
      securityContext: {}

      # agents.containers.agent.ports -- Allows to specify extra ports (hostPorts for instance) for this container
      ports: []

    processAgent:
      # agents.containers.processAgent.env -- Additional environment variables for the process-agent container
      env: []

      # agents.containers.processAgent.envFrom -- Set environment variables specific to process-agent from configMaps and/or secrets
      envFrom: []
      #   - configMapRef:
      #       name: <CONFIGMAP_NAME>
      #   - secretRef:
      #       name: <SECRET_NAME>

      # agents.containers.processAgent.logLevel -- Set logging verbosity, valid log levels are: trace, debug, info, warn, error, critical, and off
      ## If not set, fall back to the value of datadog.logLevel.
      logLevel:  # INFO

      # agents.containers.processAgent.resources -- Resource requests and limits for the process-agent container
      resources: 
       requests:
         cpu: 100m
         memory: 200Mi
       limits:
         cpu: 100m
         memory: 200Mi

      # agents.containers.processAgent.securityContext -- Allows you to overwrite the default container SecurityContext for the process-agent container.
      securityContext: {}

      # agents.containers.processAgent.ports -- Allows to specify extra ports (hostPorts for instance) for this container
      ports: []

    traceAgent:
      # agents.containers.traceAgent.env -- Additional environment variables for the trace-agent container
      env:

      # agents.containers.traceAgent.envFrom -- Set environment variables specific to trace-agent from configMaps and/or secrets
      envFrom: []
      #   - configMapRef:
      #       name: <CONFIGMAP_NAME>
      #   - secretRef:
      #       name: <SECRET_NAME>

      # agents.containers.traceAgent.logLevel -- Set logging verbosity, valid log levels are: trace, debug, info, warn, error, critical, and off
      logLevel:  # INFO

      # agents.containers.traceAgent.resources -- Resource requests and limits for the trace-agent container
      resources: 
       requests:
         cpu: 100m
         memory: 200Mi
       limits:
         cpu: 100m
         memory: 200Mi

      # agents.containers.traceAgent.livenessProbe -- Override default agent liveness probe settings
      # @default -- Every 15s
      livenessProbe:
        initialDelaySeconds: 15
        periodSeconds: 15
        timeoutSeconds: 5

      # agents.containers.traceAgent.securityContext -- Allows you to overwrite the default container SecurityContext for the trace-agent container.
      securityContext: {}

      # agents.containers.traceAgent.ports -- Allows to specify extra ports (hostPorts for instance) for this container
      ports: []

    systemProbe:
      # agents.containers.systemProbe.env -- Additional environment variables for the system-probe container
      env: []

      # agents.containers.systemProbe.envFrom -- Set environment variables specific to system-probe from configMaps and/or secrets
      envFrom: []
      #   - configMapRef:
      #       name: <CONFIGMAP_NAME>
      #   - secretRef:
      #       name: <SECRET_NAME>

      # agents.containers.systemProbe.logLevel -- Set logging verbosity, valid log levels are: trace, debug, info, warn, error, critical, and off.
      ## If not set, fall back to the value of datadog.logLevel.
      logLevel:  # INFO

      # agents.containers.systemProbe.resources -- Resource requests and limits for the system-probe container
      resources: 
       requests:
         cpu: 100m
         memory: 200Mi
       limits:
         cpu: 100m
         memory: 200Mi

      # agents.containers.systemProbe.securityContext -- Allows you to overwrite the default container SecurityContext for the system-probe container.
      ## agents.podSecurity.capabilities must reflect the changes made in securityContext.capabilities.
      securityContext:
        privileged: false
        capabilities:
          add: ["SYS_ADMIN", "SYS_RESOURCE", "SYS_PTRACE", "NET_ADMIN", "NET_BROADCAST", "NET_RAW", "IPC_LOCK", "CHOWN"]

      # agents.containers.systemProbe.ports -- Allows to specify extra ports (hostPorts for instance) for this container
      ports: []

    securityAgent:
      # agents.containers.securityAgent.env -- Additional environment variables for the security-agent container
      env:

      # agents.containers.securityAgent.envFrom -- Set environment variables specific to security-agent from configMaps and/or secrets
      envFrom: []
      #   - configMapRef:
      #       name: <CONFIGMAP_NAME>
      #   - secretRef:
      #       name: <SECRET_NAME>

      # agents.containers.securityAgent.logLevel -- Set logging verbosity, valid log levels are: trace, debug, info, warn, error, critical, and off
      ## If not set, fall back to the value of datadog.logLevel.
      logLevel:  # INFO

      # agents.containers.securityAgent.resources -- Resource requests and limits for the security-agent container
      resources: {}
      #  requests:
      #    cpu: 100m
      #    memory: 200Mi
      #  limits:
      #    cpu: 100m
      #    memory: 200Mi

      # agents.containers.securityAgent.ports -- Allows to specify extra ports (hostPorts for instance) for this container
      ports: []

    initContainers:
      # agents.containers.initContainers.resources -- Resource requests and limits for the init containers
      resources: {}
      #  requests:
      #    cpu: 100m
      #    memory: 200Mi
      #  limits:
      #    cpu: 100m
      #    memory: 200Mi

  # agents.volumes -- Specify additional volumes to mount in the dd-agent container
  volumes: []
  #   - hostPath:
  #       path: <HOST_PATH>
  #     name: <VOLUME_NAME>

  # agents.volumeMounts -- Specify additional volumes to mount in all containers of the agent pod
  volumeMounts: []
  #   - name: <VOLUME_NAME>
  #     mountPath: <CONTAINER_PATH>
  #     readOnly: true

  # agents.useHostNetwork -- Bind ports on the hostNetwork
  ## Useful for CNI networking where hostPort might
  ## not be supported. The ports need to be available on all hosts. It can be
  ## used for custom metrics instead of a service endpoint.
  ##
  ## WARNING: Make sure that hosts using this are properly firewalled otherwise
  ## metrics and traces are accepted from any host able to connect to this host.
  useHostNetwork: false

  # agents.dnsConfig -- specify dns configuration options for datadog cluster agent containers e.g ndots
  ## ref: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config
  dnsConfig: {}
  #  options:
  #  - name: ndots
  #    value: "1"

  # agents.daemonsetAnnotations -- Annotations to add to the DaemonSet
  daemonsetAnnotations: {}
  #   key: "value"

  # agents.podAnnotations -- Annotations to add to the DaemonSet's Pods
  podAnnotations: {}
  #   <POD_ANNOTATION>: '[{"key": "<KEY>", "value": "<VALUE>"}]'

  # agents.tolerations -- Allow the DaemonSet to schedule on tainted nodes (requires Kubernetes >= 1.6)
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule

  # agents.nodeSelector -- Allow the DaemonSet to schedule on selected nodes
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  nodeSelector: {}

  # agents.affinity -- Allow the DaemonSet to schedule using affinity rules
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}

  # agents.updateStrategy -- Allow the DaemonSet to perform a rolling update on helm update
  ## ref: https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: "10%"

  # agents.priorityClassCreate -- Creates a priorityClass for the Datadog Agent's Daemonset pods.
  priorityClassCreate: false

  # agents.priorityClassName -- Sets PriorityClassName if defined
  priorityClassName:

  # agents.priorityClassValue -- Value used to specify the priority of the scheduling of Datadog Agent's Daemonset pods.
  ## The PriorityClass uses PreemptLowerPriority.
  priorityClassValue: 1000000000

  # agents.podLabels -- Sets podLabels if defined
  # Note: These labels are also used as label selectors so they are immutable.
  podLabels: {}

  # agents.additionalLabels -- Adds labels to the Agent daemonset and pods
  additionalLabels: {}
    # key: "value"

  # agents.useConfigMap -- Configures a configmap to provide the agent configuration. Use this in combination with the `agents.customAgentConfig` parameter.
  useConfigMap:  # false

  # agents.customAgentConfig -- Specify custom contents for the datadog agent config (datadog.yaml)
  ## ref: https://docs.datadoghq.com/agent/guide/agent-configuration-files/?tab=agentv6
  ## ref: https://github.com/DataDog/datadog-agent/blob/main/pkg/config/config_template.yaml
  ## Note the `agents.useConfigMap` needs to be set to `true` for this parameter to be taken into account.
  customAgentConfig: {}
  #   # Autodiscovery for Kubernetes
  #   listeners:
  #     - name: kubelet
  #   config_providers:
  #     - name: kubelet
  #       polling: true
  #     # needed to support legacy docker label config templates
  #     - name: docker
  #       polling: true
  #
  #   # Enable java cgroup handling. Only one of those options should be enabled,
  #   # depending on the agent version you are using along that chart.
  #
  #   # agent version < 6.15
  #   # jmx_use_cgroup_memory_limit: true
  #
  #   # agent version >= 6.15
  #   # jmx_use_container_support: true

  networkPolicy:
    # agents.networkPolicy.create -- If true, create a NetworkPolicy for the agents.
    # DEPRECATED. Use datadog.networkPolicy.create instead
    create: false

  localService:
    # agents.localService.overrideName -- Name of the internal traffic service to target the agent running on the local node
    overrideName: ""

    # agents.localService.forceLocalServiceEnabled -- Force the creation of the internal traffic policy service to target the agent running on the local node.
    # By default, the internal traffic service is created only on Kubernetes 1.22+ where the feature became beta and enabled by default.
    # This option allows to force the creation of the internal traffic service on kubernetes 1.21 where the feature was alpha and required a feature gate to be explicitly enabled.
    forceLocalServiceEnabled: true

clusterChecksRunner:
  # clusterChecksRunner.enabled -- If true, deploys agent dedicated for running the Cluster Checks instead of running in the Daemonset's agents.
  ## ref: https://docs.datadoghq.com/agent/autodiscovery/clusterchecks/
  enabled: false

  ## Define the Datadog image to work with.
  image:
    # clusterChecksRunner.image.name -- Datadog Agent image name to use (relative to `registry`)
    name: agent

    # clusterChecksRunner.image.tag -- Define the Agent version to use
    tag: 7.33.0

    # clusterChecksRunner.image.tagSuffix -- Suffix to append to Agent tag
    ## Ex:
    ##  jmx        to enable jmx fetch collection
    ##  servercore to get Windows images based on servercore
    tagSuffix: ""

    # clusterChecksRunner.image.repository -- Override default registry + image.name for Cluster Check Runners
    repository:

    # clusterChecksRunner.image.pullPolicy -- Datadog Agent image pull policy
    pullPolicy: IfNotPresent

    # clusterChecksRunner.image.pullSecrets -- Datadog Agent repository pullSecret (ex: specify docker registry credentials)
    ## See https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
    pullSecrets: []
    #   - name: "<REG_SECRET>"

  # clusterChecksRunner.createPodDisruptionBudget -- Create the pod disruption budget to apply to the cluster checks agents
  createPodDisruptionBudget: false

  # Provide Cluster Checks Deployment pods RBAC configuration
  rbac:
    # clusterChecksRunner.rbac.create -- If true, create & use RBAC resources
    create: true

    # clusterChecksRunner.rbac.dedicated -- If true, use a dedicated RBAC resource for the cluster checks agent(s)
    dedicated: false

    # clusterChecksRunner.rbac.serviceAccountAnnotations -- Annotations to add to the ServiceAccount if clusterChecksRunner.rbac.dedicated is true
    serviceAccountAnnotations: {}

    # clusterChecksRunner.rbac.serviceAccountName -- Specify a preexisting ServiceAccount to use if clusterChecksRunner.rbac.create is false
    serviceAccountName: default

  # clusterChecksRunner.replicas -- Number of Cluster Checks Runner instances
  ## If you want to deploy the clusterChecks agent in HA, keep at least clusterChecksRunner.replicas set to 2.
  ## And increase the clusterChecksRunner.replicas according to the number of Cluster Checks.
  replicas: 2

  # clusterChecksRunner.resources -- Datadog clusterchecks-agent resource requests and limits.
  resources:
    requests:
      cpu: 200m
      memory: 500Mi
    limits:
      cpu: 200m
      memory: 500Mi

  # clusterChecksRunner.affinity -- Allow the ClusterChecks Deployment to schedule using affinity rules.
  ## By default, ClusterChecks Deployment Pods are preferred to run on different Nodes.
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}

  # clusterChecksRunner.strategy -- Allow the ClusterChecks deployment to perform a rolling update on helm update
  ## ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

  # clusterChecksRunner.dnsConfig -- specify dns configuration options for datadog cluster agent containers e.g ndots
  ## ref: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config
  dnsConfig: {}
  #  options:
  #  - name: ndots
  #    value: "1"

  # clusterChecksRunner.priorityClassName -- Name of the priorityClass to apply to the Cluster checks runners
  priorityClassName:  # system-cluster-critical

  # clusterChecksRunner.nodeSelector -- Allow the ClusterChecks Deployment to schedule on selected nodes
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  #
  nodeSelector: {}

  # clusterChecksRunner.tolerations -- Tolerations for pod assignment
  ## Ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  #
  tolerations: []

  # clusterChecksRunner.healthPort -- Port number to use in the Cluster Checks Runner for the healthz endpoint
  healthPort: 5557

  # clusterChecksRunner.livenessProbe -- Override default agent liveness probe settings
  # @default -- Every 15s / 6 KO / 1 OK
  ## In case of issues with the probe, you can disable it with the
  ## following values, to allow easier investigating:
  #
  # livenessProbe:
  #   exec:
  #     command: ["/bin/true"]
  #
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 15
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6

  # clusterChecksRunner.readinessProbe -- Override default agent readiness probe settings
  # @default -- Every 15s / 6 KO / 1 OK
  ## In case of issues with the probe, you can disable it with the
  ## following values, to allow easier investigating:
  #
  # readinessProbe:
  #   exec:
  #     command: ["/bin/true"]
  #
  readinessProbe:
    initialDelaySeconds: 15
    periodSeconds: 15
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6

  # clusterChecksRunner.deploymentAnnotations -- Annotations to add to the cluster-checks-runner's Deployment
  deploymentAnnotations: {}
  #   key: "value"

  # clusterChecksRunner.podAnnotations -- Annotations to add to the cluster-checks-runner's pod(s)
  podAnnotations: {}
  #   key: "value"

  # clusterChecksRunner.env -- Environment variables specific to Cluster Checks Runner
  ## ref: https://github.com/DataDog/datadog-agent/tree/main/Dockerfiles/agent#environment-variables
  env:
    - name: DD_LOGS_CONFIG_OPEN_FILES_LIMIT
      value: "200"

  # clusterChecksRunner.envFrom -- Set environment variables specific to Cluster Checks Runner from configMaps and/or secrets
  ## envFrom to pass configmaps or secrets as environment
  ## ref: https://github.com/DataDog/datadog-agent/tree/main/Dockerfiles/agent#environment-variables
  envFrom: []
  #   - configMapRef:
  #       name: <CONFIGMAP_NAME>
  #   - secretRef:
  #       name: <SECRET_NAME>

  # clusterChecksRunner.volumes -- Specify additional volumes to mount in the cluster checks container
  volumes: []
  #   - hostPath:
  #       path: <HOST_PATH>
  #     name: <VOLUME_NAME>

  # clusterChecksRunner.volumeMounts -- Specify additional volumes to mount in the cluster checks container
  volumeMounts: []
  #   - name: <VOLUME_NAME>
  #     mountPath: <CONTAINER_PATH>
  #     readOnly: true

  networkPolicy:
    # clusterChecksRunner.networkPolicy.create -- If true, create a NetworkPolicy for the cluster checks runners.
    # DEPRECATED. Use datadog.networkPolicy.create instead
    create: false

  # clusterChecksRunner.additionalLabels -- Adds labels to the cluster checks runner deployment and pods
  additionalLabels: {}
    # key: "value"

  # clusterChecksRunner.securityContext -- Allows you to overwrite the default PodSecurityContext on the clusterchecks pods.
  securityContext: {}

  # clusterChecksRunner.ports -- Allows to specify extra ports (hostPorts for instance) for this container
  ports: []

datadog-crds:
  crds:
    # datadog-crds.crds.datadogMetrics -- Set to true to deploy the DatadogMetrics CRD
    datadogMetrics: true

kube-state-metrics:
  rbac:
    # kube-state-metrics.rbac.create -- If true, create & use RBAC resources
    create: true

  serviceAccount:
    # kube-state-metrics.serviceAccount.create -- If true, create ServiceAccount, require rbac kube-state-metrics.rbac.create true
    create: true

    # kube-state-metrics.serviceAccount.name -- The name of the ServiceAccount to use.
    ## If not set and create is true, a name is generated using the fullname template
    name:

  # kube-state-metrics.resources -- Resource requests and limits for the kube-state-metrics container.
  resources: 
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      memory: 256Mi

  # kube-state-metrics.nodeSelector -- Node selector for KSM. KSM only supports Linux.
  nodeSelector:
    kubernetes.io/os: linux

  # # kube-state-metrics.image -- Override default image information for the kube-state-metrics container.
  # image:
  #  # kube-state-metrics.repository -- Override default image registry for the kube-state-metrics container.
  #  repository: k8s.gcr.io/kube-state-metrics/kube-state-metrics
  #  # kube-state-metrics.tag -- Override default image tag for the kube-state-metrics container.
  #  tag: v1.9.8
  #  # kube-state-metrics.pullPolicy -- Override default image pullPolicy for the kube-state-metrics container.
  #  pullPolicy: IfNotPresent

providers:
  gke:
    # providers.gke.autopilot -- Enables Datadog Agent deployment on GKE Autopilot
    autopilot: false

  eks:
    ec2:
      # providers.eks.ec2.useHostnameFromFile -- Use hostname from EC2 filesystem instead of fetching from metadata endpoint.
      ## When deploying to EC2-backed EKS infrastructure, there are situations where the
      ## IMDS metadata endpoint is not accesible to containers. This flag mounts the host's
      ## `/var/lib/cloud/data/instance-id` and uses that for Agent's hostname instead.
      useHostnameFromFile: false
jurschel commented 2 years ago

And the APM settings in the values.yaml don't get it working, because the service is never created unless you use the force setting, and when it is created it's created with the wrong name, apm vs traceport, which, if you look in the node agent pods, is where the trace-agent is sending all its traffic... Also, we are using .NET tracers, so the UDS socket method isn't available to us per the NOTE in the install documentation. I tried it anyway; it just didn't work.

clamoriniere commented 2 years ago

@jurschel

Thanks for the information. I don't know if you saw my initial question, but we also need to know which Kubernetes version is being used.

On which kubernetes version are you trying to install the agent? The minimum Kubernetes version to have the service created is 1.22.0+.

To give more context, the service needs to be configured with an internal traffic policy because we want the data sent by the application pods to only target the Datadog Agent on the same node. If we use a Service between the application and the agent on a Kubernetes version < 1.22, the traffic will be load-balanced between all agent pods, which will break the metrics-APM trace tagging.

If you are using a Kubernetes version < 1.22 with the setup you describe, you will need to communicate through a hostPort bound by the Datadog Agent.

datadog:
  dogstatsd:
    useHostPort: true
  apm:
    portEnabled: true
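
These options can also be passed on the command line at install/upgrade time; a sketch, assuming the release and chart names used in this thread:

helm upgrade --install datadog-agent -f values.yaml \
  --set datadog.dogstatsd.useHostPort=true \
  --set datadog.apm.portEnabled=true \
  datadog/datadog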

For the second issue you mentioned, about the port name in the service (traceport instead of apm): indeed, we define the port in the trace-agent container with the name traceport https://github.com/DataDog/helm-charts/blob/f1820b74bc1ec647c544b26f6d69e1ef01ce9974/charts/datadog/templates/_container-trace-agent.yaml#L18-L23

and in the service we name it apm. But it should not be an issue, because we define the targetPort properly in the service, so the routing should be fine: https://github.com/DataDog/helm-charts/blob/f1820b74bc1ec647c544b26f6d69e1ef01ce9974/charts/datadog/templates/agent-services.yaml#L90-L93
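
For illustration, here is a minimal Service sketch (not the chart's actual template; the Service name is a placeholder) showing that the port name is only a label, while routing is driven by port and targetPort on the selected pods:

apiVersion: v1
kind: Service
metadata:
  name: datadog-local          # placeholder name for illustration
spec:
  selector:
    app: datadog-agent
  ports:
    - name: apm                # informational label only
      protocol: TCP
      port: 8126               # port exposed by the Service
      targetPort: 8126         # container port that actually receives the traffic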

jurschel commented 2 years ago

My version is up there in my post; I edited it right after I posted: 1.21.3. So what you're telling me is that I have to run two copies of the agent? Which I was doing, and which was working, but that obviously costs me 2x the money in agent fees... If I went with the DaemonSet install of this agent, it would in fact install the service I have had to install manually. So I don't quite understand the version difference... Nowhere in the install instructions on datadoghq.com did I see it say ONLY if you're on 1.22 or above. I get that deep in the notes of the chart it mentions it, but I believe there is a disparity between the helm and DaemonSet installs.

clamoriniere commented 2 years ago

Sorry if my previous answers weren't clear. I think there is a misunderstanding.

To better understand your previous setup, and why deploying the agent with the chart doesn't work for you, could you please answer these questions:

In parallel, to try to better explain how the agent should be deployed on Kubernetes with the helm chart:

The recommended agent deployment on Kubernetes is one agent (pod) per node/host, which is what the Datadog helm chart does.

With the hostPort solution, the host IP can be provided to the APM library thanks to the Kubernetes downward API, as documented here: https://docs.datadoghq.com/agent/kubernetes/apm/?tab=helm#configure-your-application-pods-in-order-to-communicate-with-the-datadog-agent. We also provide an admission controller to inject the configuration into your application pods: https://docs.datadoghq.com/agent/cluster_agent/admission_controller/#apm-and-dogstatsd
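
As an illustration of the downward API approach, here is a sketch of an application pod exposing the node IP to the tracer; DD_AGENT_HOST is the variable read by Datadog tracers per the linked docs, while the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: sample-app             # placeholder
spec:
  containers:
    - name: app
      image: my-app:latest     # placeholder image
      env:
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP   # node IP where the agent binds hostPort 8126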

jurschel commented 2 years ago

Thanks for the questions. Initially, I just had a standard Ubuntu node agent installed on these cluster nodes. However, I wasn't getting all the information I desired, so I went ahead and used the helm chart install, all per the instructions. APM worked... obviously, because there was an Ubuntu NODE agent installed on the machine answering on port 8126... However, when I removed that agent, APM stopped working, while the helm-installed agents were all still sending logs; just APM stopped. Now that there is no agent on the node answering at the hostIP on port 8126, it ceases to function... I think you either have to state in the documentation that it REQUIRES two agent installs to get APM to work right, or explain exactly how it is supposed to work when there is no agent on the hostIP returned by the downward API... When I did a netstat and grepped for 8126, I did see dynamic node ports on the hostIP coming from the k8s pod on a 10. address at port 8126, but those are random dynamic ports that will never match hostIP:8126.

clamoriniere commented 2 years ago

Sorry, the PR was closed due to the GitHub automation.

And thanks for the explanation, it is clearer now. The good news is that the helm chart already handles this 😃 having two agents is definitely not needed.

To have the agent deployed by the helm chart bind host port 8126, you just need to enable it with the option:

datadog:
  apm:
    portEnabled: true

It should then work as it did when the agent was installed on the Ubuntu host.

In addition, if you want the same visibility on the host, you can also set the option agents.useHostNetwork: true and the agent pod will use the host network.
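
Put together, a minimal values.yaml sketch of both suggestions (assuming no other overrides) would be:

datadog:
  apm:
    portEnabled: true        # trace-agent binds hostPort 8126 on the node
agents:
  useHostNetwork: true       # agent pod shares the node's network namespace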

Let me know if this recommendation solves the issue.

jurschel commented 2 years ago

Thanks for responding. What I guess you're missing is that I've already set portEnabled: true and it doesn't work without a NODE agent installed. I can try it with agents.useHostNetwork: true, as that hasn't been tried. That might be the key, since it sounds like it would force the node agent to use host networking instead of pod networking. Stand by, I'll try that.

jurschel commented 2 years ago

Interestingly enough, it looks like traces are still flowing even though the trace-agent sidecar is now failing its liveness check... Maybe it's because of the name mismatch that your PR fixed?

Warning Unhealthy 106s (x16 over 6m46s) kubelet Liveness probe failed: dial tcp 192.168.4.23:8126: connect: connection refused

root@platform908:~# netstat | grep 8126
tcp        0      0 platform908.gmcps:47458 10.8.45.83:8126         TIME_WAIT
tcp        0      0 platform908.gmcps:47332 10.8.45.83:8126         TIME_WAIT
tcp        0      0 platform908.gmcps:47270 10.8.45.83:8126         TIME_WAIT
tcp        0      0 platform908.gmcps:47388 10.8.45.83:8126         TIME_WAIT
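
A quick way to tell whether anything is actually listening on the node's port 8126, rather than only seeing TIME_WAIT connections, is to filter for LISTEN sockets; a sketch, assuming net-tools is available on the node:

# A host agent or a hostNetwork agent shows up as a LISTEN socket here; a CNI-managed
# hostPort may not, since the portmap plugin implements it with iptables DNAT rules.
sudo netstat -ltnp | grep 8126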

jurschel commented 2 years ago

Nope. So far the only thing that works with the chart I've got is enabling the APM setting portEnabled: true, setting agents.localService.forceLocalServiceEnabled to true, and applying a service file that changes the name of the TCP port to traceport. Then APM works and doesn't fail liveness checks. I get the feeling this is being tested on nodes that have the full host agent installed, because if you were doing what I'm doing you would see it doesn't work unless you do it the way I'm doing it...
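
For reference, a sketch of the values this workaround relies on (the Service with the renamed traceport port is applied manually on top, as posted earlier in this thread):

datadog:
  apm:
    portEnabled: true                 # expose the trace port
agents:
  localService:
    forceLocalServiceEnabled: true    # create the internal-traffic-policy service on 1.21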

clamoriniere commented 2 years ago

I think we are getting closer to a solution.

We will need an agent flare to investigate why the trace-agent is not binding the port properly. Could you generate a flare and send it to our support team?

Also, could you check in the DaemonSet spec whether the hostPort setting is present on the trace-agent container?
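
One way to check, as a sketch assuming the DaemonSet is named datadog-agent in the default namespace:

# Print the ports declared on the trace-agent container; look for a hostPort: 8126 entry.
kubectl get daemonset datadog-agent -o jsonpath='{.spec.template.spec.containers[?(@.name=="trace-agent")].ports}'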

Thanks

jurschel commented 2 years ago

You can review the several flares I've sent up over this problem in support ticket 656356.

jurschel commented 2 years ago

As an update here, it appears that the root cause of these problems might actually be related to an open Calico issue.

https://github.com/projectcalico/calico/issues/4617

CharlyF commented 2 years ago

Hi all,

I heard back from Google: the root cause of the issue is the following change within the CNI plugin, "portmap: Apply the DNAT hairpin to the whole subnet" (https://github.com/containernetworking/plugins/pull/469).

The fix is going to be available in the versions below:

1.23.5-gke.300+
1.22.8-gke.300+
1.21.10-gke.2100+
1.20.15-gke.4200+

The ETA for the rollout to complete is ~2 weeks from now.

I will close this for now, we can follow up in the upstream repo if necessary.