DataDog / datadog-operator

Datadog Agent Kubernetes Operator
Apache License 2.0

`/conf.d/` files are not created when using `extraConfd` and `v2alpha1` DatadogAgent #696

Open renchap opened 1 year ago

renchap commented 1 year ago

Describe what happened:

I am trying to configure cluster checks to monitor external managed services. Due to the issues described in #689, I need to use the v2alpha1 DatadogAgent resource.

I am using a spec containing this:

  override:
    <runner>:
      extraConfd:
        configMap:
          name: datadog-confd-config
          items:
            - key: redisdb.yml
              path: redisdb.yml

When <runner> is clusterAgent, I can see that /conf.d/redisdb.yml exists in the datadog-cluster-agent pod (with no effect, but that's another issue).

If <runner> is nodeAgent or clusterChecksRunner (with features.clusterChecks.useClusterChecksRunners enabled), then /conf.d/redisdb.yml is not created in those containers.

Describe what you expected:

I expect my redisdb.yml configuration file to be created in the containers and picked up by the running agent.

Steps to reproduce the issue:

Use this DatadogAgent manifest with Operator 1.0.0-rc7:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <API_KEY>
      appKey: <APP_KEY>
    kubelet:
      tlsVerify: false
  override:
    nodeAgent:
      extraConfd:
        configDataMap:
          redisdb.yaml: |-
            cluster_checks: true
            init_config:
            instances:
              - host: XXX
                port: 1234
                username: default
                password: XXX
                ssl: true

Then enter a datadog-agent pod, and /conf.d/ will be empty.
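
For example, checking from outside the pod (the pod name is a placeholder):

kubectl exec datadog-agent-xxxxx -- ls /conf.d/
# prints nothing: the directory is empty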

Additional environment details (Operating System, Cloud provider, etc):

Operator version: 1.0.0-rc7
Cloud provider: Exoscale (using their managed k8s offering)
This is related to Datadog support ticket #1081449

khewonc commented 1 year ago

Hi, thanks for reaching out. To run a cluster check, the redisdb configuration yaml needs to be added to the cluster agent (DCA), since the DCA is the component that schedules checks onto the node agents or the cluster checks runners (CLC runners). We'll be sure to have this documented when the operator goes GA to help prevent further confusion.

Another small nit in your yaml configuration: cluster_checks: true needs to be cluster_check: true, per https://docs.datadoghq.com/containers/cluster_agent/clusterchecks/?tab=operator#configuration-from-configuration-files.

Here's an example snippet to run a cluster check in a node agent:

  override:
    clusterAgent:
      extraConfd:
        configDataMap:
          redisdb.yaml: |-
            cluster_check: true
            init_config:
            instances:
              - host: XXX
                port: 1234
                username: default
                password: XXX
                ssl: true

In the DCA, we can see which node the check is scheduled on:

root@datadog-cluster-agent-xxxxxxxxxx-xxxxx:/# agent clusterchecks
[...]
===== Checks on <hostname> =====

=== redisdb check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/redisdb.yaml
Instance ID: redisdb:1bd7f42364a4def9
empty_default_hostname: true
host: XXX
password: XXX
port: 1234
ssl: true
username: default
~
===

And if we check that node's agent status, the check should run. In this case, the error is expected since XXX:1234 isn't a valid address, but it at least shows that the configs we want are being used:

    redisdb (4.5.2)
    ---------------
      Instance ID: redisdb:1bd7f42364a4def9 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/redisdb.yaml
      Total Runs: 4
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 4
      Average Execution Time : 30ms
      Last Execution Date : 2023-02-03 20:59:13 UTC (1675457953000)
      Last Successful Execution Date : Never
      Error: Error -5 connecting to XXX:1234. No address associated with hostname.

The same idea applies for CLC runners. Configure the check on the DCA and the DCA will schedule the check on the CLC runners. Using this config:

  features:
    clusterChecks:
      useClusterChecksRunners: true
  override:
    clusterAgent:
      extraConfd:
       configDataMap:
         redisdb.yaml: |-
            cluster_check: true
            init_config:
            instances:
              - host: XXX
                port: 1234
                username: default
                password: XXX
                ssl: true

We can verify that the checks are being scheduled on the CLC runners:

root@datadog-cluster-agent-xxxxxxxxxx-xxxxx:/# agent clusterchecks
[...]
===== Checks on datadog-cluster-checks-runner-xxxxxxxxxx-xxxxx =====
[...]

=== redisdb check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/redisdb.yaml
Instance ID: redisdb:1bd7f42364a4def9
empty_default_hostname: true
host: XXX
password: XXX
port: 1234
ssl: true
username: default
~
===

Running agent status in that CLC runner:

root@datadog-cluster-checks-runner-xxxxxxxxxx-xxxxx:/# agent status
[...]
    redisdb (4.5.2)
    ---------------
      Instance ID: redisdb:1bd7f42364a4def9 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/redisdb.yaml
      Total Runs: 37
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 37
      Average Execution Time : 8ms
      Last Execution Date : 2023-02-03 21:22:34 UTC (1675459354000)
      Last Successful Execution Date : Never
      Error: Error -5 connecting to XXX:1234. No address associated with hostname.

When adding an extraConfd to the node agent, you'll find the config at /etc/datadog-agent/conf.d/. For example, with the config you posted to reproduce the issue, the file would be at /etc/datadog-agent/conf.d/redisdb.yaml:

root@datadog-agent-xxxxx:/# cat /etc/datadog-agent/conf.d/redisdb.yaml 
cluster_checks: true
init_config:
instances:
  - host: XXX
    port: 1234
    username: default
    password: XXX
    ssl: true

Hope that helps clear up some of the issues you were seeing with the check configs.

renchap commented 1 year ago

Thanks for having a look at this.

I made your change but I am not seeing the same results as you.

Here is my config:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  global:
    clusterName: xxx
    site: datadoghq.eu
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: app-key
    kubelet:
      tlsVerify: false
  features:
    liveContainerCollection:
      enabled: true
    liveProcessCollection:
      enabled: true
    oomKill:
      enabled: true
    prometheusScrape:
      enabled: false
    clusterChecks:
      enabled: true
      useClusterChecksRunners: true
  override:
    clusterAgent:
      extraConfd:
        configDataMap:
          redisdb.yml: |-
            cluster_check: true
            init_config:
            instances:
              - host: xxx
                port: xx
                username: default
                password: xxx
                ssl: true

I am running the latest 1.0.0-rc.8 operator.

Here is the output of agent status on the DCA pod:

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 19
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 3, Total: 25
      Service Checks: Last Run: 5, Total: 70
      Average Execution Time : 1.466s
      Last Execution Date : 2023-02-03 22:14:48 UTC (1675462488000)
      Last Successful Execution Date : 2023-02-03 22:14:48 UTC (1675462488000)

    kubernetes_state_core
    ---------------------
      Instance ID: kubernetes_state_core:b13e6c9d52886e07 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_state_core.d/kubernetes_state_core.yaml.default
      Total Runs: 18
      Metric Samples: Last Run: 1,600, Total: 22,298
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 8, Total: 112
      Average Execution Time : 8ms
      Last Execution Date : 2023-02-03 22:14:38 UTC (1675462478000)
      Last Successful Execution Date : 2023-02-03 22:14:38 UTC (1675462478000)

    orchestrator
    ------------
      Instance ID: orchestrator:e1ef8faec3fcbfc1 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/orchestrator.d/orchestrator.yaml
      Total Runs: 28
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 34ms
      Last Execution Date : 2023-02-03 22:14:46 UTC (1675462486000)
      Last Successful Execution Date : 2023-02-03 22:14:46 UTC (1675462486000)

No sign of redis here.

But this does not seem normal?

root@datadog-cluster-agent-7f7cc745c8-sdfzt:~# ls /conf.d/
redisdb.yml
root@datadog-cluster-agent-7f7cc745c8-sdfzt:~# ls /etc/datadog-agent/conf.d/
kubernetes_apiserver.d  kubernetes_state_core.d  orchestrator.d

khewonc commented 1 year ago

Hi, thanks for trying out the new configuration. Could you also try using redisdb.yaml as the key in the override section instead of redisdb.yml?

  override:
    clusterAgent:
      extraConfd:
        configDataMap:
          redisdb.yaml: |-
            cluster_check: true
            init_config:
            instances:
              - host: xxx
                port: xx
                username: default
                password: xxx
                ssl: true

Like you said, the location of the redisdb.yml file looked odd in the DCA. I would have expected it to be copied from /conf.d/ to /etc/datadog-agent/conf.d/. My guess is this happened because this line in the DCA entrypoint script copies over *.yaml files, but not *.yml files.
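
In other words, the copy step behaves roughly like this (an illustrative sketch, not the actual entrypoint script):

# Copies user-provided check configs into the agent's conf.d directory,
# but the glob only matches .yaml files, so .yml files stay behind in /conf.d/
for f in /conf.d/*.yaml; do
  cp "$f" /etc/datadog-agent/conf.d/
done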

With redisdb.yml, the file only exists at /conf.d/:

root@datadog-cluster-agent-xxxxxxxxxx-xxxxx:/# ls /conf.d/  
redisdb.yml
root@datadog-cluster-agent-xxxxxxxxxx-xxxxx:/# ls /etc/datadog-agent/conf.d/
kubernetes_apiserver.d  kubernetes_state_core.d  orchestrator.d

But changing it to redisdb.yaml allows the file to be copied over to the correct place:

root@datadog-cluster-agent-xxxxxxxxxx-xxxxx:/# ls /conf.d/
redisdb.yaml
root@datadog-cluster-agent-xxxxxxxxxx-xxxxx:/# ls /etc/datadog-agent/conf.d/
kubernetes_apiserver.d  kubernetes_state_core.d  orchestrator.d  redisdb.yaml

Double checking with agent clusterchecks, I see that the config is now picked up:

root@datadog-cluster-agent-xxxxxxxxxx-xxxxx:/# agent clusterchecks
[...]
===== Checks on datadog-cluster-checks-runner-xxxxxxxxxx-xxxxx =====
[...]
=== redisdb check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/redisdb.yaml
Instance ID: redisdb:eb960cd1e44d41c4
empty_default_hostname: true
host: xxx
password: xxx
port: xx
ssl: true
tags:
- kube_cluster_name:kind-test
- cluster_name:kind-test
username: default
~
===

renchap commented 1 year ago

Thanks, it indeed works with a .yaml extension! It may be a good idea to handle both extensions :)
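
Something along these lines in that copy step would presumably cover both (sketch only):

# Hypothetical: match both extensions when copying user configs
for f in /conf.d/*.yaml /conf.d/*.yml; do
  [ -e "$f" ] || continue
  cp "$f" /etc/datadog-agent/conf.d/
done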

khewonc commented 1 year ago

Glad to hear that worked and thanks for the feedback. I'll add a card to the backlog to better support .yml extensions in the check configs.