DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

How to disable Infrastructure host metrics collection and enable only custom checks collection? #11618

Open karthikeayan opened 2 years ago

karthikeayan commented 2 years ago

Describe what happened: Unable to deploy Datadog Container Agent as pod with only custom checks.

I have deployed the Datadog Kubernetes Helm chart in the Kubernetes cluster. Datadog created a DaemonSet, which deploys a pod on each node and pulls metrics from that node. I also want to deploy another Datadog agent as a pod that runs only custom checks such as mysql and postgres. It should not collect metrics for the host it is running on, because host metrics are already collected by the DaemonSet.

Host metrics are tagged to a new host named after the Kubernetes pod.

When I follow this guide, https://docs.datadoghq.com/logs/guide/how-to-set-up-only-logs/?tab=kubernetes, no metrics are sent to Datadog at all: neither the host metrics nor the custom check metrics.

Describe what you expected: Host should not appear in infrastructure list.

Steps to reproduce the issue: Deploy the Datadog Helm chart, then create a deployment with the values below.

ohookins commented 2 years ago

I'm having a similar problem here, although in a slightly different scenario. I want to capture only Postgres metrics, but the agent is still capturing system metrics even though I removed everything else in conf.d.

anden-dev commented 2 years ago

We are following the same idea: running an agent in EKS that only does the RDS checks.

Agent status looks good so far:

kubectl exec -it <agent-pod> -- agent status

===============
Agent (v7.37.1)
===============

=========
Collector
=========

  Running Checks
  ==============

    postgres (12.4.0)
    -----------------
      Instance ID: postgres:6cb55c36780909a7 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/postgres.d/conf.yaml
      Total Runs: 10
      Metric Samples: Last Run: 305, Total: 2,881
      Events: Last Run: 0, Total: 0
      Database Monitoring Activity Samples: Last Run: 1, Total: 13
      Database Monitoring Query Metrics: Last Run: 2, Total: 14
      Database Monitoring Query Samples: Last Run: 3, Total: 237
      Service Checks: Last Run: 1, Total: 10
      Average Execution Time : 35ms
      Last Execution Date : 2022-07-07 09:19:08 UTC (1657185548000)
      Last Successful Execution Date : 2022-07-07 09:19:08 UTC (1657185548000)
      metadata:
        version.major: 12
        version.minor: 8
        version.patch: 0
        version.raw: 12.8
        version.scheme: semver      

We also remove the standard checks with a little bit of force, as there is no variable to toggle this:

    lifecycle:
      postStart:
        exec:
          # Quote the glob so the shell passes it to find unexpanded
          command: ["/bin/sh", "-c", "find /etc/datadog-agent/conf.d/ -iname '*.yaml.default' -delete"]

So far so good.

I am now trying to fix these issues from the agent's logs:

2022-07-07 09:16:49 UTC | CORE | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
WARNING: `--config` argument is deprecated and will be removed in a future version. Please use `--cfgpath` instead.
2022-07-07 09:16:49 UTC | PROCESS | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2022-07-07 09:16:49 UTC | PROCESS | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2022-07-07 09:16:49 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:591 in func1) | Error loading config: open /etc/datadog-agent/system-probe.yaml: no such file or directory
2022-07-07 09:16:49 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2022-07-07 09:16:49 UTC | SECURITY | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2022-07-07 09:16:51 UTC | CORE | WARN | (pkg/serializer/serializer.go:144 in NewSerializer) | event payloads are disabled: all events will be dropped
2022-07-07 09:16:51 UTC | CORE | WARN | (pkg/serializer/serializer.go:147 in NewSerializer) | series payloads are disabled: all series will be dropped
2022-07-07 09:16:51 UTC | CORE | WARN | (pkg/serializer/serializer.go:150 in NewSerializer) | service_checks payloads are disabled: all service_checks will be dropped
2022-07-07 09:16:51 UTC | CORE | WARN | (pkg/serializer/serializer.go:153 in NewSerializer) | sketches payloads are disabled: all sketches will be dropped
2022-07-07 09:16:51 UTC | CORE | WARN | (pkg/secrets/secrets.go:50 in Init) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2022-07-07 09:16:52 UTC | CORE | WARN | (pkg/autodiscovery/providers/config_reader.go:156 in read) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
2022-07-07 09:16:52 UTC | CORE | WARN | (pkg/autodiscovery/providers/config_reader.go:156 in read) | Skipping, open : no such file or directory
2022-07-07 09:16:52 UTC | CORE | ERROR | (pkg/collector/scheduler.go:76 in Schedule) | Unable to run Check postgres: a check with ID postgres:6cb55c36780909a7 is already running
2022-07-07 09:16:53 UTC | CORE | WARN | (pkg/util/cloudproviders/gce/gce_tags.go:50 in getCachedTags) | unable to get tags from gce and cache is empty: GCE metadata API error: status code 401 trying to GET http://169.254.169.254/computeMetadata/v1/?recursive=true
2022-07-07 09:16:53 UTC | TRACE | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
system-probe exited with code 0, disabling
trace-agent exited with code 0, disabling
2022-07-07 09:17:21 UTC | CORE | ERROR | (pkg/metrics/iterable_series.go:55 in Append) | Cannot append a serie in a closed buffered channel
2022-07-07 09:19:13 UTC | PROCESS | WARN | (pkg/util/cloudproviders/gce/gce_tags.go:50 in getCachedTags) | unable to get tags from gce and cache is empty: GCE metadata API error: status code 401 trying to GET http://169.254.169.254/computeMetadata/v1/?recursive=true
2022-07-07 09:19:36 UTC | CORE | ERROR | (pkg/metrics/iterable_series.go:55 in Append) | Cannot append a serie in a closed buffered channel
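The four "payloads are disabled" warnings in the log above line up with the payload switches that the logs-only guide linked earlier turns off. As a rough sketch, the helper below reads agent log lines on stdin and prints the environment variable that would need to be set back to true for each disabled payload type. Only DD_ENABLE_PAYLOADS_SERIES is named in this thread; the other three variable names are taken from that guide and should be treated as assumptions, and `payload_env_vars` is a hypothetical name used here for illustration.

```shell
#!/bin/sh
# Sketch: map "<type> payloads are disabled" warnings to the
# DD_ENABLE_PAYLOADS_* variable that controls that payload type.
# payload_env_vars is a hypothetical helper, not part of the agent.
payload_env_vars() {
    # Extract the payload type that precedes "payloads are disabled".
    sed -n 's/.*| \([a-z_]*\) payloads are disabled.*/\1/p' |
    while read -r kind; do
        case "$kind" in
            event)          echo "DD_ENABLE_PAYLOADS_EVENTS" ;;
            series)         echo "DD_ENABLE_PAYLOADS_SERIES" ;;
            service_checks) echo "DD_ENABLE_PAYLOADS_SERVICE_CHECKS" ;;
            sketches)       echo "DD_ENABLE_PAYLOADS_SKETCHES" ;;
        esac
    done
}

# Example usage (pod name is a placeholder):
#   kubectl logs <agent-pod> | payload_env_vars
```

For a metrics-only agent like the one discussed here, the series and service_checks entries are probably the ones that matter most, since the postgres check reports both.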

UPDATE: after enabling DD_ENABLE_PAYLOADS_SERIES, these errors went away:

2022-07-07 09:19:36 UTC | CORE | ERROR | (pkg/metrics/iterable_series.go:55 in Append) | Cannot append a serie in a closed buffered channel

setting

liveness Probe got rid of this

2022-07-07 10:35:04 UTC | CORE | ERROR | (pkg/collector/scheduler.go:76 in Schedule) | Unable to run Check postgres: a check with ID postgres:6cb55c36780909a7 is already running

Update: setting

2022-07-07 09:16:49 UTC | SYS-PROBE | WARN | (pkg/util/log/log.go:591 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec

clatour commented 2 years ago

Might be more of an implementation detail, but because the datadog-agent container uses s6, there are some hooks where a user can dynamically mount shell scripts into /etc/cont-init.d which would have more of a guaranteed order of execution than what is provided by postStart:

There is no guarantee, however, that the postStart handler is called before the Container's entrypoint is called

So, the solution we took was to define a 99-delete-default-checks.sh with the same contents and mount it there.
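A minimal sketch of what such a 99-delete-default-checks.sh might look like, assuming it simply wraps the find command from the postStart hook shown earlier in the thread. The exact script clatour used is not shown here, so the function name, the CONF_DIR override, and the directory guard are all assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical /etc/cont-init.d/99-delete-default-checks.sh
# Removes the bundled *.yaml.default check configs so that only
# explicitly configured checks (e.g. postgres.d/conf.yaml) are loaded.

# CONF_DIR is an assumed override for testing; the default path is the
# one used in the postStart command earlier in this thread.
CONF_DIR="${CONF_DIR:-/etc/datadog-agent/conf.d}"

delete_default_checks() {
    # $1: directory to scan; do nothing if it does not exist.
    [ -d "$1" ] || return 0
    # Quote the pattern so the shell does not expand it before find runs.
    find "$1" -iname '*.yaml.default' -delete
}

delete_default_checks "$CONF_DIR"
```

The script would need to be executable and mounted into /etc/cont-init.d (for example from a ConfigMap) so that the init system runs it before the agent starts.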

Would it be useful to consider adding this as an init script and exposing it via an environment variable such as DD_DISABLE_DEFAULT_CHECKS (or something like it)?

mehdibenfeguir commented 11 months ago

@clatour could you please explain in detail what the content of the script 99-delete-default-checks.sh is? As for me, I'm using a Helm chart to install Datadog on k8s just to scrape MySQL metrics, and I'm getting unwanted k8s metrics that I need to turn off. I tried this, but it didn't work:

  --set 'datadog.kubeStateMetricsCore.enabled=false' \
  --set 'kube-state-metrics.serviceAccount.create=false' \