Open · SleepyBrett opened this issue 6 years ago
Hi @SleepyBrett Thanks for reaching out. I'll try to address all of your points, let me know if I missed some.
We find that rightsizing the dd agent daemonset is impossible for any cluster with any significant workload if you turn on ksm autodiscovery. This is because one agent (or multiple, if you shard your ksm by collector) has much more work to do (scraping ksm) than the rest, which only collect container/node metrics.
We're aware of this issue, and we recommend sharding ksm per namespace and keeping namespaces small. This isn't just to make the agent happy; we've also observed that it improves performance in large k8s clusters.
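For illustration, a minimal sketch of one per-namespace shard (the namespace `team-a`, image version, and flag spelling are only examples; check them against your kube-state-metrics version):

```yaml
# One kube-state-metrics Deployment per namespace (or small group of namespaces).
# --namespaces limits what this instance watches, so the agent scraping it
# only carries that slice of the load.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics-team-a
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
      shard: team-a
  template:
    metadata:
      labels:
        app: kube-state-metrics
        shard: team-a
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: quay.io/coreos/kube-state-metrics:v1.5.0  # illustrative version
          args:
            - --namespaces=team-a   # hypothetical namespace; repeat the Deployment per shard
          ports:
            - containerPort: 8080
              name: http-metrics
```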
I imagine this might also be a problem if I turned on event collection; the "leader" would also have much higher cpu/memory usage than the other nodes.
You're right, although we're moving event collection to the cluster agent (not GA yet, but soon to be) so the issue will go away.
Create two ksm pods: one set up to run only the pod collector, and the other to run all the other collectors
This sharding is also the one we started with internally, but it doesn't solve the problem for two reasons:
If splitting by namespace is not an option for you, splitting by collector is still your best bet for now. In our case we experimented with also splitting out configmaps, endpoints, and services to smooth the load some more, but that depends on your workload, YMMV. You could also disable the collectors whose metrics the agent doesn't collect. Unfortunately pods remain the main issue.
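As a rough illustration of that split (flag names assume kube-state-metrics 1.x, and the collector list is only an example to trim to your own needs):

```yaml
# Shard 1: only the pod collector, by far the heaviest on busy clusters.
args:
  - --collectors=pods
---
# Shard 2: everything else you actually use; drop collectors whose metrics
# the agent never forwards anyway.
args:
  - --collectors=nodes,deployments,replicasets,daemonsets,statefulsets,services,endpoints,configmaps
```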
Can I just step back a moment and ask: "true/false" or "yes/no"? Maybe they are interchangeable; it's not at all clear...
They are interchangeable in the config, but yes/no don't work well in env variables, so we're consolidating to true/false. See: https://github.com/DataDog/datadog-agent/pull/2171
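For example (illustrative only, using the usual DD_&lt;option&gt; mapping for env vars):

```yaml
# In the container spec: stick to true/false for DD_* boolean env vars.
env:
  - name: DD_LEADER_ELECTION
    value: "false"
  - name: DD_COLLECT_KUBERNETES_EVENTS
    value: "false"
```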
and then mounting [a datadog.yaml with empty config_providers]
This is a good idea, but the file config provider is initialized anyway, because the agent is supposed to run some checks by default. You can disable them by mounting an empty volume in place of /etc/datadog-agent/conf.d/ to remove the default check configs.
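A minimal sketch of that trick in a pod spec (paths as in the Agent 6 image; adapt to your manifest):

```yaml
containers:
  - name: datadog-agent
    image: datadog/agent:6
    volumeMounts:
      - name: empty-confd
        mountPath: /etc/datadog-agent/conf.d   # hides the default check configs shipped in the image
volumes:
  - name: empty-confd
    emptyDir: {}
```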
Again, we don't recommend going the sidecar route; the agent is not designed for this and things might break. Sharding ksm by namespace is more scalable in the long run. But if you're doing it anyway, you may want to disable host metadata collection to avoid weird host duplication issues in the app: https://github.com/DataDog/datadog-agent/blob/aa3fd27e8c7351b19f243f3e2cca7498d96aa690/cmd/agent/app/start.go#L220-L228 (set DD_ENABLE_METADATA_COLLECTION to false).
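In a pod spec that would look something like this (sketch only):

```yaml
env:
  - name: DD_ENABLE_METADATA_COLLECTION
    value: "false"   # skip host metadata so sidecar agents don't show up as duplicate hosts
```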
One last point that will help some more soon: we're working on a revamp of OpenMetrics parsing which shows promising results, performance-wise. You can expect the load of ksm parsing to drop in an upcoming release.
Hope that helps.
We're aware of this issue, and we recommend sharding ksm per namespace and keeping namespaces small. This isn't just to make the agent happy; we've also observed that it improves performance in large k8s clusters.
As the manager of several multi-tenant clusters I can't see this as a realistic strategy. I have dozens of namespaces across 6-10 clusters, with more added every day. I'd like to see a critique of why my solution of sidecaring a special agent w/ ksm is not a valid strategy. As a little side quest, I managed to get this working fairly cleanly with Veneur; however, it does not do the transformation of the KSM statistics like your agent does.
I'll be trying this Veneur config today w/ my largest cluster to see if it can hold up under the load caused by the ksms on that cluster (which is still very modestly sized by kube deployment standards).
I'd suggest your engineering team go back to the drawing board with the 'shard ksm per namespace' strategy unless they plan to write an operator to handle that work. Even if they do, I imagine that pretty quickly we'd be looking at problems with the amount of load all those KSMs will put on the kube-apiserver.
Again, we don't recommend going the sidecar route; the agent is not designed for this and things might break.
Guess what? It's already broken in your "preferred configuration" based on both your documentation and your helm charts even on modest clusters ( ~75 nodes, ~7500 pods ).
It is very disappointing that there isn't a way to strip the magic out of your agent and provide the ability to essentially opt into any collector I'd like to use without 1) stamping out a directory or 2) modifying your agent source code (?!?).
Veneur holds up against the ksm load on the ~75 node / 7.5k+ pod cluster without issue (we see occasional request canceled (client timeout) errors from your api endpoint, but retries succeed, and no metric continuity errors in our test sub-org). It's not doing the transforms, of course; we are now evaluating the transforms in depth.
Output of the info page (if this is a bug)
Describe what happened: I would like to configure a number of agents on k8s to ONLY scrape ksm.
We find that rightsizing the dd agent daemonset is impossible for any cluster with any significant workload if you turn on ksm autodiscovery. This is because one agent (or multiple, if you shard your ksm by collector) has much more work to do (scraping ksm) than the rest, which only collect container/node metrics.
I imagine this might also be a problem if I turned on event collection; the "leader" would also have much higher cpu/memory usage than the other nodes.
To that end I am attempting the following:
To that end I'm passing the following env variables to those dd containers:
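Roughly along these lines (a hypothetical sketch following the standard DD_&lt;option&gt; mapping, not the exact list used here):

```yaml
env:
  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-secret   # hypothetical secret name
        key: api-key
  - name: DD_LEADER_ELECTION
    value: "false"
  - name: DD_COLLECT_KUBERNETES_EVENTS
    value: "false"
  - name: DD_ENABLE_METADATA_COLLECTION
    value: "false"
  - name: DD_LOGS_ENABLED
    value: "false"
  - name: DD_APM_ENABLED
    value: "false"
  - name: DD_PROCESS_AGENT_ENABLED
    value: "false"
```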
Can I just step back a moment and ask: "true/false" or "yes/no"? Maybe they are interchangeable; it's not at all clear...
and then mounting the following datadog.yaml into /etc/datadog-agent/
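Something along these lines (a sketch, not the exact file):

```yaml
# datadog.yaml sketch: empty out the autodiscovery config providers so the
# agent doesn't schedule checks on its own.
config_providers: []
```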
and then mounting the following auto_conf.yaml into /conf.d
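And for the check itself, the usual shape is a static kubernetes_state config pointing at the local ksm (again a sketch, not the exact file):

```yaml
# e.g. mounted as /etc/datadog-agent/conf.d/kubernetes_state.d/conf.yaml
init_config:

instances:
  - kube_state_url: http://127.0.0.1:8080/metrics
```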
At this point I expect that I have told the agent to DO NOTHING except scrape 127.0.0.1:8080/metrics and ship the results. However, when I jump into that sidecar and run:
So it looks like I still have several collectors running and some crashing... and it's not at all clear if the kubernetes_state "job" is even running.
Because the documentation isn't super clear (and is often telling me how to configure things in agent 5.x) I started digging into the agent code.
The way it's configured is very confusing. It seems to me that the following things are happening:
1) s6 is used to start the agent and may do some things re: config before the agent even starts; this is not at all clear to me and I've chosen to mostly ignore it, though I'm not even sure why you would use s6 in a containerized env, philosophically.
1b) At some point s6 starts running things in /etc/cont-init.d. These files start shuffling things around in your config dirs based on env variables/files on the filesystem/voodoo magic.
2) Now the agent starts and it does yet more "magic"; I think most of this magic is constrained to ./pkg/config/ but I can't be sure. Again you seem to be starting things based on some combination of env variables, files on the filesystem, etc. There seems to be some backwards compatibility built in (`/etc/dd-agent/`).
...
All this is to say that, in an effort to be magical, the agent has become very hard for someone who doesn't happen to be a Datadog engineer to configure by hand when that is the appropriate thing to do.
Describe what you expected: I expect ONLY the ksm metrics to be shipped from this sidecar container
Steps to reproduce the issue:
Additional environment details (Operating System, Cloud provider, etc):