k8s w/ ksm integration issues

SleepyBrett commented 6 years ago

Output of the info page (if this is a bug)

(Paste the output of the info page here)

Describe what happened: K8S 1.9.6 (though I don't think it matters)

I have datadog deployed as a daemonset (using your stable chart) in the same namespace as my current monitoring stack (prometheus 2.x + node exporter + ksm + ...). I have the KSM integration enabled in the chart.

All the pods come up fine though two of them have significantly higher cpu usage and crash a lot. They seem to be getting liveness killed at a very high rate (2200+ crashes over 14 days). It just so happens that those two pods are on the same nodes as the two ksm pods (one mine one yours).

So I'll probably move DD to it's own namespace though I'm not sure that will resolve the discovery of the ksm i want it to ignore.

So my question is, how are you doing KSM discovery? And maybe it makes more sense if your KSM scraper can't handle this size of cluster (50ish m4.10x nodes, 2500ish services) to deploy a separate agent from the daemonset specifically for scraping that KSM (maybe co-habitated in the same pod) so that I don't lose other node specific metrics due to the high rate of crashing.

Describe what you expected: DD k8s w/ KSM to behave.

Steps to reproduce the issue: Build a reasonably sized cluster and try to run dd w/ ksm

Additional environment details (Operating System, Cloud provider, etc): coreos, aws, k8s 1.9.6

xvello commented 6 years ago

hi @SleepyBrett ,

First, please note that you can disable the ksm installed by our chart by setting kubeStateMetrics.enabled to false, the agent will autodetect your existing instance and pull metrics from it.

As for the ksm-monitoring agents, could you please:

make sure you are running 6.2 or later, as we significantly improved the performance of this check
increase the number of check runners via the DD_CHECK_RUNNERS envvars, as documented here?

If the issue persists, we'll need a full debug agent flare sent to our support team to investigate further.

Cheers

SleepyBrett commented 6 years ago

Ok so i can deploy one less ksm as long as I continue to run it in my namespace, but that doesn't really answer my question "How do you do KSM service discovery?" What if my prometheus KSM is configured differently than the one you deploy? You should be doing discovery by labels and apply datadog ownership labels to your own KSM.

We are running 6.2, your performance on the node w/ ksm is still abysmal and I fear that it because of the high amount of self-crashes that I'm losing other metrics on that node.

DD_CHECK_RUNNERS: the agent runs all checks in sequence by default (default value = 1 runner). If you need to run a high number of checks (or slow checks) the collector-queue component might fall behind and fail the healthcheck. You can increase the number of runners to run checks in parallel

What exactly does a check runner do? This documentation doesn't actually explain whats happening or what tradeoffs I'm making when I increase this value. Wait .. https://github.com/DataDog/datadog-agent/wiki/Python-Check-Runner

So this is going to cause more resource utilization on ALL pods just so one of them can keep up with KSM.. this seems like a bad solution to a bad design. Why can't your runner realize that it's falling behind on a given node and scale it's own additional runner? Or better yet, realize that KSM is a big hunk of metrics on any cluster that isn't tiny and sidecar a special agent into that pod to run separate from the main agent daemonset?

xvello commented 6 years ago

Hi @SleepyBrett

Our autodiscovery process is documented at https://docs.datadoghq.com/agent/autodiscovery/ , and works across all namespaces, by locally querying the kubelet. Our support team will be happy to help you with setting it up for your cluster's specifics. We are currently working on improving our check scheduling logic in the next release to avoid relying on a fixed number of runners, in the meantime raising the runner count is our documented mitigation strategy.

SleepyBrett commented 6 years ago

ad_identifiers:
  - kube-state-metrics

So wait you are just essentially checking for docker container names that contain 'kube-state-metrics'? and then hitting them on - kube_state_url: http://%%host%%:8080/metrics ...

I run a multi-tenant cluster, I can't guarantee that other teams wont run their own kube-state-metrics for their own purposes and pollute the data.

I suggest any of the following approaches:

1) a special agent sidecarred into KSM reading localhost:8080/metrics

2) using the kube api to find your own KSM using it's servicename or labeling and doing a leader election to determine who is going to scrape it

3) if ad-indentifer can take a regex at least expanding that default to contain the namespace that your chart is deployed in.

Of all those options 1 still seems the most sensible. It works around a number of problems. It allows for one special agent that will allow tweaking for resources and scheduling, it guarantees that other collectors will not be interrupted by overloading the primary node agent, and it means you wouldn't have to do autodiscovery since ksm will be on localhost with the special agent.

It seems like I can probably implement this workaround myself.

CharlyF commented 6 years ago

Hey @SleepyBrett,

I just wanted to weigh in on this thread - Apologies for the slight delay!

You are correct, in this case, we will be looking for pods running a container image called kube-state-metrics in order to automatically schedule the kube-state-metrics (KSM) check on each agent running on a node where the aforementioned pod is.

As you probably saw in the doc @xvello mentioned, this is one of the several processes used to do Autodiscovery (relying on annotations, on KV stores).

I understand that given your environment this may not be the desired behavior, luckily it is possible to disable it via a config map.

This default behavior is enabled via the file backend for Autodiscovery. You can see the configuration at conf.d/kubernetes_state.d/auto_conf.yaml or here on github at: https://github.com/DataDog/integrations-core/blob/master/kubernetes_state/auto_conf.yaml

This file defines the configuration of the check, as well as the identifiers to use for discovery of containers/pods the integration should monitor.

You can disable this behavior by adding a config map to the datadog-agent which replaces the auto_conf.yaml file at /conf.d/kubernetes_state.d/auto_conf.yaml with an empty one. This will stop, the KSM integration from being loaded automatically.

You can then optionally enable the integration for the KSM instances you do want to monitor by adding the following annotations to their manifest:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  annotations:
    ad.datadoghq.com/kube-state-metrics.check_names: '["kube-state-metrics"]'
    ad.datadoghq.com/kube-state-metrics.init_configs: '[{}]'
    ad.datadoghq.com/kube-state-metrics.instances: '[{"kube_state_url": "http://%%host%%:8080/metrics"}]'
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.3.1
        ports:
        - name: http-metrics
          containerPort: 8080
[...]

You can find more details on file based and annotation based autodiscovery in our documentation at: https://docs.datadoghq.com/agent/autodiscovery/#template-source-kubernetes-pod-annotations

That being said as @xvello mentioned Datadog does not require our own instance of Kube-State-Metrics. This is an optional dependency and we’re happy to monitor the existing instances on your cluster.

In addition to this we wanted to share a bit more about the performance challenges being seen with KSM in your cluster, since there have been a few issues on this topic.

There are presently two bottlenecks with the kube-state-metrics check which are impacting your environment:

1/ Payload size in large clusters

The size of the output generated when hitting the /metrics grows with the size of the cluster being monitored. In large clusters can result in 50MB - 100MB+ of output to parse on each check run. We collect this data every 15 seconds, and the agent at times is not able to parse this in a timely manner, which is why you see the liveness probe killing the agent.

There are a few ways to address this:

Namespace the KSM by resource (pods, daemonsets ...) and deploy several replicas (covering all the resources you care about). The KSM check would be run by several agents but each agent would collect a smaller subset of the overall metrics, allowing the data to be returned and parsed in a timely manner. There are some flags and options that can be set to do so (happy to help out if it's of interest).
Optimizing DD_CHECK_RUNNERS DD_CHECK_RUNNERS tells the agent how many parallel goroutines to use to run checks. Setting DD_CHECK_RUNNERS to 0 will instruct the agents to identify and set the optimal number of runners based on the collector queue.

So even if you set this in your DaemonSet, only the agent running the KSM check will have an increased number of runners.

Remove the liveness probe (optionally).
While we do not recommend doing so, removing the probe will prevent the agent from being restarted if a check is taking too long.

We agree that the performance here is not ideal and are working on a number of solutions to upstream kube-state-metrics and prometheus_client to allow them to scale as your clusters grow.

2/ Processing limitations of the upstream library

One of the most recent improvements we’ve made was to contribute a 4.5x performance improvement on the standard python prometheus-client library. You can find the PR at: https://github.com/prometheus/client_python/pull/282

I hope this helps provide a better understanding of the behavior you are seeing and offers some insight into our plans moving forward. We would be happy to discuss any ideas or suggestions you may have on how to further improve the experience.

Thank you for your patience and feedback on the check and integration.

Best, .C

SleepyBrett commented 6 years ago

The size of the output generated when hitting the /metrics grows with the size of the cluster being monitored. In large clusters can result in 50MB - 100MB+ of output to parse on each check run. We collect this data every 15 seconds, and the agent at times is not able to parse this in a timely manner, which is why you see the liveness probe killing the agent.

This seems to be a problem with your agent. As a test I spun up a pod with ksm+venuer-prometheus+venuer and it can parse and ship it without fail (~75nodes, ~7500pods), not to mention my cluster-local prometheus is scraping it quite happily. You do a significant amount of munging the data gathered from KSM, I'd suggest that the proper direction for you to move in is to create your own "KSM" that does all the transforms that you need and ship directly from that pod.

endzyme commented 5 years ago

I believe we're hitting the same issue with out cluster. Is there a definitive way to test that the /metrics payload is "too large" and could cause the issue which results in liveness probe killing the container?

Our basic symptoms are that any datadog agent living on the same node as the datadog-kube-state-metrics instance has lots of restarts which result in gaps in data collection of kube-state-metrics.

CharlyF commented 5 years ago

@endzyme thank you for your report and apologies for the headache.

We still have this on our roadmap, it has not been something easy to fix - We decided against forking KSM (or re-writing our own version of it).

The current workaround solution is to have one agent, with higher memory/cpu specs, deployed as a sidecar of KSM and solely run the KSM check, as suggested earlier in this thread.

If this is still not enough we can split up the KSM deployment into 2 collectors (one with pod level data and the other one with the rest) and the agent runs the KSM checks at different frequencies in order to provide a stable and consistent collection.

While these are workarounds, we are planning on having a stable solution that would benefit from local metrics collected from the kubelet and cluster level metrics collected from KSM, in order to eliminate as much as possible the overlap that exists between them. We are also planning on discussing with the team working on KSM to best integrate with them and offer a better experience to our users. In the meantime, we do thank you for your patience and should you be interested in one of the aforementioned workarounds, let me know, I'd be happy to share manifests.

Best, .C

endzyme commented 5 years ago

Thanks @CharlyF for the workarounds! These will be helpful as we scale.

I am thinking of contributing to the existing datadog agent helm chart, to allow for a separate agent deploy specifically bound to kube-state-metrics collection. This would enable people to manage different resource constraints for the "normal" agents vs the one that collects kube-state-metrics. Is that something that would be advantageous to datadog helm chart users? Or would it be better to wait for the root cause fix?

CharlyF commented 5 years ago

Of course! With regards to the contribution, we are assessing how to best introduce this feature so it does not cause retro-compatibility issues as we introduce a stable solution later on. We are currently working on it and we should be meeting up with the KSM team next week at Kubecon. I'm setting up a reminder to keep you posted here.

endzyme commented 5 years ago

Thanks for the heads up - maybe we'll run into each other at kubecon. Sounds like I should hold off on contributing until you all can come up with a game plan. We appreciate you all digging into this!

andor44 commented 5 years ago

We are running into this as well. Our clusters are growing to sizes where the output of http://<ksm>/metrics is in the tens of MBs, and presumably DD is taking too long to process it, getting healthcheck-killed in the process.

I discovered something that can help a lot to alleviate this issue until a more final fix is found:

Remove CPU limit on KSM. k8s uses CFS quotas to enforce CPU limits. Currently it has a hardcoded 100ms accounting period (a PR has been merged to make this configurable but it's still behind an alpha feature gate). This can hurt latency-sensitive applications such as KSM quite hard, as it will quickly blow through its CPU quota and get throttled. In our case a request to its /metrics was taking 5-6 seconds due to CPU throttling. Removing CPU limits cut that to 4-500 ms, so a full order of magnitude faster.

SleepyBrett commented 5 years ago

I think a good easy solution for this and other kubernetes issues is to create a stripped down agent that ONLY does checks, ship it with the checks, all off, allow us to turn them on. This container should not perform any other checks including built in system checks. Also it should be set up to not run as root (a huge problem with your other images).

In this way we could sidecar that agent onto ksm, customers on our cluster could sidecar onto their nginx pods and whatnot to get your integration.

Those agents should also be set up just to emit statsd (or be able to be configured to do that) with configuration for statsd target host and port.

andor44 commented 5 years ago

FYI KSM 1.5 was released a couple weeks ago. KSM's performance was massively improved in https://github.com/kubernetes/kube-state-metrics/issues/498 so that should help with the above issue too. In our case it was an almost 10x improvement in /metrics!

naseemkullah commented 5 years ago

So I followed the advice here (no cpu limits on v1.5.0 ksm) but the collocated datadog-agent pod still restarts non stop due to OOM errors, I do not want to increase memory requests just cause one pod of the DS is restarting though. Any suggestions?

Logs say:

[ TRACE ] 2019-02-19 16:10:23 INFO (stats.go:265) - flushed stat payload; url: https://trace.agent.datadoghq.com, time:599.573069ms, size:4141 bytes
[ TRACE ] 2019-02-19 16:10:23 INFO (trace.go:102) - flushed trace payload to the API, time:596.823858ms, size:15513 bytes
[ AGENT ] AGENT EXITED WITH CODE 256, SIGNAL 9, KILLING CONTAINER
[ TRACE ] 2019-02-19 16:10:30 INFO (main.go:35) - received signal 15 (terminated)
[ TRACE ] 2019-02-19 16:10:30 INFO (agent.go:140) - exiting

endzyme commented 5 years ago

So I followed the advice here (no cpu limits on v1.5.0 ksm) but the collocated datadog-agent pod still restarts non stop due to OOM errors, I do not want to increase memory requests just cause one pod of the DS is restarting though. Any suggestions?

Logs say:
[ TRACE ] 2019-02-19 16:10:23 INFO (stats.go:265) - flushed stat payload; url: https://trace.agent.datadoghq.com, time:599.573069ms, size:4141 bytes
[ TRACE ] 2019-02-19 16:10:23 INFO (trace.go:102) - flushed trace payload to the API, time:596.823858ms, size:15513 bytes
[ AGENT ] AGENT EXITED WITH CODE 256, SIGNAL 9, KILLING CONTAINER
[ TRACE ] 2019-02-19 16:10:30 INFO (main.go:35) - received signal 15 (terminated)
[ TRACE ] 2019-02-19 16:10:30 INFO (agent.go:140) - exiting

I think it's been suggested a few times above to try deploying a separate set of agents specifically for kube-state-metrics collection. That way you can manage those resources differently than your normal agents for metrics and statsd collection.

naseemkullah commented 5 years ago

Hmm seeing that no proper inclusion of this approach is in the helm chart, I opted to just increase the memory limit of the agent ds to allow the one that interacts with ksm to consume more memory, so far it yields good results, e.g. 0 restarts.

hkaj commented 5 years ago

Hi all,

Quick update on this issue. With the cluster agent, it is now possible to run the kubernetes_state check as a cluster-level check (more info here).

The main advantage of this is that cluster check can be run by a separate agent deployment: https://github.com/DataDog/datadog-agent/blob/6.12.2/Dockerfiles/manifests/agent-clusterchecks-only.yaml - which can be sized appropriately, with 2 or 3 replicas limited to a few GB of memory usage for example. This is especially useful for checks like KSM that are resource intensive, and create imbalance in the agent daemonset workload.

Since kubernetes_state is autodiscovered, you will also need to mount an empty file in place of the autodiscovery template in the agent daemonset to avoid a normal agent running the check as well. That template is at /etc/datadog-agent/conf.d/kubernetes_state.d/auto_conf.yaml in the agent container.

This is what we are using in production for kubernetes_state checks in large clusters (the check needs more memory than we recommend allocating in large clusters), and it works very well, allowing us to reduce memory requests and limits for the agent daemonset, making the placement of its pods easier.

We are working on describing this process more precisely in the documentation, please reach out on slack or to support if you need more details in the meantime.

kivagant-ba commented 5 years ago

Got OOM Killed with 256Mi memory limit:

datadog-8cpvf                                 0/1       6          7m51s 
datadog-4ht7t                                 0/1     Pending            0          64m
datadog-8cpvf                                 0/1     OOMKilled   7          15m
datadog-cluster-agent-66cf77477c-bz9hl        1/1     Running            0          29m
datadog-cluster-agent-66cf77477c-v62dj        1/1     Running            0          17m
datadog-kube-state-metrics-58c9f87548-cp5c4   1/1     Running            0          29m

Image: datadog/agent:6.13.0

Deployed from https://github.com/helm/charts/tree/master/stable/datadog

Only one pod crashes constantly.

2019-08-06 14:51:26 UTC | CORE | INFO | (pkg/collector/runner/runner.go:263 in work) | Running check disk
2019-08-06 14:51:27 UTC | CORE | INFO | (pkg/collector/runner/runner.go:329 in work) | Done running check disk
2019-08-06 14:51:27 UTC | CORE | INFO | (pkg/collector/runner/runner.go:263 in work) | Running check kubernetes_state
AGENT EXITED WITH CODE 256, SIGNAL 9, KILLING CONTAINER
2019-08-06 14:23:13 UTC | TRACE | INFO | (main.go:23 in handleSignal) | received signal 15 (terminated)
2019-08-06 14:23:13 UTC | TRACE | INFO | (pkg/trace/agent/agent.go:127 in loop) | Exiting...

    Limits:
      ephemeral-storage:  512Mi
      memory:             256Mi
    Requests:
      cpu:                200m
      ephemeral-storage:  100Mi
      memory:             256Mi

I guess this happened because priorityClassName hadn't been set and the node had a lot of pods after cluster rollout. I updated it to system-node-critical for daemonset and system-cluster-critical for clusterAgent.

UP: I found that the chart has the lines in values.yaml:

  # confd:
  #   redisdb.yaml: |-
  #     init_config:
  #     instances:
  #       - host: "name"
  #         port: "6379"
  #   kubernetes_state.yaml: |-
  #     ad_identifiers:
  #       - kube-state-metrics
  #     init_config:
  #     instances:
  #       - kube_state_url: http://%%host%%:8080/metrics

So according to the comment I replaced them with

  confd:
    kubernetes_state.yaml: ""

UP2: even after the empty file was configured one pod still crashes with the same error message after Running check kubernetes_state.

@hkaj could you help with the issue, please? I can't find how the check can be disabled.

UP2: Shame on me, I found that clusterchecksDeployment.enabled setting was unintentionally disabled after the chart update from upstream.

UP3: Finally fixed (many thanks to @hkaj for the help) by adding these settings into the chart values:

datadog:
  volumes:
    - name: empty-dir
      emptyDir: {}
  volumeMounts:
    - name: empty-dir
      mountPath: /etc/datadog-agent/conf.d/kubernetes_state.d
      readOnly: true

hkaj commented 5 years ago

Hi @kivagant-ba - the issue here is that the agent running the kubernetes_state needs more memory than that to complete it. This is due to the amount of metrics that ksm exposes (which is both a good thing for observability, and a not-so-good thing for resource usage 😄 ). The two solutions you have are:

bump the memory limit
use a dedicated deployment of agent(s) to run the kubernetes_state check, and potentially other heavy checks

What I would consider if I were you is the overall memory usage of your cluster, i.e. if you have 3 nodes and need to add 64MB of RAM to the daemonset to not OOM, that's 192 MB total. Less than what a separate agent deployment would need. If you have a larger cluster, or need more RAM to not OOM, the side deployment is better. Here's the PR for the docs on how to do that btw: https://github.com/DataDog/documentation/pull/5013 - it's very early stage, but the instructions are correct.

kivagant-ba commented 5 years ago

@hkaj , the documentation link really helped! I updated original message to collect everything together.

DataDog / datadog-agent

k8s w/ ksm integration issues #1853