DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Spotty kubernetes event collection #3198

Open andor44 opened 5 years ago

andor44 commented 5 years ago

Output of the info page (if this is a bug)

» k exec -it datadog-5lnhk agent status
Getting the status from the agent.

===============
Agent (v6.10.0)
===============

  Status date: 2019-03-25 13:53:10.729646 UTC
  Pid: 380
  Python Version: 2.7.15
  Logs:
  Check Runners: 4
  Log Level: WARNING

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -71µs
    System UTC time: 2019-03-25 13:53:10.729646 UTC

  Host Info
  =========
    bootTime: 2019-03-20 10:55:24.000000 UTC
    kernelVersion: 4.19.25-coreos
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: buster/sid
    procs: 73
    uptime: 58s
    virtualizationRole: guest
    virtualizationSystem: kvm

  Hostnames
  =========
    host_aliases: [redacted]
    hostname: redacted
    socket-fqdn: datadog-5lnhk
    socket-hostname: datadog-5lnhk
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Total Runs: 29,508
      Metric Samples: Last Run: 6, Total: 177,042
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    disk (2.1.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 244, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 123ms

    docker
    ------
      Instance ID: docker [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 317, Total: 1 M
      Events: Last Run: 0, Total: 1,916
      Service Checks: Last Run: 1, Total: 29,507
      Average Execution Time : 48ms

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 5, Total: 147,535
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    io
    --
      Instance ID: io [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 130, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    kubelet (2.4.0)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 437, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 118,025
      Average Execution Time : 427ms

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 377
      Service Checks: Last Run: 5, Total: 225
      Average Execution Time : 100ms

    load
    ----
      Instance ID: load [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 6, Total: 177,042
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    memory
    ------
      Instance ID: memory [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 17, Total: 501,619
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    network (1.9.0)
    ---------------
      Instance ID: network:2a218184ebe03606 [OK]
      Total Runs: 29,508
      Metric Samples: Last Run: 105, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 345ms

    ntp
    ---
      Instance ID: ntp:b4579e02d1981c12 [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 1, Total: 29,507
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 29,507
      Average Execution Time : 0s

    uptime
    ------
      Instance ID: uptime [OK]
      Total Runs: 29,508
      Metric Samples: Last Run: 1, Total: 29,508
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 29,508
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 3,718
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 62,734
    TimeseriesV1: 29,508

  API Keys status
  ===============
    API key ending with e5d88: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - redacted

==========
Logs Agent
==========

  docker
  ------
    Type: docker
    Status: OK
    Inputs: 86865f94527467d22142d81c3fd535d4ba3c824aad2df3f7d1c12ede6b131cb5

=========
Aggregator
=========
  Checks Metric Sample: 39.2 M
  Dogstatsd Metric Sample: 73,768
  Event: 2,294
  Events Flushed: 2,294
  Number Of Flushes: 29,508
  Series Flushed: 33.7 M
  Service Check: 502,328
  Service Checks Flushed: 531,835

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 73,768
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Packet Reading Errors: 0
  Udp Packets: 73,769
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

Describe what happened: Kubernetes event collection breaks shortly after an agent acquires the leader lock. If I kill the agent that holds the leader lock, another one acquires it and collects events for a short while (minutes), but then it also stops reporting k8s events. I can seemingly repeat this any number of times.

Weirdly enough we have another cluster where event collection seems to work fine.

You can see in the agent output above that it collected 377 events and then nothing on later runs.

Describe what you expected: K8s event collection to work reliably

Steps to reproduce the issue: Datadog deployed with the official Helm chart (values used: here) on k8s 1.12

Additional environment details (Operating System, Cloud provider, etc): CoreOS Container Linux (latest stable), on-premises
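
For anyone trying to narrow this down, a minimal diagnostic sketch; the ConfigMap name assumes the agent's default leader-election settings, and the pod name is the one from the status output above, so adjust both to your deployment:

# Which agent pod currently holds the leader lock? The holder is recorded in
# the control-plane.alpha.kubernetes.io/leader annotation of the (default)
# "datadog-leader-election" ConfigMap.
kubectl get configmap datadog-leader-election -o yaml

# Re-run the event-collecting check by hand inside the leader pod to see
# whether it still returns any events.
kubectl exec -it datadog-5lnhk -- agent check kubernetes_apiserver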

andor44 commented 5 years ago

While searching through the issue tracker I bumped into https://github.com/DataDog/datadog-agent/issues/2020 which could be related, except the cluster where we have working event collection is actually the cluster with a higher number of events (though not necessarily more stressed) so I'd expect its events endpoint to be busier/take longer.

That 100ms timeout seems awfully low though. Is there no way to override that without having a config drop-in (e.g. an env var)? 😞
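
As a general rule the agent maps datadog.yaml keys to DD_-prefixed environment variables (key upper-cased, dots replaced with underscores), so something like the sketch below should work once the relevant key is known. DD_SOME_TIMEOUT_SETTING and the daemonset name "datadog" are placeholders, not confirmed names from this thread:

# Set an agent option via the DD_ environment-variable convention on the
# node-agent daemonset; the variable name here is a placeholder for whichever
# key actually controls that timeout.
kubectl set env daemonset/datadog DD_SOME_TIMEOUT_SETTING=500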

CharlyF commented 5 years ago

Hey @andor44 - Sorry to hear that you are having this issue.

We are working on improving the Kubernetes event collection as one of the top priority tasks. I am marking this issue so we can track it internally. We will keep you posted as soon as we make progress.

Best, .C

andor44 commented 5 years ago

~I tried enabling the cluster agent. At first it also didn't collect events, but after raising CPU limits, event collection seems to work with the cluster agent. This leads me to believe the issue I'm experiencing is still related to #2020.~

~I think it might be worth adding a section about the low request timeout for event collection to the readme and/or to the troubleshooting section.~

EDIT: never mind, scratch all that, I was too quick on the trigger. It exhibits the same behavior: event collection works for a few minutes, then it stops.
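
For reference, a rough sketch of how the cluster agent can be enabled and given more CPU headroom through the chart; the value names below are from memory of the stable/datadog chart of that era and may differ between chart versions:

# Enable the Datadog Cluster Agent and raise its CPU request/limit (sizes are
# illustrative; check your chart version for the exact keys).
helm upgrade --install datadog stable/datadog \
  --set clusterAgent.enabled=true \
  --set clusterAgent.resources.requests.cpu=200m \
  --set clusterAgent.resources.limits.cpu=500m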

CharlyF commented 5 years ago

Understood; as it uses the same code path, I would expect the same behaviour. Thanks for sharing, we will start addressing it this week.

Best, .C

mshade commented 5 years ago

Just chiming in that we're also experiencing this issue.

CharlyF commented 5 years ago

@mshade would you mind opening a ticket with our solutions team? We'd like to gather more details to confirm that. Thank you, and sorry for the headache.

Best, .C

mshade commented 5 years ago

Thanks @CharlyF -- we do have a ticket open, but I hadn't seen the github issue, so I just wanted to ping on it as well.

kingpong commented 5 years ago

I have a similar problem with missing Kubernetes events, although in my case the agent does not seem to stop sending events completely; rather, it just skips the bulk of them. I have opened case 225225.
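
One rough way to gauge how many events the cluster is actually producing, to compare against what reaches Datadog; note this only counts events still inside the API server's retention window (one hour by default):

# Count events currently retained by the API server across all namespaces.
kubectl get events --all-namespaces --no-headers | wc -l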

miskr-instructure commented 8 months ago

In case anybody finds this helpful, here is what Datadog support had to say:

We determined that there was a rate limit applied to Kubernetes events for your account back in 2019. Specifically, it was limiting 95% of the Kubernetes traffic. This was standard practice at the time, due to excessively high traffic and billing concerns for customers. Back then, Kubernetes events were considered custom events, and therefore customers were charged for them. Now that Kubernetes events are no longer considered custom events, my colleagues will remove the limit.

...so if anybody else encounters this, open a support ticket and ask for this rate limit to be removed for your account.