DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Kubernetes agent large number of ksm timeouts #1665

Closed SleepyBrett closed 6 years ago

SleepyBrett commented 6 years ago

Output of the info page (if this is a bug)

kcm exec -it dd-prod-datadog-4zfjf s6-svstat /var/run/s6/services/agent/
up (pid 343) 941 seconds

kcm exec -it dd-prod-datadog-4zfjf agent status
Getting the status from the agent.

==============
Agent (v6.1.4)
==============

  Status date: 2018-04-30 21:45:23.534591 UTC
  Pid: 343
  Python Version: 2.7.14
  Logs:
  Check Runners: 1
  Log Level: WARNING

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    System UTC time: 2018-04-30 21:45:23.534591 UTC

  Host Info
  =========
    bootTime: 2018-04-22 21:30:38.000000 UTC
    kernelVersion: 4.14.30-coreos-r1
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.4
    procs: 74
    uptime: 690953
    virtualizationRole: guest
    virtualizationSystem: xen

  Hostnames
  =========
    ec2-hostname: ip-172-16-207-198.us-west-2.compute.internal
    hostname: i-04d28d7bc22b17564
    instance-id: i-04d28d7bc22b17564
    socket-fqdn: dd-prod-datadog-4zfjf
    socket-hostname: dd-prod-datadog-4zfjf

=========
Collector
=========

  Running Checks
  ==============
    cpu
    ---
      Total Runs: 57
      Metrics: 6, Total Metrics: 336
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    disk
    ----
      Total Runs: 57
      Metrics: 162, Total Metrics: 9234
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    docker
    ------
      Total Runs: 57
      Metrics: 1695, Total Metrics: 96605
      Events: 0, Total Events: 21
      Service Checks: 1, Total Service Checks: 57

    file_handle
    -----------
      Total Runs: 57
      Metrics: 1, Total Metrics: 57
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    io
    --
      Total Runs: 57
      Metrics: 247, Total Metrics: 13908
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    kube_dns
    --------
      Total Runs: 56
      Metrics: 81, Total Metrics: 4536
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    kubelet
    -------
      Total Runs: 56
      Metrics: 5363, Total Metrics: over 100K
      Events: 0, Total Events: 0
      Service Checks: 3, Total Service Checks: 168

    kubernetes_state
    ----------------
      Total Runs: 56
      Metrics: 77021, Total Metrics: over 1M
      Events: 0, Total Events: 0
      Service Checks: 5178, Total Service Checks: over 100K

    load
    ----
      Total Runs: 56
      Metrics: 6, Total Metrics: 336
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    memory
    ------
      Total Runs: 56
      Metrics: 14, Total Metrics: 784
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    network
    -------
      Total Runs: 56
      Metrics: 440, Total Metrics: 24640
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    ntp
    ---
      Total Runs: 56
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 56

    uptime
    ------
      Total Runs: 56
      Metrics: 1, Total Metrics: 56
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 75
  IntakeV1: 23
  RetryQueueSize: 0
  Success: 609
  TimeseriesV1: 75

  API Keys status
  ===============
    https://6-1-4-app.agent.datadoghq.com,*************************2c443: API Key valid

==========
Logs Agent
==========

  Logs Agent is not running

=========
DogStatsD
=========

  Checks Metric Sample: 2.384761e+06
  Event: 22
  Events Flushed: 22
  Number Of Flushes: 75
  Series Flushed: 1.957371e+06
  Service Check: 130554
  Service Checks Flushed: 127351
  Dogstatsd Metric Sample: 188

Describe what happened:

[ AGENT ] 2018-04-30 21:33:09 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]

I run a moderately large cluster of 46 nodes (37 true workers, m4.10xl, about 2900 pods). I installed your agent (6.1.4) as a DaemonSet using your stable Helm chart. I removed all resource limits (assuming at first that the problem was CPU throttling), and that let it scrape metrics some of the time, but I'm still seeing large numbers of timeout errors, which suggests we are right on the edge of the 1-second read timeout shown in the traceback above.

When I exec into another container and curl the endpoint, it responds within about a second. curl timing log:

root@util-934915283-7f2xc:/# curl http://25.128.74.214:8080/metrics -w "@format" -o /dev/null -s

            time_namelookup:  0.000
               time_connect:  0.002
            time_appconnect:  0.000
           time_pretransfer:  0.002
              time_redirect:  0.000
         time_starttransfer:  0.847
                            ----------
                 time_total:  0.901
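
To make the "on the edge" reading concrete, here is a minimal sketch that compares the curl timings above with the `timeout=1` visible in the traceback. In `requests`, the read timeout bounds how long the client waits for the server to send data, so `time_starttransfer` is the relevant figure. The `headroom` helper and the constant names are hypothetical, introduced only for this illustration.

```python
# Timings copied from the curl output above (seconds).
TIME_STARTTRANSFER = 0.847  # time until the first response byte arrived
TIME_TOTAL = 0.901          # time until the whole body was received
READ_TIMEOUT = 1.0          # from the traceback: requests.get(..., timeout=1)


def headroom(first_byte_s, timeout_s):
    """Return (seconds of slack, slack as a fraction of the timeout).

    Hypothetical helper for illustration; it is not part of the agent.
    """
    slack = timeout_s - first_byte_s
    return slack, slack / timeout_s


slack_s, slack_frac = headroom(TIME_STARTTRANSFER, READ_TIMEOUT)
print("slack before the read timeout: %.3f s (%.1f%% of the timeout)"
      % (slack_s, slack_frac * 100))
```

With roughly 0.15 s of slack, a modest increase in the time kube-state-metrics needs to start responding would be enough to cross the limit, which is consistent with the check succeeding only some of the time.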

Describe what you expected: I expect ksm metrics to find their way to Datadog; on my largest cluster this has been problematic. I expected there to be an env variable or config parameter to tweak this timeout, but I can't find one in the documentation.
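
The effect of making the timeout configurable can be reproduced locally. The sketch below is not the agent's code and does not use any Datadog option: it uses the standard library's `urllib` in place of `requests`, and a hypothetical `SlowMetricsHandler` stands in for a slow `/metrics` endpoint. The `scrape` function, `SERVER_DELAY_S`, and the timeout values are all assumptions made for the demonstration.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SERVER_DELAY_S = 0.3  # hypothetical delay standing in for a slow /metrics endpoint


class SlowMetricsHandler(BaseHTTPRequestHandler):
    """Toy endpoint that waits before answering, like a busy exporter."""

    def do_GET(self):
        time.sleep(SERVER_DELAY_S)
        body = b"demo_metric 1\n"
        try:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Ignore any proxy settings from the environment so the request stays local.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def scrape(url, timeout_s):
    """Fetch the URL; return the body text, or None if the request failed or timed out."""
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return resp.read().decode()
    except OSError:  # covers socket timeouts and urllib.error.URLError
        return None


server = ThreadingHTTPServer(("127.0.0.1", 0), SlowMetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/metrics" % server.server_port

too_short = scrape(url, timeout_s=0.1)    # shorter than the server delay
long_enough = scrape(url, timeout_s=2.0)  # comfortably longer than the delay
server.shutdown()

print("timeout 0.1 s:", "timed out" if too_short is None else "ok")
print("timeout 2.0 s:", "timed out" if long_enough is None else "ok")
```

The same endpoint fails or succeeds depending only on the client-side timeout, which is why a per-check setting (rather than a hard-coded value) would address the behavior described here.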

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc): Kubernetes 1.9.6 on AWS, approx. 2900 pods on m4.10xl nodes

full log from dd-agent on the same node as ksm:

[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
[s6-init] ensuring user provided files have correct perms...exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 01-check-apikey.sh: executing...
[cont-init.d] 01-check-apikey.sh: exited 0.
[cont-init.d] 10-dogstatsd-socket.sh: executing...
[cont-init.d] 10-dogstatsd-socket.sh: exited 0.
[cont-init.d] 50-ecs.sh: executing...
[cont-init.d] 50-ecs.sh: exited 0.
[cont-init.d] 50-kubernetes.sh: executing...
Disabling the apiserver check as leader election is disabled
[cont-init.d] 50-kubernetes.sh: exited 0.
[cont-init.d] 50-mesos.sh: executing...
[cont-init.d] 50-mesos.sh: exited 0.
[cont-init.d] 51-docker.sh: executing...
[cont-init.d] 51-docker.sh: exited 0.
[cont-init.d] 59-defaults.sh: executing...
[cont-init.d] 59-defaults.sh: exited 0.
[cont-init.d] 60-network-check.sh: executing...
[cont-init.d] 60-network-check.sh: exited 0.
[cont-init.d] 89-copy-customfiles.sh: executing...
[cont-init.d] 89-copy-customfiles.sh: exited 0.
[cont-init.d] done.
[services.d] starting services
[PROCESS] starting process-agent
[ TRACE ] starting trace-agent
[ AGENT ] starting agent
[services.d] done.
[PROCESS] 2018-04-30 21:26:28 INFO (tagger.go:78) - starting the tagging system
[PROCESS] 2018-04-30 21:26:28 INFO (tagger.go:147) - kubelet tag collector successfully started
[ TRACE ] 2018-04-30 21:26:28 INFO (main.go:175) - trace-agent not enabled.
[ TRACE ] Set env var DD_APM_ENABLED=true or add
[ TRACE ] apm_enabled: true
[ TRACE ] to your datadog.conf file.
[ TRACE ] Exiting.
[PROCESS] 2018-04-30 21:26:29 INFO (tagger.go:147) - kube-service-collector tag collector successfully started
[PROCESS] 2018-04-30 21:26:29 INFO (tagger.go:147) - docker tag collector successfully started
[PROCESS] 2018-04-30 21:26:30 INFO (config.go:333) - overriding API key from env DD_API_KEY value
[ AGENT ] 2018-04-30 21:26:30 UTC | WARN | (file.go:73 in Collect) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
[ AGENT ] 2018-04-30 21:26:30 UTC | WARN | (check.go:243 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-04-30 21:26:30 UTC | WARN | (check.go:244 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-04-30 21:26:30 UTC | WARN | (check.go:269 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (disk).
[ AGENT ] 2018-04-30 21:26:31 UTC | WARN | (check.go:243 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-04-30 21:26:31 UTC | WARN | (check.go:244 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-04-30 21:26:31 UTC | WARN | (check.go:269 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (kubelet).
[ AGENT ] 2018-04-30 21:26:31 UTC | WARN | (check.go:243 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-04-30 21:26:31 UTC | WARN | (check.go:244 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-04-30 21:26:31 UTC | WARN | (check.go:269 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (network).
[ TRACE ] trace-agent exited with code 0, disabling
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-sctrsptsys-nonprod-5946c9dd94-bw5nv
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (configresolver.go:290 in getHost) | network  not found, trying bridge IP instead
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (check.go:243 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (check.go:244 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (check.go:269 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (kubernetes_state).
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-sctrsptsys-nonprod-5946c9dd94-bw5nv
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-dscustomercommsvc-production-646d99d6cc-9cnlf
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-dscustomercommsvc-production-646d99d6cc-9cnlf
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kube-flannel-p9hpm
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-sctransfersvc-nonprod-754d96bb66-s7gcl
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-sctransfersvc-nonprod-754d96bb66-s7gcl
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod filebeat-9kmn2
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-scawarehousemgmt-nonprod-5d5756c9f4-8tchs
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-scawarehousemgmt-nonprod-5d5756c9f4-8tchs
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod dev-experiments-service-787994dc6f-lb9q9
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod prod-5c656c8b89-jz4p5
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-merchrib-production-57448f4f7c-wfg6n
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-merchrib-production-57448f4f7c-wfg6n
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-ddosreviews-production-5b8689d7bd-7csph
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-ddosreviews-production-5b8689d7bd-7csph
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod prod-experiments-mongodb-2
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-reserve-production-6dccf745b-l9vt9
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-reserve-production-6dccf745b-l9vt9
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod sysdig-84mbm
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-ecommcheckout-production-7bb86f5c77-x5n2k
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-ecommcheckout-production-7bb86f5c77-x5n2k
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-runway-production-7454885f45-c76jm
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-runway-production-7454885f45-c76jm
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod payments-vantiv-authorize-service-68546ddc8f-pxgbl
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod payments-vantiv-authorize-service-68546ddc8f-pxgbl
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod journalbeat-4cp9n
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-ese-production-7b9c8678d5-8sfg6
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-ese-production-7b9c8678d5-8sfg6
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-sccpromise-nonprod-6c658bbbc4-dwnzm
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-sccpromise-nonprod-6c658bbbc4-dwnzm
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-scrouting-nonprod-57b4f6b596-jm9qs
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-scrouting-nonprod-57b4f6b596-jm9qs
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-tds-production-655c8d9b7c-cj2sq
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-tds-production-655c8d9b7c-cj2sq
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod metrics-router-prod-1069569284-rdprj
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-drproduct-production-7cf95d65-s94ql
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-drproduct-production-7cf95d65-s94ql
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (configresolver.go:290 in getHost) | network  not found, trying bridge IP instead
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (configresolver.go:290 in getHost) | network  not found, trying bridge IP instead
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (check.go:243 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (check.go:244 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-04-30 21:26:45 UTC | WARN | (check.go:269 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (kube_dns).
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kube-proxy-8xp84
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod node-problem-detector-62vsv
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-datalens-production-7fd5f78b9-7c969
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-datalens-production-7fd5f78b9-7c969
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-mriaudit-production-b8fb4b7fb-h9z97
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod kandi-mriaudit-production-b8fb4b7fb-h9z97
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod dev-experiments-mongodb-2
[ AGENT ] 2018-04-30 21:26:45 UTC | ERROR | (kubelet.go:170 in createService) | Failed to get ports for pod nginx-4217019353-lsz8k
[ AGENT ] 2018-04-30 21:26:46 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (__init__.py:189) | DEPRECATION NOTICE: `device_name` is deprecated, please use a `device:` tag in the `tags` list instead
[ AGENT ] 2018-04-30 21:26:50 UTC | ERROR | (docker_main.go:118 in fetchForDockerID) | Failed to inspect container 7f5d290984628c1c8736f7d3c0bfa921881ebb50caf96e5760d1cac811acf303 - Error: No such container: 7f5d290984628c1c8736f7d3c0bfa921881ebb50caf96e5760d1cac811acf303
[ AGENT ] 2018-04-30 21:26:50 UTC | WARN | (tagger.go:245 in Tag) | error collecting from docker: Error: No such container: 7f5d290984628c1c8736f7d3c0bfa921881ebb50caf96e5760d1cac811acf303
[ AGENT ] 2018-04-30 21:26:50 UTC | ERROR | (docker_main.go:118 in fetchForDockerID) | Failed to inspect container de0a34c3c65b281586e98a970cd210365e6e85e8f57a1e4af122960646bd482a - Error: No such container: de0a34c3c65b281586e98a970cd210365e6e85e8f57a1e4af122960646bd482a
[ AGENT ] 2018-04-30 21:26:50 UTC | WARN | (tagger.go:245 in Tag) | error collecting from docker: Error: No such container: de0a34c3c65b281586e98a970cd210365e6e85e8f57a1e4af122960646bd482a
[ AGENT ] 2018-04-30 21:26:50 UTC | ERROR | (docker_main.go:118 in fetchForDockerID) | Failed to inspect container c74ca97bc02d8f7137391b568dc9963a49a51c00b097170e5fcf73bd97f4a54a - Error: No such container: c74ca97bc02d8f7137391b568dc9963a49a51c00b097170e5fcf73bd97f4a54a
[ AGENT ] 2018-04-30 21:26:50 UTC | WARN | (tagger.go:245 in Tag) | error collecting from docker: Error: No such container: c74ca97bc02d8f7137391b568dc9963a49a51c00b097170e5fcf73bd97f4a54a
[ AGENT ] 2018-04-30 21:26:57 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (__init__.py:189) | DEPRECATION NOTICE: `device_name` is deprecated, please use a `device:` tag in the `tags` list instead
[ AGENT ] 2018-04-30 21:28:26 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
[ AGENT ] 2018-04-30 21:28:37 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
[ AGENT ] 2018-04-30 21:29:46 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
[ AGENT ] 2018-04-30 21:29:57 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
[ AGENT ] 2018-04-30 21:31:09 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
[ AGENT ] 2018-04-30 21:31:20 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
[ AGENT ] 2018-04-30 21:31:31 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
[ AGENT ] 2018-04-30 21:33:09 UTC | ERROR | (runner.go:276 in work) | Error running check kubernetes_state: [{"message": "HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/bin/agent/dist/checks/__init__.py\", line 332, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes_state/kubernetes_state.py\", line 196, in check\n    self.process(endpoint, send_histograms_buckets=send_buckets, instance=instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 352, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 316, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/prometheus/mixins.py\", line 472, in poll\n    response = requests.get(endpoint, headers=headers, stream=True, timeout=1, cert=cert, verify=verify)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 72, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/api.py\", line 58, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 508, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/sessions.py\", line 618, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/adapters.py\", line 521, in send\n    raise ReadTimeout(e, 
request=request)\nReadTimeout: HTTPConnectionPool(host='25.128.74.214', port=8080): Read timed out. (read timeout=1)\n"}]
CharlyF commented 6 years ago

Hey @SleepyBrett, thank you for opening this. We identified this issue recently and fixed it: the check's read timeout was too short (1 second, as the `read timeout=1` in your logs shows). The fix, https://github.com/DataDog/integrations-core/pull/1399, will ship in the next version of the agent (6.2), which will be released within 2 weeks.
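
To illustrate the failure mode, here is a minimal, self-contained sketch of a scrape against an endpoint that answers more slowly than the client's timeout allows. It is not the check's actual code: it uses Python 3's standard library rather than `requests`, and `SlowMetricsHandler`, `scrape`, `demo`, and the 0.5-second delay are all hypothetical names and values chosen for the demonstration.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical value: how long the fake metrics endpoint takes to answer.
DELAY_SECONDS = 0.5


class SlowMetricsHandler(BaseHTTPRequestHandler):
    """Fake metrics endpoint that responds only after DELAY_SECONDS."""

    def do_GET(self):
        time.sleep(DELAY_SECONDS)
        body = b"fake_metric 1\n"
        try:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the demo quiet


def scrape(url, timeout):
    """Return the response body, or None if the request timed out or failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode()
    except (socket.timeout, urllib.error.URLError):
        return None


def demo():
    """Scrape the slow endpoint with a too-short and a generous timeout."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SlowMetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/metrics" % server.server_port
    try:
        too_short = scrape(url, timeout=0.1)  # shorter than the delay
        long_enough = scrape(url, timeout=5)  # comfortably longer
    finally:
        server.shutdown()
        server.server_close()
    return too_short, long_enough


if __name__ == "__main__":
    print(demo())
```

In this sketch the first call gives up before the server answers and the second one gets the payload, which is the same pattern as a fixed 1-second timeout against a kube-state-metrics endpoint that needs longer to respond.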

SleepyBrett commented 6 years ago

Two weeks seems like a long time to go without any kube state metrics.

CharlyF commented 6 years ago

Agent version 6 has a 6-week release cycle. I apologise if that causes any issues on your end. As we are currently in the QA phase of the release, you can temporarily use the release candidate datadog/agent:6.2.0-rc.1, which embeds this fix, or the even more recent datadog/agent-dev:6-2-0-rc-2. I hope that can be a solution for you.

SleepyBrett commented 6 years ago

Respectfully, bug fixes aren't features. I shouldn't have to swallow untested pre-release features to get a bugfix.

mfpierre commented 6 years ago

Closing this, as it's fixed in 6.2.0, which was released a few days ago.