DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

http.Server: http: Accept error: request has been rate-limited #4491

Open axot opened 4 years ago

axot commented 4 years ago

Output of the info page (if this is a bug)

# agent status
Getting the status from the agent.

===============
Agent (v6.14.1)
===============

  Status date: 2019-11-25 08:04:57.021342 UTC
  Agent start: 2019-11-05 04:05:44.527832 UTC
  Pid: 1
  Go Version: go1.12.9
  Python Version: 2.7.16
  Check Runners: 4
  Log Level: WARN

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -53µs
    System UTC time: 2019-11-25 08:04:57.021342 UTC

  Host Info
  =========
    bootTime: 2019-11-05 03:11:53.000000 UTC
    kernelVersion: 4.15.0-1034-gke
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 10.1
    procs: 60
    uptime: 53m53s

  Hostnames
  =========
    host_aliases: [gke-xxx-test-app-0-app-red-0-2c62eb0b-jlpf.yyy gke-xxx-test-app-0-app-red-0-2c62eb0b-jlpf-xxx-test-app-0]
    hostname: gke-xxx-test-app-0-app-red-0-2c62eb0b-jlpf.c.yyy.internal
    socket-fqdn: datadog-f49ddd458-6zk92
    socket-hostname: datadog-f49ddd458-6zk92
    host tags:
      gke-xxx-test-app-0-4060f4d1-node
      prd-app
      zone:asia-northeast1-b
      instance-type:n1-standard-64
      internal-hostname:gke-xxx-test-app-0-app-red-0-2c62eb0b-jlpf.c.yyy.internal
      instance-id:5444726203643418295
      project:yyy
      numeric_project_id:824996079768
      cluster-location:asia-northeast1
      gci-ensure-gke-docker:true
      cluster-name:xxx-test-app-0
      gci-update-strategy:update_disabled
      kube-labels:beta.kubernetes.io/fluentd-ds-ready=true,cloud.google.com/gke-nodepool=app-red-0,cloud.google.com/gke-os-distribution=ubuntu,server_group=red
      created-by:projects/824996079768/zones/asia-northeast1-b/instanceGroupManagers/gke-xxx-test-app-0-app-red-0-2c62eb0b-grp
      enable-oslogin:false
      cluster-uid:4060f4d13b73b0d6375b30d37b5c05676cd2eaae63ce5cdcc73ab215d3fa3584
      disable-legacy-endpoints:true
      instance-template:projects/824996079768/global/instanceTemplates/gke-xxx-test-app-0-app-red-0-2c62eb0b
    hostname provider: gce
    unused hostname providers:
      configuration/environment: hostname is empty

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 116,156
      Metric Samples: Last Run: 6, Total: 696,930
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    disk (2.5.0)
    ------------
      Instance ID: disk:89011cceb1f16288 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.yaml
      Total Runs: 116,157
      Metric Samples: Last Run: 200, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 95ms

    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 116,157
      Metric Samples: Last Run: 578, Total: 1 M
      Events: Last Run: 0, Total: 1,909
      Service Checks: Last Run: 1, Total: 116,157
      Average Execution Time : 135ms

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 116,156
      Metric Samples: Last Run: 5, Total: 580,780
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 116,157
      Metric Samples: Last Run: 65, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    kubelet (3.3.2)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 116,156
      Metric Samples: Last Run: 628, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 464,624
      Average Execution Time : 375ms

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 116,157
      Metric Samples: Last Run: 6, Total: 696,942
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 116,156
      Metric Samples: Last Run: 17, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

    network (1.11.4)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 116,157
      Metric Samples: Last Run: 79, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 2ms

    ntp
    ---
      Instance ID: ntp:7e1f812862c38a66 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.yaml
      Total Runs: 116,157
      Metric Samples: Last Run: 1, Total: 116,157
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 116,157
      Average Execution Time : 0s

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 116,157
      Metric Samples: Last Run: 1, Total: 116,157
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 116,156
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 10,447
    Metadata: 0
    Requeued: 1
    Retried: 1
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 242,759
    TimeseriesV1: 116,156

  Transaction Errors
  ==================
    Total number: 1
    Errors By Type:

  API Keys status
  ===============
    API key ending with 2c40c: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 2c40c

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 176.4 M
  Dogstatsd Metric Sample: 21.7 M
  Event: 1,910
  Events Flushed: 1,910
  Number Of Flushes: 116,156
  Series Flushed: 168.1 M
  Service Check: 1.8 M
  Service Checks Flushed: 1.9 M

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 21.7 M
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 1.9 G
  Udp Packet Reading Errors: 0
  Udp Packets: 21.7 M
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

Describe what happened: The Datadog Agent logged "http: Accept error: request has been rate-limited", and after that we could no longer see any APM events.

Describe what you expected: We want to know the root cause and how to solve it.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:07:52.624988201Z 2019-11-25 06:07:52 UTC | TRACE | WARN | (pkg/trace/info/stats.go:111 in LogStats) | [lang:php lang_version:7.2.24 interpreter:fpm-fcgi tracer_version:0.33.0] -> traces_dropped(decoding_error:55). Enable debug logging for more details.
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:07:52.625030217Z 2019-11-25 06:07:52 UTC | TRACE | WARN | (pkg/trace/api/api.go:518 in watchdog) | CPU threshold exceeded (apm_config.max_cpu_percent: 50): 1
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:02.624100679Z 2019-11-25 06:08:02 UTC | TRACE | WARN | (pkg/trace/api/api.go:518 in watchdog) | CPU threshold exceeded (apm_config.max_cpu_percent: 50): 1
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.297445136Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 5ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.302707184Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 10ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.313077695Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 20ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.333479258Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 40ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.373653815Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 80ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.454073332Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 160ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.614451933Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 320ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:07.934798176Z 2019-11-25 06:08:07 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 640ms
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:08.575108959Z 2019-11-25 06:08:08 UTC | TRACE | ERROR | (pkg/trace/agent/log.go:65 in ReceiveMessage) | Too many messages to log, skipping for a bit...
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:12.578466285Z 2019-11-25 06:08:12 UTC | TRACE | ERROR | (pkg/trace/api/api.go:616 in Write) | http.Server: http: Accept error: request has been rate-limited; retrying in 1s
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:13.579612403Z 2019-11-25 06:08:13 UTC | TRACE | ERROR | (pkg/trace/api/api.go:372 in handleTraces) | Cannot decode v0.4 traces payload: unexpected EOF
datadog-f49ddd458-6zk92 trace-agent 2019-11-25T06:08:13.579646122Z 2019-11-25 06:08:13 UTC | TRACE | ERROR | (pkg/trace/api/api.go:372 in handleTraces) | Cannot decode v0.4 traces payload: unexpected EOF

unacceptable commented 4 years ago

@axot was this Datadog rate-limiting you? We are running into something similar now.

unacceptable commented 4 years ago

I found this, which might help you out: https://docs.datadoghq.com/tracing/troubleshooting/?tab=java#max-connection-limit