DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] Error removing auto-discovered containers resulting in all further docker events being ignored #15170

Status: Closed (far-blue closed this issue 1 year ago)

far-blue commented 1 year ago

I regularly see errors while the agent handles container removal events. After such an error, no further Docker events are processed (e.g. no logs are gathered for new containers) until the Datadog Agent is restarted.

Agent Environment

Example logs around the error:

2023-01-19 11:49:00 GMT | CORE | INFO | (pkg/serializer/serializer.go:403 in sendMetadata) | Sent metadata payload, size (raw/compressed): 1797/598 bytes.
2023-01-19 11:49:00 GMT | CORE | INFO | (pkg/serializer/serializer.go:427 in SendProcessesMetadata) | Sent processes metadata payload, size: 1461 bytes.
2023-01-19 11:52:08 GMT | CORE | INFO | (pkg/collector/scheduler/scheduler.go:131 in Cancel) | Unscheduling check php_fpm:a2bc89b13d85861a
2023-01-19 11:52:08 GMT | CORE | INFO | (pkg/logs/schedulers/ad/scheduler.go:113 in Unschedule) | New source to remove: entity: docker://aa927fcfaee11e4d3246c4ebb47b6eeddc9b790188fd19af360f608697038698
2023-01-19 11:52:08 GMT | CORE | INFO | (pkg/autodiscovery/config_poller.go:105 in stream) | kubernetes-container-allinone provider: collected 0 new configurations, removed 3
2023-01-19 11:52:08 GMT | CORE | INFO | (pkg/serializer/serializer.go:403 in sendMetadata) | Sent metadata payload, size (raw/compressed): 4507/1216 bytes.
2023-01-19 11:52:08 GMT | CORE | INFO | (pkg/autodiscovery/config_poller.go:105 in stream) | kubernetes-container-allinone provider: collected 0 new configurations, removed 3
2023-01-19 11:52:09 GMT | CORE | WARN | (pkg/logs/internal/tailers/docker/tailer.go:181 in tryRestartReader) | Could not restart the docker reader for container aa927fcfaee1: Error: No such container: aa927fcfaee11e4d3246c4ebb47b6eeddc9b790188fd19af360f608697038698:
2023-01-19 11:52:09 GMT | CORE | INFO | (pkg/logs/internal/tailers/docker/tailer.go:108 in Stop) | Stop tailing container: aa927fcfaee1
2023-01-19 11:52:09 GMT | CORE | WARN | (pkg/logs/internal/tailers/docker/tailer.go:181 in tryRestartReader) | Could not restart the docker reader for container 3850f4b91c96: Error: No such container: 3850f4b91c96560f93ce4848a4bd4d39722ebc7e1799eb8ea7a19e8cdd5cbee8:
2023-01-19 11:52:09 GMT | CORE | INFO | (pkg/logs/internal/tailers/docker/tailer.go:108 in Stop) | Stop tailing container: 3850f4b91c96
2023-01-19 11:52:11 GMT | CORE | WARN | (pkg/workloadmeta/store.go:564 in notifyChannel) | collector "ad-kubecontainerprovider" did not close the event bundle channel in time, continuing with downstream collectors. bundle dump: {Events:[{Type:1 Entity:0xc002186180}] Ch:0xc001ed81e0}
2023-01-19 11:52:12 GMT | CORE | WARN | (pkg/workloadmeta/store.go:564 in notifyChannel) | collector "ad-containerlistener" did not close the event bundle channel in time, continuing with downstream collectors. bundle dump: {Events:[{Type:1 Entity:0xc002186000}] Ch:0xc003b3b200}
2023-01-19 11:52:18 GMT | CORE | WARN | (pkg/collector/python/datadog_agent.go:125 in LogMessage) | apache:d914e6062fc24d91 | (apache.py:94) | Caught exception HTTPConnectionPool(host='10.0.1.19', port=26869): Max retries exceeded with url: /server-status?auto (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8eb036e970>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-01-19 11:52:18 GMT | CORE | ERROR | (pkg/collector/worker/check_logger.go:69 in Error) | check:apache | Error running check: [{"message": "HTTPConnectionPool(host='10.0.1.19', port=26869): Max retries exceeded with url: /server-status?auto (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8eb036e970>: Failed to establish a new connection: [Errno 111] Connection refused'))...<<traceback truncated>>
2023-01-19 11:52:30 GMT | CORE | WARN | (pkg/workloadmeta/store.go:564 in notifyChannel) | collector "ad-kubecontainerprovider" did not close the event bundle channel in time, continuing with downstream collectors. bundle dump: {Events:[{Type:1 Entity:0xc002186600}] Ch:0xc000180de0}
2023-01-19 11:52:33 GMT | CORE | WARN | (pkg/collector/python/datadog_agent.go:125 in LogMessage) | apache:d914e6062fc24d91 | (apache.py:94) | Caught exception HTTPConnectionPool(host='10.0.1.19', port=26869): Max retries exceeded with url: /server-status?auto (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8eb0393a90>: Failed to establish a new connection: [Errno 111] Connection refused'))
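
For anyone checking whether their own agent log shows the same pattern, here is a minimal sketch that counts the "did not close the event bundle channel in time" warnings per collector. It assumes only the pipe-delimited line layout visible in the logs above. The names `LINE_RE`, `STALL_RE`, `SAMPLE_LINES` and `stalled_collectors` are hypothetical helpers, not part of the agent, and the sample messages are abbreviated copies of the lines above.

```python
import re
from collections import Counter

# Layout assumed from the log excerpt above:
# timestamp | component | level | (file:line in func) | message
LINE_RE = re.compile(
    r'^(?P<ts>\S+ \S+ \S+) \| (?P<component>\w+) \| (?P<level>\w+) \| '
    r'\((?P<location>[^)]*)\) \| (?P<message>.*)$'
)

# Accepts one or more quote characters around the collector name,
# so lines with doubled quotes are matched as well.
STALL_RE = re.compile(
    r'collector "+(?P<name>[^"]+)"+ did not close the event bundle channel in time'
)

# Abbreviated copies of lines from the excerpt above (messages shortened).
SAMPLE_LINES = [
    '2023-01-19 11:52:09 GMT | CORE | INFO | (pkg/logs/internal/tailers/docker/tailer.go:108 in Stop) | Stop tailing container: aa927fcfaee1',
    '2023-01-19 11:52:11 GMT | CORE | WARN | (pkg/workloadmeta/store.go:564 in notifyChannel) | collector "ad-kubecontainerprovider" did not close the event bundle channel in time',
    '2023-01-19 11:52:12 GMT | CORE | WARN | (pkg/workloadmeta/store.go:564 in notifyChannel) | collector "ad-containerlistener" did not close the event bundle channel in time',
    '2023-01-19 11:52:18 GMT | CORE | WARN | (pkg/collector/python/datadog_agent.go:125 in LogMessage) | apache:d914e6062fc24d91 | (apache.py:94) | Caught exception',
    '2023-01-19 11:52:30 GMT | CORE | WARN | (pkg/workloadmeta/store.go:564 in notifyChannel) | collector "ad-kubecontainerprovider" did not close the event bundle channel in time',
]


def stalled_collectors(lines):
    """Count the 'did not close the event bundle channel' warnings per collector."""
    counts = Counter()
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        # Skip lines that do not follow the assumed layout or are not warnings.
        if parsed is None or parsed.group("level") != "WARN":
            continue
        stall = STALL_RE.search(parsed.group("message"))
        if stall:
            counts[stall.group("name")] += 1
    return counts


if __name__ == "__main__":
    for name, count in stalled_collectors(SAMPLE_LINES).most_common():
        print(f"{name}: {count}")
```

The function accepts any iterable of lines, so an open handle on the agent log file can be passed in place of `SAMPLE_LINES`. The counts only show which collectors were reported as slow; they do not by themselves show why events stop being processed.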

Version:

Agent 7.41.1 - Commit: 4f39b9e - Serialization version: v5.0.39 - Go version: go1.18.9

Status:

===============
Agent (v7.41.1)
===============

  Status date: 2023-01-19 16:12:50.588 GMT (1674144770588)
  Agent start: 2023-01-19 15:40:50.638 GMT (1674142850638)
  Pid: 368357
  Go Version: go1.18.9
  Python Version: 3.8.14
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -3.138ms
    System time: 2023-01-19 16:12:50.588 GMT (1674144770588)

  Host Info
  =========
    bootTime: 2022-12-13 17:24:08 GMT (1670952248000)
    hostId: d672a673-e745-4673-aee9-776136130a78
    kernelArch: x86_64
    kernelVersion: 5.4.0-135-generic
    os: linux
    platform: ubuntu
    platformFamily: debian
    platformVersion: 20.04
    procs: 608
    uptime: 886h16m54s

  Hostnames
  =========
    hostname: copper
    socket-fqdn: localhost
    socket-hostname: copper
    hostname provider: os
    unused hostname providers:
      'hostname' configuration/environment: hostname is empty
      'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled
      aws: not retrieving hostname from AWS: the host is not an ECS instance and other providers already retrieve non-default hostnames
      azure: azure_hostname_style is set to 'os'
      container: the agent is not containerized
      fargate: agent is not runnning on Fargate
      fqdn: 'hostname_fqdn' configuration is not enabled
      gce: unable to retrieve hostname from GCE: GCE metadata API error: Get "http://169.254.169.254/computeMetadata/v1/instance/hostname": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

  Metadata
  ========
    agent_version: 7.41.1
    config_apm_dd_url: 
    config_dd_url: 
    config_logs_dd_url: 
    config_logs_socks5_proxy_address: 
    config_no_proxy: []
    config_process_dd_url: 
    config_proxy_http: 
    config_proxy_https: 
    config_site: 
    feature_apm_enabled: true
    feature_cspm_enabled: false
    feature_cws_enabled: false
    feature_logs_enabled: true
    feature_networks_enabled: true
    feature_networks_http_enabled: false
    feature_networks_https_enabled: false
    feature_otlp_enabled: false
    feature_process_enabled: true
    feature_processes_container_enabled: false
    flavor: agent
    hostname_source: os
    install_method_installer_version: datadog_formula-3.4
    install_method_tool: saltstack
    install_method_tool_version: saltstack-3005.1
    logs_transport: HTTP

=========
Collector
=========

  Running Checks
  ==============

    apache (4.2.0)
    --------------
      Instance ID: apache:38fee1db768862ab [OK]
      Configuration Source: container:docker://bbbc5a6d6df8d2a4b87af0a84b06a3cbbb2dcd88d625166ebc2fdf9029a763fa
      Total Runs: 128
      Metric Samples: Last Run: 21, Total: 2,688
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 6ms
      Last Execution Date : 2023-01-19 16:12:50 GMT (1674144770000)
      Last Successful Execution Date : 2023-01-19 16:12:50 GMT (1674144770000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 54
        version.raw: 2.4.54
        version.scheme: semver

      Instance ID: apache:4bfc7359ef2e5faa [OK]
      Configuration Source: container:docker://030df0dc14386862bdf704e7f2ae554a4807ebaca2b12f587fb4b0fbe7a90dd4
      Total Runs: 127
      Metric Samples: Last Run: 21, Total: 2,667
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 127
      Average Execution Time : 6ms
      Last Execution Date : 2023-01-19 16:12:40 GMT (1674144760000)
      Last Successful Execution Date : 2023-01-19 16:12:40 GMT (1674144760000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 54
        version.raw: 2.4.54
        version.scheme: semver

      Instance ID: apache:78f10fbd4b35429 [OK]
      Configuration Source: container:docker://34b9a7f741057296e39a3fc66284e17949943183941636e964b54d9243073da0
      Total Runs: 127
      Metric Samples: Last Run: 21, Total: 2,667
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 127
      Average Execution Time : 7ms
      Last Execution Date : 2023-01-19 16:12:38 GMT (1674144758000)
      Last Successful Execution Date : 2023-01-19 16:12:38 GMT (1674144758000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 54
        version.raw: 2.4.54
        version.scheme: semver

      Instance ID: apache:8af428de6c2145ba [OK]
      Configuration Source: container:docker://f6d9448e8e2e29867654d0cf7bb72468112e2b6a87302fef0e8d7bcd1018138b
      Total Runs: 128
      Metric Samples: Last Run: 21, Total: 2,688
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 6ms
      Last Execution Date : 2023-01-19 16:12:41 GMT (1674144761000)
      Last Successful Execution Date : 2023-01-19 16:12:41 GMT (1674144761000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 55
        version.raw: 2.4.55
        version.scheme: semver

      Instance ID: apache:b8da6b4ad6cab67a [OK]
      Configuration Source: container:docker://e9aaea3b36d3b88bb54e761868c3f2aae17f689d6e7b9317cb8a27b502b7db58
      Total Runs: 128
      Metric Samples: Last Run: 21, Total: 2,688
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 5ms
      Last Execution Date : 2023-01-19 16:12:42 GMT (1674144762000)
      Last Successful Execution Date : 2023-01-19 16:12:42 GMT (1674144762000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 54
        version.raw: 2.4.54
        version.scheme: semver

      Instance ID: apache:cb5ee39a3560a785 [OK]
      Configuration Source: container:docker://ecadd0b3743f23aa43be3d17a00ffd0eb49a316a4970763cad222a64b7651e55
      Total Runs: 128
      Metric Samples: Last Run: 21, Total: 2,688
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 6ms
      Last Execution Date : 2023-01-19 16:12:49 GMT (1674144769000)
      Last Successful Execution Date : 2023-01-19 16:12:49 GMT (1674144769000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 54
        version.raw: 2.4.54
        version.scheme: semver

      Instance ID: apache:dc7a5e672221ab [OK]
      Configuration Source: container:docker://b2c4b275a48cd61699acdf87241d07c938034fdb3852190dea96cd6d62b09c29
      Total Runs: 128
      Metric Samples: Last Run: 21, Total: 2,688
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 6ms
      Last Execution Date : 2023-01-19 16:12:43 GMT (1674144763000)
      Last Successful Execution Date : 2023-01-19 16:12:43 GMT (1674144763000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 43
        version.raw: 2.4.43
        version.scheme: semver

      Instance ID: apache:eeaaba4a4fb3840a [OK]
      Configuration Source: container:docker://0b7a7782ba6e2989a6a2730c8a979323f31871f05bc16086f056365996d15958
      Total Runs: 127
      Metric Samples: Last Run: 21, Total: 2,667
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 127
      Average Execution Time : 8ms
      Last Execution Date : 2023-01-19 16:12:39 GMT (1674144759000)
      Last Successful Execution Date : 2023-01-19 16:12:39 GMT (1674144759000)
      metadata:
        version.major: 2
        version.minor: 4
        version.patch: 54
        version.raw: 2.4.54
        version.scheme: semver

    consul (2.2.0)
    --------------
      Instance ID: consul:de4f5e752738aae4 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/consul.d/conf.yaml
      Total Runs: 128
      Metric Samples: Last Run: 71, Total: 9,155
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 3, Total: 391
      Average Execution Time : 41ms
      Last Execution Date : 2023-01-19 16:12:39 GMT (1674144759000)
      Last Successful Execution Date : 2023-01-19 16:12:39 GMT (1674144759000)
      metadata:
        version.major: 1
        version.minor: 14
        version.patch: 2
        version.raw: 1.14.2
        version.scheme: semver

    container
    ---------
      Instance ID: container [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/container.d/conf.yaml.default
      Total Runs: 127
      Metric Samples: Last Run: 832, Total: 105,664
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 32ms
      Last Execution Date : 2023-01-19 16:12:38 GMT (1674144758000)
      Last Successful Execution Date : 2023-01-19 16:12:38 GMT (1674144758000)

    containerd
    ----------
      Instance ID: containerd [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/containerd.d/conf.yaml.default
      Total Runs: 128
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-01-19 16:12:45 GMT (1674144765000)
      Last Successful Execution Date : 2023-01-19 16:12:45 GMT (1674144765000)

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 127
      Metric Samples: Last Run: 9, Total: 1,136
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-01-19 16:12:37 GMT (1674144757000)
      Last Successful Execution Date : 2023-01-19 16:12:37 GMT (1674144757000)

    disk (4.7.1)
    ------------
      Instance ID: disk:be54d36a859b42e8 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml
      Total Runs: 128
      Metric Samples: Last Run: 72, Total: 9,216
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 12ms
      Last Execution Date : 2023-01-19 16:12:46 GMT (1674144766000)
      Last Successful Execution Date : 2023-01-19 16:12:46 GMT (1674144766000)

    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 128
      Metric Samples: Last Run: 99, Total: 12,672
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 119ms
      Last Execution Date : 2023-01-19 16:12:44 GMT (1674144764000)
      Last Successful Execution Date : 2023-01-19 16:12:44 GMT (1674144764000)

    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 127
      Metric Samples: Last Run: 5, Total: 635
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-01-19 16:12:36 GMT (1674144756000)
      Last Successful Execution Date : 2023-01-19 16:12:36 GMT (1674144756000)

    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 128
      Metric Samples: Last Run: 67, Total: 8,531
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms
      Last Execution Date : 2023-01-19 16:12:43 GMT (1674144763000)
      Last Successful Execution Date : 2023-01-19 16:12:43 GMT (1674144763000)

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 128
      Metric Samples: Last Run: 6, Total: 768
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-01-19 16:12:50 GMT (1674144770000)
      Last Successful Execution Date : 2023-01-19 16:12:50 GMT (1674144770000)

    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 128
      Metric Samples: Last Run: 20, Total: 2,560
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-01-19 16:12:42 GMT (1674144762000)
      Last Successful Execution Date : 2023-01-19 16:12:42 GMT (1674144762000)

    network (2.9.2)
    ---------------
      Instance ID: network:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 128
      Metric Samples: Last Run: 350, Total: 44,800
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 20ms
      Last Execution Date : 2023-01-19 16:12:49 GMT (1674144769000)
      Last Successful Execution Date : 2023-01-19 16:12:49 GMT (1674144769000)

    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 3
      Metric Samples: Last Run: 1, Total: 3
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 3
      Average Execution Time : 115ms
      Last Execution Date : 2023-01-19 16:10:56 GMT (1674144656000)
      Last Successful Execution Date : 2023-01-19 16:10:56 GMT (1674144656000)

    php_fpm (2.2.0)
    ---------------
      Instance ID: php_fpm:119913f20264ee70 [OK]
      Configuration Source: container:docker://8e3f1302b6fc41ca74d05d9b36b957c1ff83fdc2ddc742f773070dfd21f77134
      Total Runs: 128
      Metric Samples: Last Run: 7, Total: 896
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 5ms
      Last Execution Date : 2023-01-19 16:12:44 GMT (1674144764000)
      Last Successful Execution Date : 2023-01-19 16:12:44 GMT (1674144764000)

      Instance ID: php_fpm:67ff78c127f3d689 [OK]
      Configuration Source: container:docker://4572db2bfa9b422ebb14c85ec36986f229c209fdb783c8f29ee33453424dcf5b
      Total Runs: 128
      Metric Samples: Last Run: 7, Total: 896
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 4ms
      Last Execution Date : 2023-01-19 16:12:45 GMT (1674144765000)
      Last Successful Execution Date : 2023-01-19 16:12:45 GMT (1674144765000)

      Instance ID: php_fpm:76832668d3fef0b4 [OK]
      Configuration Source: container:docker://00441c041483816d80e1aec7177f94a9177e14528c94980342bd91ff98681b2e
      Total Runs: 128
      Metric Samples: Last Run: 7, Total: 896
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 5ms
      Last Execution Date : 2023-01-19 16:12:48 GMT (1674144768000)
      Last Successful Execution Date : 2023-01-19 16:12:48 GMT (1674144768000)

      Instance ID: php_fpm:7fa79cf1e1dde427 [OK]
      Configuration Source: container:docker://2bea0471b1c73e8d2fe37fd8cd9ebb2a51954d9307fb2fd0a0c00d06c1f5990d
      Total Runs: 128
      Metric Samples: Last Run: 7, Total: 896
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 128
      Average Execution Time : 9ms
      Last Execution Date : 2023-01-19 16:12:46 GMT (1674144766000)
      Last Successful Execution Date : 2023-01-19 16:12:46 GMT (1674144766000)

      Instance ID: php_fpm:9c8b94be9990fdaf [OK]
      Configuration Source: container:docker://8b6ed37d19ecdd666d8ee2b22a3347e9493fd8383f93928cc1caa9c9d2aca37d
      Total Runs: 127
      Metric Samples: Last Run: 7, Total: 889
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 127
      Average Execution Time : 4ms
      Last Execution Date : 2023-01-19 16:12:37 GMT (1674144757000)
      Last Successful Execution Date : 2023-01-19 16:12:37 GMT (1674144757000)

      Instance ID: php_fpm:e9fc1b071eee686c [OK]
      Configuration Source: container:docker://a3c65c04a4270c052e82096bd13036f576321c5f6c2c5a4933f9708c46c9d69c
      Total Runs: 127
      Metric Samples: Last Run: 7, Total: 889
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 127
      Average Execution Time : 4ms
      Last Execution Date : 2023-01-19 16:12:36 GMT (1674144756000)
      Last Successful Execution Date : 2023-01-19 16:12:36 GMT (1674144756000)

      Instance ID: php_fpm:f8c6d711ddbe909e [OK]
      Configuration Source: container:docker://0ad26c3330c4606b3f2e488335a30bacd7216721f3b35f1796a15b5f5467b39f
      Total Runs: 122
      Metric Samples: Last Run: 7, Total: 805
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 122
      Average Execution Time : 4ms
      Last Execution Date : 2023-01-19 16:12:47 GMT (1674144767000)
      Last Successful Execution Date : 2023-01-19 16:12:47 GMT (1674144767000)

    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 128
      Metric Samples: Last Run: 1, Total: 128
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Last Execution Date : 2023-01-19 16:12:41 GMT (1674144761000)
      Last Successful Execution Date : 2023-01-19 16:12:41 GMT (1674144761000)

========
JMXFetch
========

  Information
  ==================
  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 0
    ClusterRole: 0
    ClusterRoleBinding: 0
    CronJob: 0
    DaemonSet: 0
    Deployment: 0
    Dropped: 0
    HighPriorityQueueFull: 0
    Ingress: 0
    Job: 0
    Namespace: 0
    Node: 0
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 0
    RoleBinding: 0
    Service: 0
    ServiceAccount: 0
    StatefulSet: 0

  Transaction Successes
  =====================
    Total number: 269
    Successes By Endpoint:
      check_run_v1: 127
      intake: 12
      metadata_v1: 3
      series_v2: 127

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

  API Keys status
  ===============
    API key ending with 68a13: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.eu - API Key ending with:
      - 68a13

==========
Logs Agent
==========
    Reliable: Sending compressed logs in HTTPS to agent-http-intake.logs.datadoghq.eu on port 443
    BytesSent: 1.26175933e+08
    EncodedBytesSent: 1.1644357e+07
    LogsProcessed: 126571
    LogsSent: 126455

  journald
  --------
    - Type: journald
      Status: OK
      Inputs:
        journald:default
      BytesRead: 6.257091e+06
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 6
      24h Peak Latency (ms): 6

  docker
  ------
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        00441c041483816d80e1aec7177f94a9177e14528c94980342bd91ff98681b2e
      BytesRead: 165349
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        030df0dc14386862bdf704e7f2ae554a4807ebaca2b12f587fb4b0fbe7a90dd4
      BytesRead: 364631
      Average Latency (ms): 2
      24h Average Latency (ms): 2
      Peak Latency (ms): 47
      24h Peak Latency (ms): 47
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        0ad26c3330c4606b3f2e488335a30bacd7216721f3b35f1796a15b5f5467b39f
      BytesRead: 2.31492e+06
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 5
      24h Peak Latency (ms): 5
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        0b7a7782ba6e2989a6a2730c8a979323f31871f05bc16086f056365996d15958
      BytesRead: 111703
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        10e0e4131fccf6d3d1282974b574ba6269500207f6951e10ad3c9ccf2f91a714
      BytesRead: 112846
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: cron
      Status: OK
      Inputs:
        186840bcf100786b21e83d8eff355df5289473501b65961172f43f01314fd7f3
      BytesRead: 55904
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        2bea0471b1c73e8d2fe37fd8cd9ebb2a51954d9307fb2fd0a0c00d06c1f5990d
      BytesRead: 494146
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 1
      24h Peak Latency (ms): 1
    - Type: docker
      Service: fabio
      Source: fabio
      Status: OK
      Inputs:
        33f2c1567affb8d771a6fa4486ab5ce4641ee616dc6e840ff1155751a8434e92
      BytesRead: 1.7317784e+07
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 18
      24h Peak Latency (ms): 18
    - Type: docker
      Service: tyrion-api
      Source: apache
      Status: OK
      Inputs:
        34b9a7f741057296e39a3fc66284e17949943183941636e964b54d9243073da0
      BytesRead: 1.793461e+06
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 13
      24h Peak Latency (ms): 13
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        4572db2bfa9b422ebb14c85ec36986f229c209fdb783c8f29ee33453424dcf5b
      BytesRead: 762618
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 8
      24h Peak Latency (ms): 8
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        69d63b38db79e4f42dde9423f32b8f4a1f71500ea1148089778051de3535bb1e
      BytesRead: 28564
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: varnish
      Status: OK
      Inputs:
        7db8c907c3b6cd0371f4625c3dba42616c57f60a59d4eb45a1a6722b7669a4ee
      BytesRead: 1.6281605e+07
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 19
      24h Peak Latency (ms): 19
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        84dcf31b36c0bc87e95eff8c6213d97465ffb4a7920644243702c3cb57e4a1b5
      BytesRead: 221330
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        8b6ed37d19ecdd666d8ee2b22a3347e9493fd8383f93928cc1caa9c9d2aca37d
      BytesRead: 58157
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        8e3f1302b6fc41ca74d05d9b36b957c1ff83fdc2ddc742f773070dfd21f77134
      BytesRead: 58272
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: nuxt
      Status: OK
      Inputs:
        8f80c410f6c8dbe3f5cc5e5ce2420b31db86277581b813bd227c5174e5d0ebec
      BytesRead: 2480
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: nuxt
      Status: OK
      Inputs:
        91160aebc9bde8e4f14e19ec81cfafa94c875f134076e045dbbb489a96bf3c5f
      BytesRead: 6503
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: nuxt
      Status: OK
      Inputs:
        9a16663605cc6da79be360f684ea3491421827976342a6811729ab35a81e4b5d
      BytesRead: 2480
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: varnish
      Status: OK
      Inputs:
        9a79a6948181a63a8a8106e8af12e87e30beac105fec096887f49533290e4324
      BytesRead: 1.6014792e+07
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 33
      24h Peak Latency (ms): 33
    - Type: docker
      Service: ***
      Source: cron
      Status: OK
      Inputs:
        a1a5f3b0710444f571fff7e9f056e1d930951d1e24af95f2ea031a8bd3173ea1
      BytesRead: 202824
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: php
      Status: OK
      Inputs:
        a3c65c04a4270c052e82096bd13036f576321c5f6c2c5a4933f9708c46c9d69c
      BytesRead: 281182
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 10
      24h Peak Latency (ms): 10
    - Type: docker
      Service: ***
      Source: varnish
      Status: OK
      Inputs:
        a9b97c06c2f7567b0e8c5543bec5cb4027a9a747b8630f925fe93b750fd1c9ba
      BytesRead: 406789
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 1
      24h Peak Latency (ms): 1
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        b2c4b275a48cd61699acdf87241d07c938034fdb3852190dea96cd6d62b09c29
      BytesRead: 46054
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: varnish
      Status: OK
      Inputs:
        b5b5b307ed342190eda4ee6b536ad9552d659429f9d1bcca60abc65a96df0521
      BytesRead: 111472
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: haproxy
      Status: OK
      Inputs:
        bb2549155c03dbf2cf62cdf5b38a630c3ed4b66e627916e83dec44e2073b09df
      BytesRead: 209855
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        bbbc5a6d6df8d2a4b87af0a84b06a3cbbb2dcd88d625166ebc2fdf9029a763fa
      BytesRead: 1.055271e+06
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 2
      24h Peak Latency (ms): 2
    - Type: docker
      Service: ***
      Source: varnish
      Status: OK
      Inputs:
        dae944d068b1ded2267f0f33da591731b459031a64ab60799577e7882778738e
      BytesRead: 109233
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        e9aaea3b36d3b88bb54e761868c3f2aae17f689d6e7b9317cb8a27b502b7db58
      BytesRead: 74944
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        ecadd0b3743f23aa43be3d17a00ffd0eb49a316a4970763cad222a64b7651e55
      BytesRead: 75385
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 1
      24h Peak Latency (ms): 1
    - Type: docker
      Service: ***
      Source: apache
      Status: OK
      Inputs:
        f6d9448e8e2e29867654d0cf7bb72468112e2b6a87302fef0e8d7bcd1018138b
      BytesRead: 767429
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 4
      24h Peak Latency (ms): 4

  container_collect_all
  ---------------------
    - Type: docker
      Service: ***
      Source: ***
      Status: OK
      Inputs:
        83d12ecf40a0ded7ed0001675cd642992dcd1581c5905358589cae65d975e37d
      BytesRead: 4825
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0
    - Type: docker
      Service: ***
      Source: ***
      Status: OK
      Inputs:
        bf69ecf38ae246eb87b787774860db4f32ecbe6bf0cb659ce2ffe91cd3566c7d
      BytesRead: 0
      Average Latency (ms): 0
      24h Average Latency (ms): 0
      Peak Latency (ms): 0
      24h Peak Latency (ms): 0

============
System Probe
============
  Status: Running
  Uptime: 31m45.000270514s
  Last Updated: 2023-01-19 16:12:38 GMT (1674144758000)

  NPM
  ===
    Status: Running
    Last Check: 2023-01-19 16:12:27 GMT (1674144747000)

=============
Process Agent
=============

  Version: 7.41.1
  Status date: 2023-01-19 16:13:00.117 GMT (1674144780117)
  Process Agent Start: 2023-01-19 15:40:50.786 GMT (1674142850786)
  Pid: 368358
  Go Version: go1.18.9
  Build arch: amd64
  Log Level: info
  Enabled Checks: [process rtprocess connections]
  Allocated Memory: 36,363,928 bytes
  Hostname: copper

  =================
  Process Endpoints
  =================
    https://process.datadoghq.eu - API Key ending with:
        - 68a13

  =========
  Collector
  =========
    Last collection time: 2023-01-19 16:12:59
    Docker socket: /var/run/docker.sock
    Number of processes: 389
    Number of containers: 34
    Process Queue length: 0
    RTProcess Queue length: 0
    Connections Queue length: 0
    Event Queue length: 0
    Pod Queue length: 0
    Process Bytes enqueued: 0
    RTProcess Bytes enqueued: 0
    Connections Bytes enqueued: 0
    Event Bytes enqueued: 0
    Pod Bytes enqueued: 0
    Drop Check Payloads: []

=========
APM Agent
=========
  Status: Running
  Pid: 368367
  Uptime: 1929 seconds
  Mem alloc: 8,985,784 bytes
  Hostname: copper
  Receiver: 0.0.0.0:8126
  Endpoints:
    https://trace.agent.datadoghq.eu

  Receiver (previous minute)
  ==========================
    No traces received in the previous minute.

  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

==========
Aggregator
==========
  Checks Metric Sample: 324,293
  Dogstatsd Metric Sample: 1,020,844
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 127
  Series Flushed: 393,356
  Service Check: 6,004
  Service Checks Flushed: 6,090

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 1,020,843
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 159,459,477
  Udp Packet Reading Errors: 0
  Udp Packets: 141,172
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0
  Unterminated Metric Errors: 0

====
OTLP
====

  Status: Not enabled
  Collector status: Not running

Describe what happened:

I'm seeing this, on average, a couple of times a day on each of 12 cluster nodes. Through the day, as new services (each consisting of one or more containers) are deployed and old versions are terminated, logs for the new services stop appearing in Datadog. Existing services continue to log without issue. Judging from the logs, the cause seems to be an error while handling container shutdown, after which the Datadog Agent stops handling further Docker events until it is restarted.
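For anyone trying to spot the same symptom on their own nodes, here is a minimal sketch that scans agent log lines for the `tryRestartReader` warning shown in the log excerpt at the top of this issue. The line layout is inferred from that excerpt only, and the function name `find_reader_restart_failures` is made up for this example. The warning is just a marker that appears around the failure, not a confirmed root cause.

```python
import re

# Layout inferred from the log excerpt quoted in this issue (an assumption):
# "<date> <time> <tz> | <component> | <level> | (<file>:<line> in <func>) | <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+ \S+ \S+) \| (?P<component>\w+) \| (?P<level>\w+) \| "
    r"\((?P<location>[^)]*)\) \| (?P<message>.*)$"
)


def find_reader_restart_failures(lines):
    """Return (timestamp, message) for each WARN logged from tryRestartReader."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not in the expected layout, skip it
        if match["level"] == "WARN" and "tryRestartReader" in match["location"]:
            hits.append((match["ts"], match["message"]))
    return hits


# Two lines copied from the excerpt at the top of this issue.
SAMPLE_LINES = [
    "2023-01-19 11:52:08 GMT | CORE | INFO | (pkg/autodiscovery/config_poller.go:105 in stream) | "
    "kubernetes-container-allinone provider: collected 0 new configurations, removed 3",
    "2023-01-19 11:52:09 GMT | CORE | WARN | (pkg/logs/internal/tailers/docker/tailer.go:181 in tryRestartReader) | "
    "Could not restart the docker reader for container aa927fcfaee1: Error: No such container: "
    "aa927fcfaee11e4d3246c4ebb47b6eeddc9b790188fd19af360f608697038698:",
]

for timestamp, message in find_reader_restart_failures(SAMPLE_LINES):
    print(timestamp, message)
```

Feeding it the lines of the agent log file instead of `SAMPLE_LINES` gives a rough timeline of when the warning fired on a node.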

Describe what you expected: I'd expect the Datadog Agent to handle the termination of Docker containers, and the identification and registration of new containers, correctly and consistently.

Steps to reproduce the issue: This happens frequently across all of our cluster nodes, but I have also seen it with just datadog-agent and Docker on a standard Ubuntu setup, with manually started and stopped containers. I cannot, however, find a pattern of behaviour that reliably triggers the error.

Additional environment details (Operating System, Cloud provider, etc): I've managed to reproduce this with the latest datadog-agent and the latest docker-ce, both installed from apt, on Ubuntu 20.04 (with the latest updates).

It may be relevant that our Docker daemon runs in subuid/subgid mode (with "userns-remap": "default" set in daemon.json). Because datadog-agent doesn't handle standard JSON-based log processing in this mode (Docker writes the container data to a different folder), we also use journald as our Docker log driver.

Our daemon.json docker config is:

{
  "storage-driver": "overlay2",
  "log-driver": "journald",
  "userns-remap": "default",
  "live-restore": true
}
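
To compare another node against this setup, a small sketch like the following reads a daemon.json and reports the two settings mentioned above. The helper names `describe_docker_logging` and `matches_reported_setup` are hypothetical. The only outside facts it relies on are that Docker falls back to the `json-file` log driver when `log-driver` is unset, and that user namespace remapping is off when `userns-remap` is absent.

```python
import json


def describe_docker_logging(daemon_json_text):
    """Extract the user-namespace and log-driver settings from daemon.json text."""
    config = json.loads(daemon_json_text)
    return {
        # None means user namespace remapping is not configured.
        "userns_remap": config.get("userns-remap"),
        # Docker uses the json-file driver when no log-driver is set.
        "log_driver": config.get("log-driver", "json-file"),
    }


def matches_reported_setup(daemon_json_text):
    """True if the config uses userns-remap together with the journald driver."""
    settings = describe_docker_logging(daemon_json_text)
    return bool(settings["userns_remap"]) and settings["log_driver"] == "journald"


# The config posted in this issue.
REPORTED_CONFIG = """
{
  "storage-driver": "overlay2",
  "log-driver": "journald",
  "userns-remap": "default",
  "live-restore": true
}
"""

print(describe_docker_logging(REPORTED_CONFIG))
print(matches_reported_setup(REPORTED_CONFIG))
```

Whether this combination actually matters for the bug is not established in this thread; the check only tells you if a node is configured the same way.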
grickit commented 1 year ago

Issue #12101 is very likely related.

vickenty commented 1 year ago

This issue will be fixed in version 7.43.0 with https://github.com/DataDog/datadog-agent/pull/15138.

grickit commented 1 year ago

@vickenty As far as I can tell from release notes, this one didn't make it in. Can you confirm?

vickenty commented 1 year ago

@grickit The fix was merged without a release note unfortunately, but it was included in the release.
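
Since the fix has no release note to look for, one way to check a node is to compare its agent version against 7.43.0. Below is a minimal sketch, assuming the version appears as `major.minor.patch` somewhere in the text (for example the `Version: 7.41.1` line in the status output above). The names `parse_version` and `includes_fix` are invented for this example, and any pre-release suffix is ignored.

```python
import re

# First release reported in this thread to contain the fix.
FIXED_IN = (7, 43, 0)


def parse_version(text):
    """Return the first major.minor.patch triple found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def includes_fix(version_text):
    """True if the version is 7.43.0 or later (suffixes such as -rc are ignored)."""
    return parse_version(version_text) >= FIXED_IN


print(includes_fix("Version: 7.41.1"))
print(includes_fix("Version: 7.43.0"))
```

Tuples of ints compare element by element, so `(7, 41, 1)` sorts below `(7, 43, 0)` without any string comparison pitfalls.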