DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] NTP check fails with default config #20369

Open kanongil opened 8 months ago

kanongil commented 8 months ago

Agent Environment

Agent 7.48.1 - Commit: e3fa058 - Serialization version: v5.0.93 - Go version: go1.20.8

Describe what happened:

The NTP check stopped working with the default configuration.

Describe what you expected:

That it works…

Steps to reproduce the issue:

  1. Install the Agent on any Hetzner Cloud instance.

Additional environment details (Operating System, Cloud provider, etc):

Running journalctl -u datadog-agent -g 'ntp\.go|cloudproviders\.go' -o cat shows that the Agent incorrectly detects an AWS environment and defaults to the AWS NTP server IP:

2023-10-24 10:06:51 CEST | CORE | INFO | (pkg/util/cloudproviders/cloudproviders.go:54 in DetectCloudProvider) | Cloud provider AWS detected
2023-10-24 10:06:51 CEST | CORE | INFO | (pkg/util/cloudproviders/cloudproviders.go:91 in GetCloudProviderNTPHosts) | Using NTP servers from AWS cloud provider: ["169.254.169.123"]
2023-10-24 10:06:57 CEST | CORE | ERROR | (pkg/collector/corechecks/net/ntp.go:175 in Run) | Failed to get clock offset from any ntp host
2023-10-24 10:21:57 CEST | CORE | ERROR | (pkg/collector/corechecks/net/ntp.go:175 in Run) | Failed to get clock offset from any ntp host
2023-10-24 10:36:57 CEST | CORE | ERROR | (pkg/collector/corechecks/net/ntp.go:175 in Run) | Failed to get clock offset from any ntp host

Eventually, the Agent also reports:

2023-10-24 08:48:03 CEST | CORE | WARN | (pkg/collector/corechecks/net/ntp.go:210 in queryOffset) | Couldn't query the ntp host 169.254.169.123 for 10 times in a row: read udp <redacted-ip>:60474->169.254.169.123:123: i/o timeout
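To check whether an NTP host answers at all, independently of the Agent, a small SNTP probe can help. The following is a minimal sketch, not the Agent's own code: the function names (`build_request`, `parse_transmit_time`, `probe`) and the 5-second timeout are my own choices for illustration.

```python
import socket
import struct

NTP_PORT = 123
# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_DELTA = 2208988800


def build_request() -> bytes:
    """Return a 48-byte SNTP client request (LI=0, VN=3, Mode=3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server transmit timestamp (bytes 40..47) as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP response too short")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def probe(host: str, timeout: float = 5.0):
    """Query one NTP host; return its transmit time, or None on any failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (host, NTP_PORT))
            packet, _ = sock.recvfrom(512)
            return parse_transmit_time(packet)
        except (OSError, ValueError):
            # OSError covers timeouts and DNS or network errors.
            return None
```

On an affected host, `probe("169.254.169.123")` would be expected to return `None`, matching the i/o timeout in the log above, while `probe("0.datadog.pool.ntp.org")` should return a timestamp if outbound UDP port 123 is allowed.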

This issue is caused by a combination of:

  1. Hetzner Cloud exposing an AWS-compatible metadata endpoint at http://169.254.169.254/latest/meta-data/instance-id
  2. The Datadog Agent assuming, in its cloud provider detection check, that this means it is running in a full AWS environment.
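To illustrate why a response on the instance-id path alone is weak evidence of AWS, here is a hedged sketch of a stricter test on the returned value. This is not the Agent's actual detection logic. It relies on EC2 instance IDs having the form `i-` plus 8 or 17 hexadecimal characters, and it assumes, without confirmation from this thread, that other providers return IDs in a different shape. The function name is hypothetical.

```python
import re

# EC2 instance IDs are "i-" followed by 8 (legacy) or 17 hexadecimal characters.
_EC2_INSTANCE_ID = re.compile(r"i-(?:[0-9a-f]{8}|[0-9a-f]{17})")


def looks_like_ec2_instance_id(value: str) -> bool:
    """Return True only if the metadata value has the EC2 instance ID shape."""
    return _EC2_INSTANCE_ID.fullmatch(value.strip()) is not None
```

With this check, a value such as `i-0123456789abcdef0` passes, while a purely numeric ID (the hypothetical `12345678`) would not be treated as AWS.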

kanongil commented 8 months ago

FYI, as a local workaround, this can be fixed using a custom ntp.d/conf.yaml file:

init_config:

instances:
  - hosts:
      - 0.datadog.pool.ntp.org
      - 1.datadog.pool.ntp.org
      - 2.datadog.pool.ntp.org
      - 3.datadog.pool.ntp.org

utikpuhlik commented 3 months ago

Unless you're using the Hetzner provider, I just override the values in /etc/systemd/timesyncd.conf, from:

NTP=ntp.hetzner.com

to:

NTP=0.datadog.pool.ntp.org 1.datadog.pool.ntp.org 2.datadog.pool.ntp.org 3.datadog.pool.ntp.org ntp.hetzner.com

froque commented 1 week ago

We went with cloud_provider_metadata: [] in /etc/datadog-agent/datadog.yaml to work around this.