DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

There was an error querying the ntp host #1532

Open adamgotterer opened 6 years ago

adamgotterer commented 6 years ago

Describe what happened: Launched the agent and the logs look like:

[ AGENT ] 2018-03-27 18:42:06 UTC | INFO | (transaction.go:129 in Process) | Successfully posted payload to "https://6-1-0-app.agent.datadoghq.com/intake/?api_key=*************************370fe"
[ AGENT ] 2018-03-27 18:42:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:49888->45.76.244.202:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:27 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:58747->198.137.202.56:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:37 UTC | INFO | (serializer.go:196 in SendJSONToV1Intake) | Sent processes metadata payload, size: 410 bytes.
[ AGENT ] 2018-03-27 18:42:42 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:44762->38.229.71.1:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check cpu
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check cpu
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check disk
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check disk
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check docker
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check docker
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check file_handle
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check file_handle
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check io
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check io
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check load
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check load
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check memory
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check memory
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check network
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:302 in work) | Done running check network
[ AGENT ] 2018-03-27 18:42:52 UTC | INFO | (runner.go:246 in work) | Running check ntp
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:56134->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (runner.go:302 in work) | Done running check ntp
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (runner.go:246 in work) | Running check uptime
[ AGENT ] 2018-03-27 18:42:57 UTC | INFO | (runner.go:302 in work) | Done running check uptime
[ AGENT ] 2018-03-27 18:43:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:43058->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:43:27 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:46164->107.161.29.207:123: i/o timeout
[ AGENT ] 2018-03-27 18:43:42 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:39099->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:43:57 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:52643->171.66.97.126:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:34427->162.210.111.4:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:21 UTC | INFO | (transaction.go:129 in Process) | Successfully posted payload to "https://6-1-0-app.agent.datadoghq.com/api/v1/check_run?api_key=*************************370fe"
[ AGENT ] 2018-03-27 18:44:27 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:53556->204.11.201.10:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:42 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:59500->172.98.193.44:123: i/o timeout
[ AGENT ] 2018-03-27 18:44:57 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:33902->96.244.96.19:123: i/o timeout
[ AGENT ] 2018-03-27 18:45:12 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 172.17.0.5:36223->198.137.202.56:123: i/o timeout

Additional environment details (Operating System, Cloud provider, etc): Docker, AWS, Amazon Linux

shine17 commented 6 years ago

I am also facing the same issue. Is there any update on this? Is any Datadog engineer working on it? Please prioritize this.

2018-04-14 06:17:25 UTC | INFO | (runner.go:246 in work) | Running check ntp
2018-04-14 06:17:30 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:37261->216.218.220.101:123: i/o timeout
2018-04-14 06:17:30 UTC | INFO | (runner.go:302 in work) | Done running check ntp
2018-04-14 06:17:30 UTC | INFO | (runner.go:246 in work) | Running check uptime
2018-04-14 06:17:30 UTC | INFO | (runner.go:302 in work) | Done running check uptime
2018-04-14 06:17:45 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:49068->162.210.111.4:123: i/o timeout
2018-04-14 06:18:00 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:57120->216.218.220.101:123: i/o timeout

I am using the Datadog Docker image datadog/agent:6.1.2-jmx as a sidecar container in K8s.

It seems that, because of this time sync issue, I am not seeing the JMX metrics in Datadog dashboards. https://github.com/DataDog/datadog-agent/blob/68f10761cbcc3b541f645fc4f5cefc65036c3794/pkg/collector/corechecks/network/ntp.go#L122

shine17 commented 6 years ago

This is what I get when I run the Datadog Agent ntp check from the sidecar:

root@something*********-hj2hx:/opt/datadog-agent/bin/agent# ./agent check ntp -r -l INFO
%!s(int=442840760) | INFO | (tagger.go:78 in Init) | starting the tagging system
%!s(int=442840760) | INFO | (runner.go:92 in NewRunner) | Runner started with 1 workers.
%!s(int=442840760) | INFO | (collector.go:51 in NewCollector) | Embedding Python 2.7.14 (default, Apr  4 2018, 16:58:02) [GCC 4.7.2]
%!s(int=442840760) | INFO | (file.go:69 in Collect) | File Configuration Provider: searching for configuration files at: /etc/datadog-agent/conf.d
%!s(int=442840760) | INFO | (file.go:69 in Collect) | File Configuration Provider: searching for configuration files at: /opt/datadog-agent/bin/agent/dist/conf.d
%!s(int=442840760) | WARN | (file.go:73 in Collect) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
%!s(int=442840760) | WARN | (check.go:243 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
%!s(int=442840760) | WARN | (check.go:244 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
%!s(int=442840760) | WARN | (check.go:269 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (disk).
%!s(int=442845760) | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:55411->162.210.111.4:123: i/o timeout
%!s(int=442850760) | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp 25.128.37.38:43615->192.155.90.13:123: i/o timeout
=== Service Checks ===
[
  {
    "check": "ntp.in_sync",
    "host_name": "something**********hj2hx",
    "timestamp": 1523688165,
    "status": 3,
    "message": "",
    "tags": null
  },
  {
    "check": "ntp.in_sync",
    "host_name": "something*************-7b64fd57fb-hj2hx",
    "timestamp": 1523688170,
    "status": 3,
    "message": "",
    "tags": null
  }
]
%!s(int=442850760) | ERROR | (host.go:168 in getCPUInfo) | failed to retrieve cpu info at init time
=========
Collector
=========

  Running Checks
  ==============
    ntp
    ---
      Total Runs: 2
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 2

pbudzon commented 6 years ago

@adamgotterer @shine17 check that your firewalls/security groups allow outgoing connections on high ports (or source port 123 or target port 123). Regular NTP makes connections from port 123 to port 123, but the Datadog ntp check initiates connections from high ports (example in the log above: 25.128.37.38:55411->162.210.111.4:123). I had the same error logged when my firewall was restricting high ports (and only allowing 123<->123).

adamgotterer commented 6 years ago

@pbudzon I'm on AWS with DD running on an ECS cluster. The machines running those containers have egress rules for TCP open on ports 0 - 65535, so I don't think it's a security group issue.

acmcelwee commented 6 years ago

@adamgotterer what about your Network ACL, though?

adamgotterer commented 6 years ago

Just double-checked the network and ACL, and they allow all traffic on all protocols outbound.

pbudzon commented 6 years ago

Remember that network ACLs are stateless, so you need to allow traffic in as well as out.

SleepyBrett commented 6 years ago

... why doesn't datadog just trust the host's time?

pbudzon commented 6 years ago

This doesn’t have much to do with trust. It’s one of the default checks (like pulling out CPU and memory utilization) which validates that your system’s time hasn’t drifted: you can see the time difference (between the system’s time and the NTP-reported time) in Datadog metrics, just like you can see CPU, memory, and a bunch of other stuff out of the box. If you have ntpd or a similar service enabled and working on your server, then this check is usually one you can get by without, but it’s still nice to have, especially if you’re doing any time-sensitive stuff on the server, like crypto or some authentication.

bompus commented 6 years ago

I'm getting the same error. Dedicated server. Ports opened.

2018-05-15 20:55:10 UTC | INFO | (ntp.go:122 in Run) | There was an error querying the ntp host: read udp x.x.x.x:60440->64.113.44.55:123: i/o timeout

bruno-carrier-lookout commented 6 years ago

Any update on this?

mad42 commented 6 years ago

on a fresh installation :

agent.log INFO UTC | INFO | (ntp.go:123 in Run) | There was an error querying the ntp host: read udp x.x.x.x:52870->138.96.64.10:123: i/o timeout

still broken.

simar7 commented 6 years ago

We're experiencing this as well. This seems to be causing the dd agent to stop reporting momentarily and as a result our monitors start to fire for No Data alerts. Any ETA on the fix?

neoghostz commented 6 years ago

Could we look at supporting the ability to pass a runtime variable to define an NTP host for the Docker container within AWS, so we could leverage their internal endpoint (169.254.169.123)?

A number of organisations aren't always keen to allow NTP in/out of their VPC/NAT instances.

ChrisMcKee commented 6 years ago

Isn't there an option to just turn this off?

asmoran commented 6 years ago

We're also having this issue, and it appears to be intermittent. The agent will suddenly recover and start posting metrics briefly, then later fail again for a while.

micahsmith commented 5 years ago

It looks as though there are configuration options for NTP (https://docs.datadoghq.com/integrations/ntp/#configuration).

The page reads "The Agent enables the NTP check by default, but if you want to configure the check yourself, edit the file ntp.d/conf.yaml in the conf.d/ folder at the root of your Agent’s configuration directory."

There doesn't seem to be an option to disable NTP checks (although stating that the check is enabled by default does seem to imply there is a way to turn it off). For organizations that want to keep NTP internal, NTP server locations are configurable.

olivielpeau commented 5 years ago

Hi all,

Apologies for the late reply. As @micahsmith mentioned, you can configure the Agent's NTP check to query your own/your service provider's NTP servers by following the instructions at https://docs.datadoghq.com/integrations/ntp/#configuration.

Also, as with any other Agent check that's enabled by default, you can disable the check completely by removing the file at <agent_conf_dir>/conf.d/ntp.d/conf.yaml.default and restarting the Agent (on a standard Linux install, <agent_conf_dir> is /etc/datadog-agent).

That said, we don't recommend disabling the NTP check entirely as: 1) it allows you to know when your host's clock is skewed 2) an Agent running on a host on which the system clock is significantly skewed (approx. >60s) may report data so much in the past or future that it'll affect your graphs and monitors on Datadog.

We've implemented significant improvements to the NTP check in v6.4.0 and 6.5.0. If you're having issues with the NTP check, please upgrade to Agent >= 6.5.0. If you still run into issues with the NTP check after upgrading, please reach out to our support team and send them your Agent's logs.

ahharu commented 5 years ago

Having this issue on a K8s Deployment on AWS, Agent version 6.10.2.

dabcoder commented 5 years ago

@ahharu Thanks for the heads up, could you send us a note via support@datadoghq.com with a flare and some details about your configuration? We could then assess the situation.

albertvaka commented 5 years ago

We have merged a couple PRs for the upcoming 6.14 release that should make this better. Please let us know if this is still an issue for you after upgrading.