apache / nuttx-apps

Apache NuttX Apps is a collection of tools, shells, network utilities, libraries, and interpreters that can be used with the NuttX RTOS
https://nuttx.apache.org/
Apache License 2.0

DNS issues on NTP client #2440

Open duduita opened 1 month ago

duduita commented 1 month ago

When ntpclient.c queries NTP servers, different server domain names (e.g., 0.uk.pool.ntp.org, 1.uk.pool.ntp.org) can resolve to the same set of IP addresses because of DNS caching. The client then keeps querying the same non-responsive IP addresses and fails to obtain the correct time.

For example, here is the output of some logging that I added to ntpclient.c to understand why NTP was failing:

[   51.046000] [25] (Info)  '0.pool.ntp.org' resolved to: 216.238.113.58
[   51.046000] [25] (Info) ntpclient.c-447-gethostbyname for 0.pool.ntp.org OK
[   51.046000] [25] (Info) ntpclient.c-480-Sending a NTP packet
[   51.055000] [25] (Info) ntpclient.c-509-sendto ret: 68
[   51.056000] [25] (Info) ntpclient.c-515-Recv a NTP packet
[   56.055000] [25] (Info) ntpclient.c-521-recvfrom nbytes: -1
[   56.056000] [25] (Info)  '0.pool.ntp.org' resolved to: 216.238.113.58
[   56.056000] [25] (Info) ntpclient.c-447-gethostbyname for 0.pool.ntp.org OK
[   56.056000] [25] (Info) ntpclient.c-480-Sending a NTP packet
[   56.063000] [25] (Info) ntpclient.c-509-sendto ret: 68
[   56.063000] [25] (Info) ntpclient.c-515-Recv a NTP packet
[   61.065000] [25] (Info) ntpclient.c-521-recvfrom nbytes: -1
[   61.066000] [25] (Info)  '0.pool.ntp.org' resolved to: 216.238.113.58
[   61.066000] [25] (Info) ntpclient.c-447-gethostbyname for 0.pool.ntp.org OK
[   61.066000] [25] (Info) ntpclient.c-480-Sending a NTP packet
[   61.075000] [25] (Info) ntpclient.c-509-sendto ret: 68
[   61.075000] [25] (Info) ntpclient.c-515-Recv a NTP packet
[   66.075000] [25] (Info) ntpclient.c-521-recvfrom nbytes: -1
[   66.076000] [25] (Info)  '0.pool.ntp.org' resolved to: 216.238.113.58
[   66.076000] [25] (Info) ntpclient.c-447-gethostbyname for 0.pool.ntp.org OK
[   66.076000] [25] (Info) ntpclient.c-480-Sending a NTP packet
[   66.085000] [25] (Info) ntpclient.c-509-sendto ret: 68
[   66.085000] [25] (Info) ntpclient.c-515-Recv a NTP packet
[   71.085000] [25] (Info) ntpclient.c-521-recvfrom nbytes: -1
[   71.086000] [25] (Info)  '0.pool.ntp.org' resolved to: 216.238.113.58
[   71.086000] [25] (Info) ntpclient.c-447-gethostbyname for 0.pool.ntp.org OK
[   71.086000] [25] (Info) ntpclient.c-480-Sending a NTP packet
[   71.095000] [25] (Info) ntpclient.c-509-sendto ret: 68
[   71.095000] [25] (Info) ntpclient.c-515-Recv a NTP packet
[   76.095000] [25] (Info) ntpclient.c-521-recvfrom nbytes: -1
[   76.095000] [25] (Info) ntpclient.c-563-ERROR: recvfrom() failed: 11
[   76.095000] [25] (Info) ntpclient.c-589-The NTP client is terminating
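
For reading the log: error 11 is EAGAIN in NuttX's errno.h (and on Linux). On a socket with a receive timeout, that value means the wait elapsed without a reply, which matches the roughly 5 s gap before each `recvfrom nbytes: -1`. A minimal sketch of telling a timeout apart from a hard failure follows; `recv_failure_is_timeout` is a hypothetical helper for illustration, not a function in ntpclient.c.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper (not part of ntpclient.c): returns true when the
 * errno value from a failed recvfrom() only says that no reply arrived
 * in time, so trying another server or address is worthwhile.
 */

bool recv_failure_is_timeout(int err)
{
  /* POSIX allows either value when a receive timeout expires; on many
   * systems the two constants are equal.
   */

  return err == EAGAIN || err == EWOULDBLOCK;
}
```

A caller would pass the `errno` saved right after `recvfrom()` returned -1 and treat any other value as a real socket error.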

To mitigate this, one option is to flush the DNS cache after cycling through all configured NTP servers, so that subsequent resolutions can return new, responsive IP addresses and time synchronization is more likely to succeed. However, I cannot manipulate the DNS cache from user space unless I create an API for it.
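
A user-space alternative that needs no cache API might be to remember which addresses already failed and skip them when another server name resolves to the same IP. This does not force a new resolution; it only avoids spending another timeout on a known-dead address. The sketch below is a rough illustration: `failed_addrs`, `FAILED_MAX`, and the three functions are hypothetical names, and nothing like this exists in ntpclient.c today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAILED_MAX 8  /* Assumed capacity; size it to the server list */

struct failed_addrs
{
  uint32_t addr[FAILED_MAX];  /* IPv4 addresses that did not answer */
  size_t   count;             /* Number of valid entries */
  size_t   next;              /* Slot to overwrite on the next add */
};

/* Remember an address that timed out; once the list is full the oldest
 * entry is overwritten.
 */

void failed_addrs_add(struct failed_addrs *f, uint32_t addr)
{
  assert(f != NULL);
  f->addr[f->next] = addr;
  f->next = (f->next + 1) % FAILED_MAX;
  if (f->count < FAILED_MAX)
    {
      f->count++;
    }
}

/* Return true if addr was recorded as non-responsive. */

bool failed_addrs_contains(const struct failed_addrs *f, uint32_t addr)
{
  size_t i;

  assert(f != NULL);
  for (i = 0; i < f->count; i++)
    {
      if (f->addr[i] == addr)
        {
          return true;
        }
    }

  return false;
}

/* Return the index of the first resolved address that has not failed
 * yet, or -1 when every candidate is already known to be dead.
 */

int pick_untried(const struct failed_addrs *f,
                 const uint32_t *resolved, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (!failed_addrs_contains(f, resolved[i]))
        {
          return (int)i;
        }
    }

  return -1;
}
```

When `pick_untried()` returns -1, the caller could clear the list and wait before retrying, so that the cached entries have a chance to expire.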

Do you have a workaround or a hack that I can use to solve this NTP issue, or at least to force a new IP resolution for an NTP hostname after some failures?

acassis commented 1 month ago

@duduita thank you for finding and reporting this issue!

@wengzhe did you see that?

wengzhe commented 1 month ago

Hi @duduita, since 12.1.0 a DNS cache entry becomes invalid once the TTL supplied by the DNS server has expired, and the next lookup of that domain name sends a new query. In my local environment the TTL for 0.pool.ntp.org is always less than 100 s (normally about 20 s). Maybe your DNS server returns non-responsive IP addresses with a longer TTL, which would cause this problem.

There is a hack that may force the domains to be resolved again: set CONFIG_NETDB_DNSCLIENT_LIFESEC to a shorter value, e.g. 5 seconds. After 5 s the cache entry becomes invalid, and the next lookup resolves the domain again.
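
In a defconfig this would be the line `CONFIG_NETDB_DNSCLIENT_LIFESEC=5`. To show how the two limits interact, here is a toy model that assumes an entry counts as stale once its age exceeds either the server TTL or the configured lifetime, which is what the description above implies. `dns_entry_model` and `dns_entry_is_stale` are hypothetical names, not the actual NuttX cache code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one cache entry; not the real NuttX structure. */

struct dns_entry_model
{
  uint32_t ctime;  /* Time the answer was cached, in seconds */
  uint32_t ttl;    /* TTL reported by the DNS server, in seconds */
};

/* Return true when the entry should be resolved again. lifesec stands
 * in for CONFIG_NETDB_DNSCLIENT_LIFESEC; assumes now >= e->ctime.
 */

bool dns_entry_is_stale(const struct dns_entry_model *e,
                        uint32_t now, uint32_t lifesec)
{
  uint32_t elapsed = now - e->ctime;

  /* Whichever limit is shorter decides (assumed behaviour) */

  return elapsed > e->ttl || elapsed > lifesec;
}
```

Under this model, an answer with a 3600 s TTL cached at t = 100 s is still fresh at t = 104 s but stale at t = 106 s when the lifetime is 5 s, so a long server TTL no longer pins the client to a dead address.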