NTPClient.getEpochTime() jumps an hour

bradanlane commented 6 years ago

I am using the NTPClient on a Wemos D1 Mini (ESP8266). On occasion, my code is detecting that the return from NTPClient.update() then NTPClient.getEpochTime() has jumped an hour. I have tested with different NTP server pools (time.nist.gov and pool.ntp.org). The occurrence of the jump is not consistent. I've recorded it in teh morning, event, and over night.

I'm looking for debugging suggestions.

I have the following code fragment (with some gorpy calculations removed):

WiFiUDP ntpUDP;
NTPClient timeClient(ntpUDP, "pool.ntp.org", 0, 3600000); // interval is 1 hour
...
timeClient.begin();
...
... define and initialize some globals
...
void clock_loop() {
    current_ticks = millis();
    if (next_update < current_ticks) {
        if (timeClient.update()) {
            ... save values from previous iteration as last_epoc, last_hour, last_minute
            cur_epoch = timeClient.getEpochTime();
            ... fetch cur_hour and cur_minute and calc next_update
            Serial.printf("Clock updated: %02d, %02d (%lu  %lu)\n", cur_hour, cur_minute, cur_epoch);
        } else {
            // we will assume one minute has passed
            ... compute cur__hour and cur_minute and calc next_update (to be on a minute boundary)
            Serial.printf("Clock update failed: %02d:%02d\n", cur_hour, cur_minute);
        }
        if (ABS(COMBINE(_hours, _minutes) - COMBINE(last_hour, last_minute)) > 2) {
            // more than two minutes have elapsed. something went wrong
            Serial.printf("Clock jumped: from %02d:%02d to %02d:%02d\n", 
                        last_hour, last_minute, cur_hour, cur_minute);
        }
    }
}

Here is a snippet of my logging:

lock updated: 07:05  (1533798300 )
Clock updated: 07:06  (1533798360 )
Clock updated: 07:07  (1533798420 )
Clock updated: 07:08  (1533798480 )
Clock updated: 07:09  (1533798540 )
Clock updated: 07:10  (1533798600 )
Clock updated: 06:11  (1533795061 )
Clock jumped: from 07:10 to 06:11
Clock updated: 06:11  (1533795119 )
Clock updated: 06:12  (1533795120 )
Clock updated: 06:13  (1533795180 )
Clock updated: 06:14  (1533795240 )
Clock updated: 06:15  (1533795300 )

bradanlane commented 6 years ago

I forgot to add ...

The jump appears to match up with the underlying forced update.

When I review the extended log, I see a slight variation in the epoch when the underlying force update gets a new value from the NTP service. These variances are observable at 60 minute intervals which corresponds to the interval parameter when the NTPClient object was created. This leads me to believe the time jump is somehow related to the code related to the forever update.

If there is diagnostic debug data available from NTPClient, I'm happy to leverage that.

neu-rah commented 6 years ago

and its not "day light saving" stuff? sorry if you already checked that TL;DR

bradanlane commented 6 years ago

Thanks for the reply. Yes, I am using the Timezone library and one of the first steps I did was to remove that code.

bradanlane commented 6 years ago

My issue may be specific to the ESP8266 or the more specific Wemos D1 Mini.

bradanlane commented 6 years ago

The time jump does appear to the specified interval set. With the interval set to 1 hour, the jumps were approx an hour.

I just set the interval to 15 minutes and after a short while, NTPClient::forceUpdate() jumped by something close to 15 minutes.

begin new logging session
NTPClient::forceUpdate() 1534117800
NTPClient::forceUpdate() 1534118701
NTPClient::forceUpdate() 1534119661
NTPClient::forceUpdate() failed
Clock update failed: 00:37
NTPClient::forceUpdate() failed
Clock update failed: 00:38
NTPClient::forceUpdate() 1534120682
NTPClient::forceUpdate() 1534120747
Clock jumped: from 00:53 to 00:39
NTPClient::forceUpdate() 1534121712
NTPClient::forceUpdate() 1534122672

bradanlane commented 6 years ago

The error occurs in the forceUpdate() - but only after a forceUpdate() has failed because of a timeout.

FYI: In this example, the interval was set to 15 minutes. My code only calls NTPClient::update() once every minute. Thus, it is possible for NTPClient::forceUpdate() to occur after inetrval + 1 minute eg 16 minutes.

WiFiUDP ntpUDP;
NTPClient timeClient(ntpUDP, "time.nist.gov", 0, 900000);

The log shows the forceUpdate() was successful at 16:09, then failed on the next attempt at 16:25, and then succeed on the next attempt 1 minute later at 16:26. However, the next update was successful but off by the something related to the interval. This is consistent with the above test results.

The log now shows all four timestamps:

NTPClient::forceUpdate()  15:20:02  06:28:16  15:21:01  15:21:01
NTPClient::forceUpdate()  15:36:33  06:28:16  15:37:01  15:37:01
NTPClient::forceUpdate()  15:51:42  06:28:16  15:53:01  15:53:01
NTPClient::forceUpdate()  16:09:00  06:28:16  16:09:01  16:09:01
NTPClient::forceUpdate() timeout
Clock update failed: 16:25
NTPClient::forceUpdate()  16:24:09  06:28:16  16:25:07  16:25:07
NTPClient::forceUpdate()  16:24:35  06:28:16  16:26:04  16:26:04
Clock jumped: from 16:40 to 16:26
NTPClient::forceUpdate()  16:40:36  06:28:16  16:41:57  16:41:57
NTPClient::forceUpdate()  16:56:34  06:28:16  16:57:54  16:57:54

bradanlane commented 6 years ago

I changed my code so it calls NTPClient::update() as often as it can rather than only calling it once per minute. "Most of the time" it appears to recover from timeouts without causing the time jump problem.

begin new logging session
                          REFERENCE   ORIGIN   RECEIVE  TRANSMIT
NTPClient::forceUpdate()  19:32:13  06:28:16  19:32:29  19:32:29
NTPClient::forceUpdate() timeout
NTPClient::forceUpdate()  19:47:22  06:28:16  19:47:31  19:47:31
NTPClient::forceUpdate()  20:00:55  06:28:16  20:02:31  20:02:31
NTPClient::forceUpdate()  20:16:21  06:28:16  20:17:31  20:17:31
NTPClient::forceUpdate()  20:31:23  06:28:16  20:32:31  20:32:31
NTPClient::forceUpdate()  20:46:38  06:28:16  20:47:31  20:47:31
NTPClient::forceUpdate()  21:02:20  06:28:16  21:02:31  21:02:31
NTPClient::forceUpdate() timeout
NTPClient::forceUpdate()  21:17:29  06:28:16  21:17:33  21:17:33
NTPClient::forceUpdate()  21:32:01  06:28:16  21:32:33  21:32:33
NTPClient::forceUpdate()  21:46:11  06:28:16  21:47:33  21:47:33
NTPClient::forceUpdate()  22:02:02  06:28:16  22:02:33  22:02:33
NTPClient::forceUpdate()  22:16:19  06:28:16  22:17:33  22:17:33
NTPClient::forceUpdate()  22:30:59  06:28:16  22:32:33  22:32:33
NTPClient::forceUpdate() timeout
NTPClient::forceUpdate()  22:46:16  06:28:16  22:47:36  22:47:36
NTPClient::forceUpdate()  22:46:16  06:28:16  22:47:37  22:47:37
clockLoop(): clock jumped from local 19:02 to 18:47
NTPClient::forceUpdate()  23:01:14  06:28:16  23:02:37  23:02:37
NTPClient::forceUpdate()  23:16:26  06:28:16  23:17:38  23:17:38

tobozo commented 6 years ago

Just been experiencing the same symptoms and applied the changes seen in this pull request.

Now the time drift isn't observed anymore after a failed update (timeout) followed by a successful update.

bradanlane commented 6 years ago

Thanks. I've grabbed the code from the pull request and am starting my testing.

Update: still good after 34 hours - even with several timeouts. The calls to foreUptate() have drifted about 90 seconds over that same period.

bradanlane commented 6 years ago

I'll have to dig into the pull request details. I am still seeing the possibility of the time jump error:

NTPClient::forceUpdate() 06:02:09
NTPClient::forceUpdate() 06:17:09
NTPClient::forceUpdate() 06:32:09
NTPClient::forceUpdate() 06:47:09
NTPClient::forceUpdate() 07:02:10
NTPClient::forceUpdate() timeout
Clock update failed: 03:17
NTPClient::forceUpdate() 07:17:25
NTPClient::forceUpdate() 07:32:25
NTPClient::forceUpdate() timeout
Clock update failed: 03:47
NTPClient::forceUpdate() 07:47:27
NTPClient::forceUpdate() 07:47:28
clockLoop(): clock jumped from local 04:02 to 03:47
NTPClient::forceUpdate() 08:02:28
NTPClient::forceUpdate() 08:17:29
NTPClient::forceUpdate() 08:32:30
NTPClient::forceUpdate() 08:47:31
NTPClient::forceUpdate() 09:02:31
NTPClient::forceUpdate() 09:17:31
NTPClient::forceUpdate() 09:32:32
NTPClient::forceUpdate() 09:47:34
NTPClient::forceUpdate() 10:02:35
NTPClient::forceUpdate() 10:17:35
NTPClient::forceUpdate() 10:32:36
NTPClient::forceUpdate() 10:47:36
NTPClient::forceUpdate() 11:02:36
NTPClient::forceUpdate() timeout
Clock update failed: 07:17
NTPClient::forceUpdate() 11:32:39
clockLoop(): clock jumped from local 07:17 to 07:32
NTPClient::forceUpdate() 11:47:39

tobozo commented 6 years ago

I haven't had any for the last three days although it's querying every minute and got many timeouts.

This is the code I'm using, it's also patched to check the NTP packets for validity:

https://gist.github.com/tobozo/8bcd16391025352dcf9c8ce66f8fdae0

bradanlane commented 6 years ago

I included your error checking and I'm always failing on

if ((ntpPacket[0] & 0b00111000) >> 3 < 0b100) ...

When I look at the result, the value of 0b001 for the version. I'm using time.nist.gov as my pool server.

tobozo commented 6 years ago

yep this server is long outdated

bradanlane commented 6 years ago

Which pool server do you recommend?

tobozo commented 6 years ago

It probably depends on your location, if you're in the US, you can pick yours as long as it's V4 it's supposed to be more secure.

bradanlane commented 6 years ago

All of the US NTP servers in the pool appear to be V3. A random test of North American servers also were not higher than V3.

I’ll use that as the minimum test.

tobozo commented 6 years ago

I'm giving up this library, so many things are misleading like the latency management during UDP requests, the lack of DST support for timezone, and the too many directions it takes where it should instead converge to actual standards.

Using the AsyncUDP library could solve some of those cumulated latency problems, but after fiddling with this code, I can say that the very idea of doing some time synchronization job in an asynchronous fashion is just plain masochism :rofl:

This library is just perfect for learning, but it's not being practical for any serious project.

For example there's this functional conflict between Timelib::setSyncInterval() and timeClient.setUpdateInterval(), the misleading use of the term epoch to store localized time (The Unix epoch is the time 00:00:00 UTC on 1 January 1970 source), unappropriate recursion, and a ton of open issues pointing to the different anomalies and side effects of all this.

This example turns out to be a perfect codebase form y needs so I can add packet validation, drift calculation and dynamic interval management.

bradanlane commented 6 years ago

Thanks. I am giving the alternative code a try. while the NTP code is nearly identicle, the update mechanism seems to not care about timeouts and keeps consistent timing.

rselec commented 5 years ago

I have the same issue with an ESP8266. Rest of software runs fun but sometimes the time reported from the NTP lib is is half an hour (or multiples of that) to early. My update Intervall is 30 minutes so that should be the same problem. That occurs very randomly, sometimes it is running smoothly for several weeks, than I have that problem only several hours after a reboot.

Since no one is working here anymore I build my own functions based on the ESP WiFi NTP example which is working fine, I need only the minute of day. But I may discovered the reason for the problem:

What happens if the response of the NTP-Server needs more than a second?

The function NTPClient::forceUpdate() escapes in line 76 and won't handle any late responses. But the late UDP response packet will still be received by the UDP layer.

An half hour later the next NTP sync should happen and NTPClient::forceUpdate() is called again. It doesn't clear old UDP packages and just request a new one from the time server. In line 75 it receives immediately the old, timed out answer from the request a half hour ago, because it is still in the buffer of the UDP Layer. So the the timestamp from the old request is handled as the current one and the time shifts almost exactly the sync interval time to the past.

So we should at least delete received UDP packets at the beginning of NTPClient::forceUpdate() before requesting the new NTP packet, to make "sure" that the received packet is really the answer of our latest request. I don't know much of NTP and it might be better to implement a check based of the information of the UDP packet. Maybe there an ID to identify the packer or so.

Another problem might be a UDP buffer overflow when to many UDP packets are timed out...

Best Bimbo385

JHale716 commented 5 years ago

One thing that needs consideration is the TOS for the time pool, updates under 30 minutes are frowned upon. This may be contributing to the problem as the time server is seeing the 1 minute polling as too often and is throttling the responses. One of these is the KOD and it returns packets with the same timestamp as the send, which may be triggering the different in the time taking into account local offset etc.

seba1213 commented 4 years ago

I encountered a similar problem in my project, I have a refresh rate of 10 minutes, I have jumps multiples of 10 minutes, and the discrepancy may increase with time, it was 30 minutes. I have esp-01 I interesting what will happen if I set a day

joefernwright commented 2 years ago

I encountered exactly the same problem - update interval set to 1h and for some hours the lib keeps running the right clock and then suddenly...randomly it jumps back 1h. Unacceptable behavior. It makes no sense to use this lib.

Platform - ESP8266 (WEMOS D1 mini)

arduino-libraries / NTPClient

NTPClient.getEpochTime() jumps an hour #52