etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Remove clock drift checks from connectivity monitor #16432

Open serathius opened 1 year ago

serathius commented 1 year ago

What would you like to be added?

I would like to propose removing the "prober found high clock drift" log, as it incorrectly implies that etcd is impacted by clock drift.

To my knowledge, there is no impact of clock difference on etcd version 3.

Raft itself doesn't depend on time in any way. It measures the passage of time for things like health probes, but doesn't compare time between members. The only part of etcd that could be impacted is leases, see https://github.com/etcd-io/etcd/issues/9768#issuecomment-391564180.

The connectivity monitor that reports the time drift was introduced for etcd v2 (https://github.com/etcd-io/etcd/pull/3210). In etcd v3, leases were rewritten to depend on time differences rather than absolute timestamps, so they should not be affected (https://github.com/etcd-io/etcd/pull/3834). Leases also use monotonic time (https://github.com/etcd-io/etcd/pull/6888, https://github.com/etcd-io/etcd/pull/8507), meaning wall-clock changes should not impact TTLs.
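To illustrate the monotonic-time point: Go's time.Time carries a monotonic reading alongside the wall clock, so elapsed-time measurement is immune to wall-clock steps. A minimal sketch of that standard-library behavior (not etcd code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now() // carries both a wall-clock and a monotonic reading

	// Simulate a short TTL countdown.
	time.Sleep(2 * time.Second)

	// time.Since subtracts monotonic readings, so even if NTP stepped the
	// wall clock during the sleep, the measured elapsed time is unaffected.
	fmt.Println("elapsed (monotonic):", time.Since(start).Round(time.Second))

	// Round(0) strips the monotonic reading; arithmetic on the result uses
	// wall-clock values only, which *would* be affected by clock changes.
	wallOnly := start.Round(0)
	fmt.Println("elapsed (wall only):", time.Now().Sub(wallOnly).Round(time.Second))
}
```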

I expect the connectivity monitor stayed because etcd v3.5 still officially supports the v2 API. The next release, v3.6, removes the v2 API, so we can remove the clock drift detection too.
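For background on what that detection does: the monitor periodically probes peers and compares the peer's reported wall-clock time against the local clock, adjusted for round-trip time. A hedged sketch of the idea (the function name and threshold here are illustrative, not etcd's actual probing code):

```go
package main

import (
	"fmt"
	"time"
)

// fetchPeerTime stands in for the HTTP probe that asks a peer for its
// current wall-clock time; here we simulate a peer whose clock runs
// 2 seconds ahead. (Illustrative only, not etcd's probing API.)
func fetchPeerTime() time.Time {
	return time.Now().Add(2 * time.Second)
}

func main() {
	sent := time.Now()
	peerNow := fetchPeerTime()
	received := time.Now()

	rtt := received.Sub(sent)
	// Assume the peer stamped its clock at the midpoint of the round trip.
	drift := received.Add(-rtt / 2).Sub(peerNow)

	const threshold = time.Second // illustrative threshold
	if drift > threshold || drift < -threshold {
		fmt.Printf("prober found high clock drift: %v (rtt %v)\n", drift, rtt)
	} else {
		fmt.Printf("clock drift within tolerance: %v (rtt %v)\n", drift, rtt)
	}
}
```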

Please let me know if you are aware of any place that etcd could be impacted by clock drift.

Why is this needed?

Prevents user confusion about clock drift impact on etcd.

Aditya-Sood commented 1 year ago

hi @serathius, will you be working on this issue, or is it open for contribution?

serathius commented 1 year ago

Opening for discussion. @ptabor @ahrtr @jmhbnz

ccing everyone involved with the lease work: @xiang90 @gyuho @hexfusion @yichengq @jonboulle @heyitsanthony

ahrtr commented 1 year ago

The lease depends on wall time; if a member's local clock drifts, then the lease will be affected. Simply put, all usage of time.Now() and time.After/time.Before may return invalid results when the clock drifts. So I suggest keeping the warning ("prober found high clock drift") as it is for now.

The existing lease implementation has a big design issue; we need to think about how to refactor it, which is already tracked on the roadmap.

serathius commented 1 year ago

The lease depends on wall time; if a member's local clock drifts, then the lease will be affected.

I don't think the impact of clock drift on wall time matters. By impact on leases I mean a consistency issue: situations where clock drift could cause one member to consider a lease expired while another member doesn't.

To clarify, by clock drift I mean a small error in the physical clock that accumulates over time. It might be an error of a few seconds per month that over a year accumulates to a minute or two. This is negligible for lease wall-time calculation, as leases are meant to be short; the Kubernetes lease for events is unusual because it lasts 2 hours. For a 2-hour lease, the wall-time impact of clock drift should be negligible (below 1 second), which is acceptable.

Clock drift between members would matter if the TTL were calculated as an exact deadline, not as a time difference. For example, with a lease TTL of 1 hour, it doesn't matter if one member counts from 18:00:00 to 19:00:00 while another member is 10 seconds ahead and counts from 18:00:10 to 19:00:10.
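A minimal Go sketch of that distinction (the 10-second offset mirrors the example above; names and dates are illustrative, not etcd's code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ttl := time.Hour
	offset := 10 * time.Second // member B's wall clock runs 10s ahead of A's

	grant := time.Date(2024, 1, 1, 18, 0, 0, 0, time.UTC) // true grant instant
	grantSeenByA := grant             // A stamps 18:00:00
	grantSeenByB := grant.Add(offset) // B stamps 18:00:10

	// One hour of real time later, each member reads its own clock.
	nowA := grant.Add(ttl)
	nowB := grant.Add(ttl).Add(offset)

	// Time-difference style (etcd v3): the constant offset cancels out.
	fmt.Println("A measures elapsed:", nowA.Sub(grantSeenByA)) // 1h0m0s
	fmt.Println("B measures elapsed:", nowB.Sub(grantSeenByB)) // 1h0m0s

	// Exact-deadline style (hypothetical): members would compare their own
	// clocks against a shared absolute timestamp, exposing the offset.
	deadline := grantSeenByA.Add(ttl) // 19:00:00 as computed by A
	fmt.Println("B overshoots the shared deadline by:", nowB.Sub(deadline)) // 10s
}
```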

xiang90 commented 1 year ago

@serathius

Two things:

  1. Yes. Clock drift between different nodes should not affect etcd v3. We designed it that way, and that is why all leases get renewed/extended automatically once there is a leader switch. We could do some handshake and renew with a shorter period to make leases more accurate even under a voluntary leader switch.

  2. Keeping clocks in sync is still good... It makes debugging 100 times simpler. In any environment with NTP enabled, we should expect less than 500ms of clock difference. So warning on that is probably still OK... I would not suggest removing it. But it's really up to the current maintainers, since it is not a correctness issue.

serathius commented 1 year ago

Thanks @xiang90 for confirming there is no correctness issue. I agree that having clock drift in a cluster is not great; however, this is a problem external to etcd. Many assumptions break with clock drift. Take centralized logging: it becomes useless for debugging if logs are collected from nodes with clock drift.

I don't think this is inherently an etcd problem. I have had multiple scared users asking me what the impact on etcd is. Will it not perform well? Is the whole cluster's consistency at risk? Overall, I think it's not a good idea to try to solve problems that users don't have. etcd is not a monitoring system, so it should not monitor or alert on clock drift. It just confuses users about why etcd cares about clock drift.

We can warn users that clock drift makes debugging etcd hard. We can recommend that users run NTP and monitor their clock drift, and even provide a link to external tools, but etcd should not take this responsibility on itself.

Aditya-Sood commented 1 year ago

hi @serathius, @xiang90, since the log is already of WARN type, I suppose no contribution is expected for this issue?

tjungblu commented 1 year ago

I'm also +1 for removing it. WDYT about adding a timestamp to endpoint/status? That way we can still have some indication when we go through (disconnected) customer logs.

serathius commented 1 year ago

@Aditya-Sood We haven't made a decision on how to proceed yet.

serathius commented 1 year ago

WDYT about adding a timestamp to endpoint/status? That way we can still have some indication when we go through (disconnected) customer logs.

I don't think it will help. Clock drift at the moment of the request can be totally unrelated to the clock drift present at the moment the logs were written.

serathius commented 1 year ago

As confirmed in https://github.com/etcd-io/etcd/issues/16432#issuecomment-1710630587, I would like to repeat the proposal to remove the "prober found high clock drift" log. I think it causes confusion and doesn't help with debugging (more detail in https://github.com/etcd-io/etcd/issues/16432#issuecomment-1711288764).

ping @ahrtr @jmhbnz @wenjiaswe

jmhbnz commented 1 year ago

Hey Team - I've been following this thread and held off commenting as I'm still not fully familiar with the underlying code in question. However, to answer the ping above, my vote would be to continue informing users that clock drift over a certain threshold exists, but to ensure this is done in such a way that it is clear there is no impact on etcd consistency.

ahrtr commented 1 year ago

Situations where clock drift could cause one member to consider a lease expired while another member doesn't.

Isn't this a problem caused by clock drift? Especially when the problematic member is the leader.

Overall, I don't think this ticket deserves much discussion time before the issue I mentioned in https://github.com/etcd-io/etcd/issues/16432#issuecomment-1685912014 is resolved, especially https://github.com/etcd-io/etcd/issues/15247.

serathius commented 1 year ago

Situations where clock drift could cause one member to consider a lease expired while another member doesn't.

Isn't this a problem caused by clock drift? Especially when the problematic member is the leader.

No, because:

  • The decision that a lease has expired is made by the leader and executed via quorum.

  • Clock drift doesn't impact the leader's decision, as lease durations are not long enough to be affected by it. The longest leases we have (Kubernetes events) have a TTL of 2 hours; this is not enough for clock drift to matter.

ahrtr commented 1 year ago

  • The decision that a lease has expired is made by the leader and executed via quorum.

FYI, it's executed by the quorum rather than agreed on (reaching consensus) by the quorum.

Also, per your logic, issue https://github.com/etcd-io/etcd/issues/15247 should NOT happen, because the out-of-date leader (which got stuck on a write for a long time) will never get consensus.

  • Clock drift doesn't impact the leader's decision, as lease durations are not long enough to be affected by it. The longest leases we have (Kubernetes events) have a TTL of 2 hours; this is not enough for clock drift to matter.

Please do not make assumptions about use cases.

serathius commented 1 year ago

FYI, it's executed by the quorum rather than agreed on (reaching consensus) by the quorum.

What I meant here is that the decision that a lease should be invalidated is made by the leader and then proposed to raft.

Also, per your logic, issue https://github.com/etcd-io/etcd/issues/15247 should NOT happen, because the out-of-date leader (which got stuck on a write for a long time) will never get consensus.

No, that's not true. My understanding is as follows (please correct me if any of these points is incorrect); a code sketch of the leader-side countdown follows the list:

  1. A lease grant request is sent; creation is negotiated via quorum and applied to every member's backend.
     1a. If the member is the leader, it also starts a clock to count down the lease TTL.
     1b. Read requests about the lease TTL (ListLeases, Lease) are forwarded to the leader, which returns the time from the countdown clock it started.
  2. If the leader changes:
     2a. the old leader stops the lease countdown clock;
     2b. the new leader starts the clock again with the original TTL. Notice that the TTL is reset (assuming the optional lease checkpointing feature is disabled).
  3. When the lease TTL passes, the clock started by the leader prompts it to send a LeaseRevoke request.
     3a. Read requests about the lease TTL return 1 second until the lease is properly revoked.
  4. The LeaseRevoke is executed by quorum and applied to every member's backend.
     4a. Read requests about the lease return that the lease doesn't exist.
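To make steps 1-4 concrete, here is a hedged sketch of the leader-side countdown (type and field names like lessor and revokeCh are illustrative, not etcd's internals). The key property: only the leader's own clock measures the TTL, and a new leader restarts the countdown with the full TTL (step 2b).

```go
package main

import (
	"fmt"
	"time"
)

// lease tracks a TTL countdown on the leader only; the expiry deadline
// never leaves the leader, so follower clocks are irrelevant.
type lease struct {
	id     int64
	ttl    time.Duration
	expiry time.Time // leader-local deadline, backed by Go's monotonic clock
}

type lessor struct {
	leases   map[int64]*lease
	isLeader bool
	revokeCh chan int64 // revoke proposals handed to raft (step 4)
}

// promote runs on a newly elected leader: every countdown restarts with
// the full original TTL (step 2b, checkpointing disabled).
func (l *lessor) promote() {
	l.isLeader = true
	now := time.Now()
	for _, ls := range l.leases {
		ls.expiry = now.Add(ls.ttl)
	}
}

// demote is step 2a: the old leader stops judging expiry entirely.
func (l *lessor) demote() { l.isLeader = false }

// tick is step 3: only the leader proposes revocation of expired leases;
// the revoke itself is then applied via quorum (step 4).
func (l *lessor) tick() {
	if !l.isLeader {
		return
	}
	now := time.Now()
	for id, ls := range l.leases {
		if now.After(ls.expiry) {
			l.revokeCh <- id
			delete(l.leases, id)
		}
	}
}

func main() {
	l := &lessor{
		leases:   map[int64]*lease{1: {id: 1, ttl: 2 * time.Second}},
		revokeCh: make(chan int64, 1),
	}
	l.promote()
	time.Sleep(3 * time.Second)
	l.tick()
	fmt.Println("leader proposes LeaseRevoke for lease", <-l.revokeCh)
}
```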

Issue https://github.com/etcd-io/etcd/issues/15247 is caused by point 2a not being executed properly. The old leader doesn't know that it should step down, and its countdown clock keeps ticking. This results in the old leader executing point 3 even though it shouldn't. I pointed out these issues in https://github.com/etcd-io/etcd/issues/15944 a long time ago.

Clock drift doesn't influence any of these steps, as the cluster only depends on the countdown clock on the leader. If the leader changes, the TTL is reset. If the leader's clock is 10 seconds behind the other members', it doesn't matter: the time difference it counts down will still equal the TTL.

Clock drift doesn't impact the leader's decision, as lease durations are not long enough to be affected by it. The longest leases we have (Kubernetes events) have a TTL of 2 hours; this is not enough for clock drift to matter.

Please do not make assumptions about use cases.

We need to make assumptions, especially about leases, which were designed as short-lived leader-election tokens. Two-hour leases are already a big problem for Kubernetes due to the lack of checkpointing.

Imagine that you generate 1 GB of Kubernetes events per hour and the events have a 1-hour TTL. Every time there is a leader change (which can easily happen multiple times in an hour), the TTL is reset. After one leader change you retain 2 GB, after two leader changes 3 GB, and so on.
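Spelling out the arithmetic: retained event data ≈ event rate × TTL × (leader changes + 1). With 1 GB/h and a 1-hour TTL, two leader changes leave about 1 GB/h × 1 h × (2 + 1) = 3 GB live at once.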

prensgold commented 4 months ago

Hi,

Our PostgreSQL Patroni cluster is built on 3 etcd nodes. Due to a problem we experienced with NTP, one of the Patroni cluster nodes rebooted. Before the reboot, it was also logging the errors below. Can an NTP problem affect etcd and cause the node to reboot?

Our etcd version: 3.5.13

Jun 25 14:10:06 lprdendvtb01 adclient[1681]: INFO AUDIT_TRAIL|Centrify Suite|Trusted Path|1.0|2700|Trusted path granted|5|user=lprdendvtb01$@DOM.LOCAL pid=1681 utc=1719313806836 centrifyEventID=23700 DASessID=N/A DAInst=N/A status=GRANTED server=ldap/aplan01.dom.local@DOM.LOCAL
Jun 25 14:10:06 lprdendvtb01 adclient[1681]: WARN dns.resolver DNS server 127.0.0.1 is down. (ErrCode: 111) : Connection refused
Jun 25 14:20:36 lprdendvtb01 adclient[1681]: INFO AUDIT_TRAIL|Centrify Suite|Trusted Path|1.0|2700|Trusted path granted|5|user=lprdendvtb01$@DOM.LOCAL pid=1681 utc=1719314436837 centrifyEventID=23700 DASessID=N/A DAInst=N/A status=GRANTED server=ldap/aplan01.dom.local@DOM.LOCAL
Jun 25 14:20:36 lprdendvtb01 adclient[1681]: WARN dns.resolver DNS server 127.0.0.1 is down. (ErrCode: 111) : Connection refused
Jun 25 14:24:04 lprdendvtb01 adclient[1681]: INFO AUDIT_TRAIL|Centrify Suite|Trusted Path|1.0|2700|Trusted path granted|5|user=lprdendvtb01$@DOM.LOCAL pid=1681 utc=1719314644592 centrifyEventID=23700 DASessID=N/A DAInst=N/A status=GRANTED server=ldap/aplanisym1.dom.local@DOM.LOCAL
Jun 25 14:24:13 lprdendvtb01 bash[1245]: {"level":"info","ts":"2024-06-25T14:24:13.263699+0300","caller":"mvcc/index.go:214","msg":"compact tree index","revision":1082821}
Jun 25 14:24:13 lprdendvtb01 bash[1245]: {"level":"info","ts":"2024-06-25T14:24:13.276227+0300","caller":"mvcc/kvstore_compaction.go:68","msg":"finished scheduled compaction","compact-revision":1082821,"took":"12.463401ms","hash":3971677063,"current-db-size-bytes":1171456,"current-db-size":"1.2 MB","current-db-size-in-use-bytes":745472,"current-db-size-in-use":"746 kB"}
Jun 25 14:24:13 lprdendvtb01 bash[1245]: {"level":"info","ts":"2024-06-25T14:24:13.276258+0300","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":3971677063,"revision":1082821,"compact-revision":1081403}
Jun 25 14:29:37 lprdendvtb01 bash[1245]: {"level":"warn","ts":"2024-06-25T14:29:37.129853+0300","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"b89c3cc08bb44b1b","clock-drift":"59.583480115s","rtt":"420.306µs"}
Jun 25 14:29:37 lprdendvtb01 bash[1245]: {"level":"warn","ts":"2024-06-25T14:29:37.129874+0300","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"b89c3cc08bb44b1b","clock-drift":"59.583314098s","rtt":"586.372µs"}
Jun 25 14:29:45 lprdendvtb01 bash[1245]: {"level":"warn","ts":"2024-06-25T14:29:45.952063+0300","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"8.897µs","request":"header: put:<key:\"/service/pg15_endor/members/lprdendvtb03\" value_size:227 lease:5412077135619815544 >","response":"","error":"lease not found"}
Jun 25 14:31:17 lprdendvtb01 kernel: The list of certified hardware and cloud instances for Red Hat Enterprise Linux 9 can be viewed at the Red Hat Ecosystem Catalog, https://catalog.redhat.com.
Jun 25 14:31:17 lprdendvtb01 kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-362.13.1.el9_3.x86_64 root=/dev/mapper/vgroot-lvroot ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/vgroot-lvswap rd.lvm.lv=vgroot/lvroot rd.lvm.lv=vgroot/lvswap rhgb quiet
Jun 25 14:31:17 lprdendvtb01 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jun 25 14:31:17 lprdendvtb01 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jun 25 14:31:17 lprdendvtb01 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jun 25 14:31:17 lprdendvtb01 kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
Jun 25 14:31:17 lprdendvtb01 kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
Jun 25 14:31:17 lprdendvtb01 kernel: x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
Jun 25 14:31:17 lprdendvtb01 kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User