Azure Kubernetes Service (https://azure.github.io/AKS/)

Clock skew detected in AKS clusters #1317

Closed: AlexDCraig closed this issue 3 years ago.

AlexDCraig commented 4 years ago

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Here is a picture of the clock skew estimates across the three environments in question:

[image: clock skew estimates across the three environments]

The legend has been omitted for confidentiality reasons.

We also host the same service set in the West Europe region and have not seen clock skew affect resources there.

Environment:

kmoussa commented 4 years ago

I have the same issue in the West Europe/North Europe regions on two clusters; the Kubernetes version is 1.14.8.

balajivenkataraman commented 4 years ago

We have the same problem in UK South: three clusters on Kubernetes 1.14.8, and all of them report "clock skew detected for node(s):".

jon-walton commented 4 years ago

We're also seeing this frequently in East Asia, but not on any of our South Central US clusters.

rpocase commented 4 years ago

Seeing the same problem in east-us (single cluster, 4 nodes in one node pool). This started after upgrading from 1.15.5 to 1.16.7 and seems to flap fairly frequently.

monaka commented 4 years ago

Same here: wus-2, 1.15.10. I suspect this is not caused by AKS itself but by the Linux kernel or a lower layer.

I'm watching my Linux instances on Azure VMs with node_exporter, and the issue shows up there as well.
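
For what it's worth, the "clock skew detected" alert in typical kube-prometheus setups fires on node_exporter's timex collector. A minimal sketch of querying the same signal by hand, assuming a reachable Prometheus endpoint (the URL and the 0.05 s threshold below are illustrative, mirroring the common kubernetes-mixin alert rule):

# Ask Prometheus which nodes report a kernel clock offset above 50 ms.
# node_timex_offset_seconds comes from node_exporter's timex collector;
# http://prometheus:9090 is a placeholder for your Prometheus address.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=abs(node_timex_offset_seconds) > 0.05'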

bdschaap commented 4 years ago

This is happening to our AKS clusters as well but not our on-prem Rancher K8s clusters.

github-actions[bot] commented 4 years ago

Action required from @Azure/aks-pm

ghost commented 4 years ago

Action required from @Azure/aks-pm

palma21 commented 4 years ago

We're looking into this with the Azure Linux team in order to improve time sync reliability. I'll post here as soon as we have findings.

CC @juan-lee @xuto2 @qike-ms

masters3d commented 4 years ago

Are there any mitigation steps?

ghost commented 4 years ago

Action required from @palma21.

palma21 commented 4 years ago

You can use a DaemonSet to change the time-sync servers, in case that helps.

We found a few cases where the default Ubuntu NTP servers might not be reliable across all regions, so we'll be moving to host sync.
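
As a sketch of that DaemonSet idea, assuming Ubuntu nodes where chrony owns timekeeping: the manifest below chroots into the host filesystem, appends an explicit NTP server to chrony's config, and restarts the service. Every name in it (ntp-override, the ubuntu image, 0.pool.ntp.org, the config path) is an illustrative assumption, not an official AKS manifest.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ntp-override
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ntp-override
  template:
    metadata:
      labels:
        app: ntp-override
    spec:
      containers:
      - name: set-ntp-server
        image: ubuntu:18.04
        securityContext:
          privileged: true      # needed to chroot into the host
        volumeMounts:
        - name: host-root
          mountPath: /host
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Append the server once (idempotent), restart chrony on the host,
          # then sleep so the pod stays Running after the one-shot change.
          chroot /host /bin/sh -c \
            "grep -q '0.pool.ntp.org' /etc/chrony/chrony.conf || \
             { echo 'server 0.pool.ntp.org iburst' >> /etc/chrony/chrony.conf && \
               systemctl restart chrony; }"
          sleep infinity
      volumes:
      - name: host-root
        hostPath:
          path: /
EOF

Once every node reports in sync again, the DaemonSet can simply be deleted; the config change it made persists on the nodes until they are reimaged.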

ghost commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

djsly commented 4 years ago

Any quick remediation / command to force a resync? We are currently affected in UKSOUTH.

ffais commented 3 years ago

I have the same problem in WESTEU; maybe this can help the troubleshooting process:

azureuser@aks-nodepool1-vmss0:~$ sudo service systemd-timesyncd status
● systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/systemd-timesyncd.service.d
           └─disable-with-time-daemon.conf
   Active: inactive (dead)
Condition: start condition failed at Mon 2020-11-16 10:56:27 UTC; 18s ago
           ConditionFileIsExecutable=!/usr/sbin/chronyd was not met
     Docs: man:systemd-timesyncd.service(8)

PS: @djsly I found a workaround; run these commands on each node of the cluster:

# one-shot sync against an explicit server first (this fails if the chronyd
# daemon is already running), then leave the chrony service running
sudo chronyd -q 'server 0.europe.pool.ntp.org iburst'
sudo service chrony start
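
If it helps, chrony can also report how far a node has actually drifted, which is a quick way to confirm the workaround took (chrony is already installed on these node images, per the status output above):

sudo chronyc tracking     # the "System time" line shows the current offset
sudo chronyc sources -v   # lists configured time sources and their reachability
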
djsly commented 3 years ago

Can you please share your VMSS OS disk image version?

There was a known issue with 2020.10.28 which was patched afterward. You might just need to recreate your agent pool and you should be all set.

ffais commented 3 years ago

This is the version currently in use: 2020.10.28.

djsly commented 3 years ago

Replacing your node pool should fix it.
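
For anyone who needs the concrete steps, node pool replacement is a pair of Azure CLI calls; a sketch with placeholder resource names (myRG, myAKS, and the pool names are assumptions):

# bring up a replacement pool on a freshly built node image...
az aks nodepool add --resource-group myRG --cluster-name myAKS \
  --name newpool --node-count 3
# ...then drop the affected pool once workloads have moved over
az aks nodepool delete --resource-group myRG --cluster-name myAKS \
  --name nodepool1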

xuto2 commented 3 years ago

Or do a node image upgrade to the latest VHD version, 2020.11.11, when it lands in this week's release.
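
For reference, a node image upgrade can be checked and triggered from the Azure CLI; a sketch with placeholder names (at the time this may have required the aks-preview extension):

# see which node image version a pool is currently running
az aks nodepool show --resource-group myRG --cluster-name myAKS \
  --name nodepool1 --query nodeImageVersion
# roll the pool onto the latest available node image without
# changing the Kubernetes version
az aks nodepool upgrade --resource-group myRG --cluster-name myAKS \
  --name nodepool1 --node-image-only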

ghost commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

ghost commented 3 years ago

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. AlexDHoffer, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.

palma21 commented 3 years ago

Host sync and chrony are in use as of the 2021-03-08 release.
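
To verify a node has picked this up, one option is to check that chrony is the active time daemon and, per Azure's Linux time-sync documentation, that it references the host's PTP clock device rather than public NTP pools (the config path below assumes the stock Ubuntu chrony layout):

# chrony should be running as the node's time service
sudo systemctl status chrony --no-pager
# with host sync, expect a 'refclock PHC /dev/ptp0' style entry here
grep -E 'refclock|^server' /etc/chrony/chrony.conf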