[Feature Request] High Availability to AKS Link/Tunnel Front Service

dhananjaya94 commented 4 years ago

What happened: We have 2 AKS clusters running, one used for non production and other one for production.

Non-Prod AKS cluster Uptime SLA not enabled
Prod AKS cluster Uptime SLA is enabled

In Non Production cluster there is a pod named tunnelfront in kube-system namespace In Production cluster there is a pod named aks-link in kube-system namespace.

From the support ticket 120073023000987 we go to know that both of the pods perform the same functionality. ( this support ticket was raised due to the detection of high restart count of the aks-link pod in the prod cluster by our internal monitoring)

The pods that you will see now listed as aks-link-### are what we used to know as tunnel front pods.

In previous months AKS upgraded its old tunnel Front to this new version of the tunnel, AKS uses this VPN tunnel facing from the customer cluster to the customer control-plane to provide a route from API server to customer nodes, which is required by a few Kubernetes features i.e. getting logs, port forwarding, etc.

In addition to that the support ticket stated that AKS is responsible in managing such components and have no public document on this.

I am very sorry about the delay in my response.
Unfortunately, there is no public documentation about this change since this is part of the cluster managed components and a custom version of OpenVPN, and this is just an upgrade of the existing tunnel front, in our documentation you will see vague references as "networking tunnel", for example https://docs.microsoft.com/en-us/azure/aks/support-policies#aks-support-coverage-for-worker-nodes, but nothing deep as Azure is completely accountable for those managed components
You can read about the openVPN project here: https://github.com/OpenVPN/openvpn

https://docs.microsoft.com/en-us/azure/aks/support-policies#aks-support-coverage-for-worker-nodes https://github.com/OpenVPN/openvpn

Recently a production outage was encountered due to the aks-link pod in pending state. Support issue 120101323000340. The root cause for the outage as per our understanding was the inability to talk to AKS API server by ingress controller pods to get the service endpoints, which returned 503 BadGateway. Azure AKS support was unable to give a detailed technical RCA on this

During the support call we were asked to restart the aks-link pod ( which is a managed addon by AKS). The deletion of the pending aks-link pod made the pod running and the issues related to kubectl logs was sorted. But the ingress controller pods returned 503 for the AKS customer workload services. After getting the AKS API server restarted via the Azure support the 503 issue was sorted and all the public facing services were up on our external monitoring system. Total downtime of 1 hour.

Currently also we observe restarts in prod aks-link pod

aks-link-7689f76f89-x78pb                    2/2     Running   1          14d

In non production tunnel front no such behaviour is observed.

tunnelfront-566bf9456c-ls2p6                 1/1     Running   0          35d

aks-link pod previous logs

❯ kubectl logs -p aks-link-7689f76f89-x78pb -n kube-system -c openvpn-client
Tue Oct 13 03:25:22 2020 WARNING: file '/etc/openvpn/certs/client.key' is group or others accessible
Tue Oct 13 03:25:22 2020 OpenVPN 2.4.4 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on May 14 2019
Tue Oct 13 03:25:22 2020 library versions: OpenSSL 1.1.1  11 Sep 2018, LZO 2.08
Tue Oct 13 03:25:22 2020 TCP/UDP: Preserving recently used remote address: [AF_INET]xx.xxx.xx.xx:1194
Tue Oct 13 03:25:22 2020 UDP link local: (not bound)
Tue Oct 13 03:25:22 2020 UDP link remote: [AF_INET]xx.xxx.xx.xx:1194
Tue Oct 13 03:25:22 2020 NOTE: UID/GID downgrade will be delayed because of --client, --pull, or --up-delay
Tue Oct 13 03:25:22 2020 [openvpn-server.5ea01a294b1ea30001e85aec] Peer Connection Initiated with [AF_INET]xx.xxx.xx.xx:1194
Tue Oct 13 03:25:23 2020 TUN/TAP device tun0 opened
Tue Oct 13 03:25:23 2020 do_ifconfig, tt->did_ifconfig_ipv6_setup=0
Tue Oct 13 03:25:23 2020 /sbin/ip link set dev tun0 up mtu 1500
Tue Oct 13 03:25:23 2020 /sbin/ip addr add dev tun0 192.0.2.2/24 broadcast 192.0.2.255
Tue Oct 13 03:25:23 2020 GID set to nogroup
Tue Oct 13 03:25:23 2020 UID set to nobody
Tue Oct 13 03:25:23 2020 WARNING: this configuration may cache passwords in memory -- use the auth-nocache option to prevent this
Tue Oct 13 03:25:23 2020 Initialization Sequence Completed
Tue Oct 13 19:47:53 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #88673 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Tue Oct 13 19:47:53 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #88674 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Sat Oct 17 01:36:14 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #37960 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Sat Oct 17 01:36:14 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #37961 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Sat Oct 17 01:36:14 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #37962 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Sat Oct 17 01:36:14 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #37963 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Sat Oct 17 01:36:14 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #37964 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Sat Oct 17 01:36:14 2020 AEAD Decrypt error: bad packet ID (may be a replay): [ #37965 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-
warnings
Thu Oct 22 21:35:14 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:35:24 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:35:34 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:35:44 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:35:54 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:36:04 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:36:14 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:36:24 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:36:34 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:36:44 2020 TLS Error: local/remote TLS keys are out of sync: [AF_INET]xx.xxx.xx.xx:1194 [0]
Thu Oct 22 21:36:48 2020 [openvpn-server.5ea01a294b1ea30001e85aec] Inactivity timeout (--ping-restart), restarting
Thu Oct 22 21:36:48 2020 SIGUSR1[soft,ping-restart] received, process restarting
Thu Oct 22 21:36:50 2020 TCP/UDP: Preserving recently used remote address: [AF_INET]xx.xxx.xx.xx:1194
Thu Oct 22 21:36:50 2020 UDP link local: (not bound)
Thu Oct 22 21:36:50 2020 UDP link remote: [AF_INET]xx.xxx.xx.xx:1194
Thu Oct 22 21:36:50 2020 [openvpn-server.5ea01a294b1ea30001e85aec] Peer Connection Initiated with [AF_INET]xx.xxx.xx.xx:1194
Thu Oct 22 21:36:51 2020 Preserving previous TUN/TAP instance: tun0
Thu Oct 22 21:36:51 2020 NOTE: Pulled options changed on restart, will need to close and reopen TUN/TAP device.
Thu Oct 22 21:36:51 2020 /sbin/ip addr del dev tun0 192.0.2.2/24
RTNETLINK answers: Operation not permitted
Thu Oct 22 21:36:51 2020 Linux ip addr del failed: external program exited with error status: 2
Thu Oct 22 21:36:52 2020 ERROR: Cannot open TUN/TAP dev /dev/net/tun: Permission denied (errno=13)
Thu Oct 22 21:36:52 2020 Exiting due to fatal error

What you expected to happen: Appreciate to have following questions answered ( which Azure support is unable to answer )

In AKS what is the reason for having two types of pods to provide the same functionality ( In non-prod we see tunnel pods where as in prod we see aks-link pods ) ?
How is the high availability of aks-link pod is ensured in AKS ?
How is the failure in aks-link pod related to Uptime SLA of AKS ?

Overally high availability for aks-link/tunnel pods run in HA mode and auto recovery from failures (which is natively supported by Kubernetes).

How to reproduce it (as minimally and precisely as possible): Restart is observed often, the pending state was observed once and downtime was observed during that time.

Anything else we need to know?: N/A

Environment:

Kubernetes version (use kubectl version):

❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T18:49:28Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.13", GitCommit:"37c06f456fdb4d25e402b5fbcb72cd6a77a021a9", GitTreeState:"clean", BuildDate:"2020-09-18T21:59:14Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Size of cluster (how many worker nodes are in the cluster?) 4 ( Ds8 instance family )
General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.) Ballerina, Python, Go

ghost commented 4 years ago

Hi dhananjaya-senanayake, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such: 1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster. 2) Please abide by the AKS repo Guidelines and Code of Conduct. 3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics? 4) Make sure your subscribed to the AKS Release Notes to keep up to date with all that's new on AKS. 5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue. 6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

mboutet commented 4 years ago

I also observe frequent restarts of the aks-link pod, especially since this morning. I didn't experience an outage yet, but 1 or 2 times per hour, I have >30 events related to the HPA not being able to query the metrics API (resource metrics + external metrics) thus preventing proper autoscaling of my workloads. Immediately after these events, I see the aks-link pod being restarted.

This cluster has been running fine for the last few weeks and nothing has been modified on it.

For instance:

unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

unable to get external metric prod/rabbitmq-_batching-job_types-v1/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: batching,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get rabbitmq-_batching-job_types-v1.external.metrics.k8s.io)

Cluster is in eastus and has SLA enabled.

Versions:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T21:51:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T18:49:11Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}

slynickel commented 4 years ago

We have a number of AKS clusters (more then 30) throughout Azure regions in the US and Europe. We are transitioning some of them to use SLA and are experiencing similar issues on the clusters where we have SLA turned on.

We can put rough numbers to this and are noticing patterns of restarts across our clusters. We have been experiencing it for a number of days. Y-axis is openvpn-client container restarts (as reported by kube-state-metrics deployment in each AKS cluster), X axis is central time, different colors are different AKS clusters. Between 9/28-10/22 (~25 days) we saw fewer then 10 open-vpnclient container restarts across ~20-30 clusters (we actively transitioned to use SLA in some clusters). From 10/22-10/27 (5 days) we have seen ~120 container restarts. Seems a bit sporadic, some clusters have only seen one or two, some see 10 or 15 over a few hours. ~7 clusters have seen more than 10 each.

It seems like there might have been a version update on 10/22? Many of the clusters saw a single pod restart around the same time.

All of our log lines match OPs more less. We don't always see the AEAD and TLS errors though. Sometimes it's just the Inactivity timeout (--ping-restart), restarting to Exiting due to fatal error lines.

slynickel commented 4 years ago

Last thing I'll add. I'm making this an explicit comment rather then editing my last post since it could useful for others. We correlated those restarts in my last post by Azure region. All are eastus or eastus2 clusters have has seen more then 7 openvpn-client container restarts in the last few days (25% of our SLA enabled clusters in eastus/eastus2). While other azure regions (westus, centralus, northcentralus, westeurope) have all seen less then 7 per cluster.

Not sure if that's other's experience but figured it would be beneficial to point out what we are seeing.

yangl900 commented 4 years ago

@slynickel @mboutet @dhananjaya-senanayake thanks for reporting the issue! I'm looking into this.

First I'd like to clarify the functionality of tunnel-front / aks-link. As the support engineer mentioned, the tunnel is used for the hosted API server talk to the worker node that lives in your private network. You will need it when you do kubectl logs or kubectl top node (as well as HPA). It does NOT affect anything in your cluster talk to API server (e.g. ingress controller). Any pod that needs to talk to API server will hit the API sever public IP directly. So @dhananjaya-senanayake the ingress controller not able to talk to API server is a different issue. I'll try to find the support ticket and get back a root cause.

Second question is why we have both aks-link and tunnel-front? Short answer is that it's in a transition. aks-link is based on VPN and utilize udp port, it's better architecture and newer. At this point it is only deployed in clusters with SLA enabled but the plan is that every cluster eventually will get consistent tunnel solution. There won't be 2 in the long run.

The aks-link do get updated as part of AKS release some time, so during the release the pod will be re-created if needed. We have been evaluating making aks-link HA as well to minimize the impact. cc: @jveski

mboutet commented 4 years ago

I went to the service health of my cluster on the portal and saw that there was an event this morning: The time of this event and the time at which my problems began (HPA not being able to get metrics, then followed by a restart of the aks-link pod) are approximately the same.

palma21 commented 4 years ago

@mboutet that is very not related to the previous comments. Happy to tackle it on a separate issue or support ticket. That specific event right there, means one of your nodes stopped responding (kubelet stopped posting live status to the API) If that happened or if the node had an issue k8s might restart or move your pods.

@slynickel we can definitely investigate, anything more than 2/3 per week is not common at would be good for us to investigate, if you have a ticket please do provide it and we'll jump into it. Anything below that might just be an update or restart/move by k8s which is one of the reasons we do want to provide this feature request soon.

dhananjaya94 commented 4 years ago

@slynickel @mboutet @dhananjaya-senanayake thanks for reporting the issue! I'm looking into this.

First I'd like to clarify the functionality of tunnel-front / aks-link. As the support engineer mentioned, the tunnel is used for the hosted API server talk to the worker node that lives in your private network. You will need it when you do kubectl logs or kubectl top node (as well as HPA). It does NOT affect anything in your cluster talk to API server (e.g. ingress controller). Any pod that needs to talk to API server will hit the API sever public IP directly. So @dhananjaya-senanayake the ingress controller not able to talk to API server is a different issue. I'll try to find the support ticket and get back a root cause.

To be more specific on this ingress related issue, only the ingresses with annotation https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#service-upstream is affected. Other services exposed through ingress without that annotation worked fine, no down alerts were received.

yangl900 commented 4 years ago

@dhananjaya-senanayake thanks for the information, that was very useful. I think I have found something that might explain the outage you experienced. We are still looking to confirm. Did it start from UTC time 2020-10-12T18:40:00Z?

dhananjaya94 commented 4 years ago

@yangl900 following is the report of downtime from our external monitoring system. Its in PDT.

the aks-link pod was in pending state more than 7 hours AFIR, when observed during the downtime. There was another aks-link pod which was running parallel ( old one in running state and the new pod in pending state ). Only after deleting the pending pod kubectl logs started to work. It did not resolve the 502 issue with ingress controller ( SSL handshake happened). After restarting the AKS API server by the Azure support 502 issue was mitigated. Unfortunately we could not gather any screenshots on the pod states neither aks-link pods during that time.

ohorvath commented 4 years ago

We have similar issues since last Wednesday. We've created at least 50 AKS clusters and all of them had issues with API access, frequent timeouts for kubectl, webhooks etc. We figured out that only clusters with uptime SLA had these issues, free tiers are not affected. All of our clusters had tunnelfront and not aks-link and sometimes tunnelfront restart solved the problem for a few minutes. It's so weird that SLA clusters are way more unreliable than free tiers.

slynickel commented 4 years ago

@palma21

we can definitely investigate, anything more than 2/3 per week is not common at would be good for us to investigate, if you have a ticket please do provide it and we'll jump into it. Anything below that might just be an update or restart/move by k8s which is one of the reasons we do want to provide this feature request soon.

Ticket: 2010270040011710

palma21 commented 4 years ago

@ohorvath are you saying your SLA clusters have tunnelfront? Do you happen to have any firewall/NVA in front on your clusters?

palma21 commented 4 years ago

@slynickel thanks, let me track it internally and communicate there.

ohorvath commented 4 years ago

@palma21 Yes, that's correct. I've created tons of clusters an all of them have tunnelfront and most of the time it's very unstable. API calls take forever, tiller deployments failing etc. We also have a case open. No firewall, no UDRs.

"sku": { "name": "Basic", "tier": "paid" },

tunnelfront-76569dd77f-qwq2j 1/1 Running 0 145m

palma21 commented 4 years ago

Ah, Thanks! We also just got your ticket from support. We know what happened, it's a bit different than this case and it was a bug on our side that was fixed last week, where your cluster didn't actually enabled the SLA feature correctly so it remained with tunnelfront which causes immediate instability since tunnelfront is not compatible with the the new control plane architecture that SLA uses (we're migrating all free clusters to it as well gradually as @yangl900 explained above) your support contact will reach out but your specific issue is fixed and we can even correct any clusters you might have with SLA and tunnelfront (just pass them o the the support engineer.

ohorvath commented 4 years ago

@palma21 Thanks. Just FYI, I've created a few clusters today and still have tunnelfront with uptime SLA. :) I'm not sure if the fix already reached all regions.

yangl900 commented 4 years ago

hi @ohorvath the fix is not rolled out to all regions yet. Please correct the creation template you use to set tier to "Paid" instead of "paid". The bug is a bit embarrassing, but that is the cause.

ohorvath commented 4 years ago

Thanks for the suggestion @yangl900 , but it didn't work.

Template:

"sku": { "name": "Basic", "tier": "Paid" }

AKS Cluster:

"sku": { "name": "Basic", "tier": "Free" }

It seems whether we provide 'paid' or 'Paid' in the ARM template, Azure creates a free cluster now.

robbiezhang commented 3 years ago

@ohorvath , can you make sure the sku property is at the top level of the managedCluster? It should be at the same level as "name", "location", "properties".