Hi dhananjaya-senanayake, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
I also observe frequent restarts of the aks-link pod, especially since this morning. I haven't experienced an outage yet, but 1 or 2 times per hour I get >30 events related to the HPA not being able to query the metrics API (resource metrics + external metrics), thus preventing proper autoscaling of my workloads. Immediately after these events, I see the aks-link pod being restarted.
This cluster has been running fine for the last few weeks and nothing has been modified on it.
For instance:
unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
unable to get external metric prod/rabbitmq-_batching-job_types-v1/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: batching,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get rabbitmq-_batching-job_types-v1.external.metrics.k8s.io)
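As a side note, a quick way to check whether the aggregated metrics APIs behind these errors are actually unavailable (rather than the HPA itself misbehaving) is to look at the APIService status. Just a rough sketch; the external metrics group only exists if you run a metrics adapter such as KEDA, and "prod" below is simply the namespace from the error above:

```
# The HPA errors point at pods.metrics.k8s.io and *.external.metrics.k8s.io;
# check whether the API server marks those aggregated APIs as Available.
kubectl get apiservices | grep metrics.k8s.io

# Spot-check the resource metrics path the HPA uses.
kubectl top pods -n prod
```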
Cluster is in eastus and has SLA enabled.
Versions:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T21:51:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T18:49:11Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
We have a number of AKS clusters (more than 30) across Azure regions in the US and Europe. We are transitioning some of them to use the SLA and are experiencing similar issues on the clusters where we have the SLA turned on.
We can put rough numbers to this and are noticing patterns of restarts across our clusters. We have been experiencing it for a number of days. The Y-axis is openvpn-client container restarts (as reported by the kube-state-metrics deployment in each AKS cluster), the X-axis is Central Time, and the different colors are different AKS clusters. Between 9/28-10/22 (~25 days) we saw fewer than 10 openvpn-client container restarts across ~20-30 clusters (we actively transitioned some clusters to use the SLA). From 10/22-10/27 (5 days) we have seen ~120 container restarts. It seems a bit sporadic: some clusters have only seen one or two, some see 10 or 15 over a few hours. ~7 clusters have seen more than 10 each.
It seems like there might have been a version update on 10/22? Many of the clusters saw a single pod restart around the same time.
All of our log lines match the OP's more or less. We don't always see the AEAD and TLS errors, though. Sometimes it's just the "Inactivity timeout (--ping-restart), restarting" to "Exiting due to fatal error" lines.
Last thing I'll add. I'm making this an explicit comment rather than editing my last post since it could be useful for others. We correlated the restarts in my last post by Azure region. All clusters that have seen more than 7 openvpn-client container restarts in the last few days are eastus or eastus2 clusters (25% of our SLA-enabled clusters in eastus/eastus2), while the other Azure regions (westus, centralus, northcentralus, westeurope) have all seen fewer than 7 per cluster.
Not sure if that's others' experience, but figured it would be beneficial to point out what we are seeing.
@slynickel @mboutet @dhananjaya-senanayake thanks for reporting the issue! I'm looking into this.
First I'd like to clarify the functionality of tunnel-front / aks-link. As the support engineer mentioned, the tunnel is used by the hosted API server to talk to the worker nodes that live in your private network. You will need it when you run kubectl logs or kubectl top node (as well as for HPA). It does NOT affect anything in your cluster talking to the API server (e.g. the ingress controller). Any pod that needs to talk to the API server hits the API server public IP directly. So @dhananjaya-senanayake, the ingress controller not being able to talk to the API server is a different issue. I'll try to find the support ticket and get back with a root cause.
The second question is why we have both aks-link and tunnel-front. The short answer is that it's a transition: aks-link is based on VPN and uses a UDP port; it's the newer and better architecture. At this point it is only deployed in clusters with SLA enabled, but the plan is that every cluster will eventually get a consistent tunnel solution. There won't be two in the long run.
aks-link does get updated as part of an AKS release sometimes, so during a release the pod will be re-created if needed. We have also been evaluating making aks-link HA to minimize the impact. cc: @jveski
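If you want to check which tunnel component your cluster is running and whether it has been restarting, something like the following works. Names here are illustrative; the exact pod, label, and container names can vary between AKS releases:

```
# Show the tunnel pod(s) the cluster runs (tunnelfront or aks-link) and their restart counts.
kubectl get pods -n kube-system -o wide | grep -E 'tunnelfront|aks-link'

# After a restart, look at the previous container's logs
# (replace the pod name; openvpn-client is the tunnel container mentioned earlier in this thread).
kubectl logs -n kube-system <aks-link-pod-name> -c openvpn-client --previous
```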
I went to the service health of my cluster in the portal and saw that there was an event this morning. The time of this event and the time at which my problems began (the HPA not being able to get metrics, followed by a restart of the aks-link pod) are approximately the same.
@mboutet that is not related to the previous comments. Happy to tackle it in a separate issue or support ticket. That specific event right there means one of your nodes stopped responding (the kubelet stopped posting live status to the API server). If that happened, or if the node had an issue, k8s might restart or move your pods.
@slynickel we can definitely investigate. Anything more than 2-3 per week is not common and would be good for us to investigate; if you have a ticket please do provide it and we'll jump into it. Anything below that might just be an update or a restart/move by k8s, which is one of the reasons we do want to provide this feature request soon.
To be more specific on the ingress-related issue @yangl900 mentioned above (the ingress controller not being able to talk to the API server): only the ingresses with the annotation https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#service-upstream were affected. Other services exposed through ingresses without that annotation worked fine; no down alerts were received.
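In case it helps others scope the impact, here's a rough way to list the ingresses that carry that annotation (assumes jq is available):

```
# Only ingresses with the service-upstream annotation were affected; list them by namespace/name.
kubectl get ingress --all-namespaces -o json \
  | jq -r '.items[]
           | select(.metadata.annotations["nginx.ingress.kubernetes.io/service-upstream"] == "true")
           | "\(.metadata.namespace)/\(.metadata.name)"'
```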
@dhananjaya-senanayake thanks for the information, that was very useful. I think I have found something that might explain the outage you experienced. We are still looking to confirm. Did it start from UTC time 2020-10-12T18:40:00Z?
@yangl900 the following is the report of downtime from our external monitoring system. It's in PDT.
The aks-link pod was in a Pending state for more than 7 hours, AFAIR, when observed during the downtime. There was another aks-link pod running in parallel (the old one in the Running state and the new pod in Pending). Only after deleting the pending pod did kubectl logs start to work. It did not resolve the 502 issue with the ingress controller (the SSL handshake happened). After the AKS API server was restarted by Azure support, the 502 issue was mitigated. Unfortunately we could not gather any screenshots of the pod states or the aks-link pods during that time.
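For reference, the manual workaround during the incident boiled down to something like the following (the pod name is a placeholder); it only restored kubectl logs, not the ingress 502s, which needed the API server restart by support:

```
# Find the stuck aks-link pod (one Running, one Pending in our case).
kubectl get pods -n kube-system | grep aks-link

# Delete the Pending pod so the deployment recreates it.
kubectl delete pod -n kube-system <pending-aks-link-pod-name>
```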
We have had similar issues since last Wednesday. We've created at least 50 AKS clusters and all of them had issues with API access: frequent timeouts for kubectl, webhooks, etc. We figured out that only clusters with the uptime SLA had these issues; free tiers are not affected. All of our clusters had tunnelfront and not aks-link, and sometimes a tunnelfront restart solved the problem for a few minutes. It's so weird that SLA clusters are way more unreliable than free tiers.
@palma21, regarding "if you have a ticket please do provide it and we'll jump into it":
Ticket: 2010270040011710
@ohorvath are you saying your SLA clusters have tunnelfront? Do you happen to have any firewall/NVA in front of your clusters?
@slynickel thanks, let me track it internally and communicate there.
@palma21 Yes, that's correct. I've created tons of clusters and all of them have tunnelfront, and most of the time it's very unstable. API calls take forever, tiller deployments fail, etc. We also have a case open. No firewall, no UDRs.
"sku": { "name": "Basic", "tier": "paid" },
tunnelfront-76569dd77f-qwq2j 1/1 Running 0 145m
Ah, thanks! We also just got your ticket from support. We know what happened; it's a bit different from this case. It was a bug on our side that was fixed last week, where your cluster didn't actually enable the SLA feature correctly, so it remained with tunnelfront, which causes immediate instability since tunnelfront is not compatible with the new control plane architecture that the SLA uses (we're gradually migrating all free clusters to it as well, as @yangl900 explained above). Your support contact will reach out, but your specific issue is fixed, and we can even correct any clusters you might have with SLA and tunnelfront (just pass them on to the support engineer).
@palma21 Thanks. Just FYI, I've created a few clusters today and still have tunnelfront with uptime SLA. :) I'm not sure if the fix already reached all regions.
Hi @ohorvath, the fix has not rolled out to all regions yet. Please correct the creation template you use to set the tier to "Paid" instead of "paid". The bug is a bit embarrassing, but that is the cause.
Thanks for the suggestion @yangl900 , but it didn't work.
Template:
"sku": { "name": "Basic", "tier": "Paid" }
AKS Cluster:
"sku": { "name": "Basic", "tier": "Free" }
It seems whether we provide 'paid' or 'Paid' in the ARM template, Azure creates a free cluster now.
@ohorvath , can you make sure the sku property is at the top level of the managedCluster? It should be at the same level as "name", "location", "properties".
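For illustration, a minimal sketch of where the sku block is expected to sit on the managedCluster resource (the apiVersion and names below are just examples; keep whatever your template already uses, and properties are omitted for brevity):

```json
{
  "type": "Microsoft.ContainerService/managedClusters",
  "apiVersion": "2020-09-01",
  "name": "my-aks-cluster",
  "location": "eastus",
  "sku": {
    "name": "Basic",
    "tier": "Paid"
  },
  "properties": {}
}
```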
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
@ohorvath are you still experiencing the issue?
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Thanks for reaching out. I'm closing this issue as it was marked with "Fix released" and it hasn't had activity for 7 days.
What happened: We have 2 AKS clusters running, one used for non-production and the other for production.
In the non-production cluster there is a pod named tunnelfront in the kube-system namespace; in the production cluster there is a pod named aks-link in the kube-system namespace. From support ticket 120073023000987 we got to know that both pods perform the same functionality (this support ticket was raised after our internal monitoring detected a high restart count of the aks-link pod in the prod cluster). In addition, the support ticket stated that AKS is responsible for managing such components and that there is no public documentation on this.
https://docs.microsoft.com/en-us/azure/aks/support-policies#aks-support-coverage-for-worker-nodes https://github.com/OpenVPN/openvpn
Recently a production outage was encountered due to the aks-link pod being in a Pending state (support issue 120101323000340). The root cause of the outage, as per our understanding, was that the ingress controller pods were unable to talk to the AKS API server to get the service endpoints, which returned 503 Bad Gateway. Azure AKS support was unable to give a detailed technical RCA on this. During the support call we were asked to restart the aks-link pod (which is a managed add-on by AKS). Deleting the pending aks-link pod brought the pod to Running and the issues related to kubectl logs were sorted out. But the ingress controller pods still returned 503 for the AKS customer workload services. After getting the AKS API server restarted via Azure support, the 503 issue was resolved and all the public-facing services were up on our external monitoring system. Total downtime was about 1 hour.
Currently we also observe restarts of the prod aks-link pod. No such behaviour is observed with tunnelfront in non-production.
aks-link pod previous logs
What you expected to happen: We would appreciate having the following addressed (which Azure support was unable to answer):
Overall, high availability for the aks-link/tunnel pods (running in HA mode) and automatic recovery from failures (which is natively supported by Kubernetes).
How to reproduce it (as minimally and precisely as possible): Restarts are observed often; the Pending state was observed once, and downtime was observed during that time.
Anything else we need to know?: N/A
Environment:
Kubernetes version (use kubectl version):
Related Issues: https://github.com/Azure/AKS/issues/727 https://github.com/Azure/AKS/issues/1603