Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Is this ready for production? #727

Closed contessa-zoey closed 5 years ago

contessa-zoey commented 5 years ago

Good Morning,

I am eager to adopt k8s, but the issues in this repo are a little off-putting to me in terms of using AKS instead of hand rolling my own cluster. Are folks using this in production? Are many of these issues just the result of edge cases?

Sorry if I've asked this in the wrong place, I'm happy to ask elsewhere. I just want to make the best decision possible before rolling a cluster.

MarkTopping commented 5 years ago

Morning.

With apologies to Microsoft, I would suggest that you make this decision based on the uptime you require. If you need 99.99% availability from your cluster then no, it is not production ready. Over the past 6 months I've endured a few issues which required MS support and took time to resolve. Right now, as of this morning, I cannot deploy into one of my clusters due to a random issue.
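
For a sense of scale, a 99.99% target leaves only about 4.3 minutes of downtime in a 30-day month. A minimal sketch of that arithmetic follows; the helper name `downtime_minutes` and the 30-day month are assumptions for illustration, not anything AKS-specific:

```shell
# Allowed downtime, in minutes, for an availability target (in percent)
# over an assumed 30-day month.
downtime_minutes() {
  awk -v a="$1" 'BEGIN { printf "%.2f\n", (1 - a / 100) * 30 * 24 * 60 }'
}

downtime_minutes 99.99   # prints 4.32
downtime_minutes 99.9    # prints 43.20
```

A single support ticket that takes hours to resolve uses up that budget many times over.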

That said, there are lots of good points too, and if a little downtime, or an inability to administer your cluster for a while, is something you can absorb, then it might be worth going for it in the hope that these kinks are ironed out soon.

ghost commented 5 years ago

Unfortunately I need to agree with @MarkTopping. AKS is not production ready, and was in an even worse state when it went GA. AKS was nothing but a pain from the minute we adopted it. Multiple premier support tickets most of which went unanswered for days at a time, general failures and errors. AKS is a few kubernetes versions ahead of GKE at this point which makes me even more nervous. They haven't resolved major concerns that people have and they keep pushing forward with kubernetes versions. A lot of people will say just use ACS but AKS is using this behind the scenes and then you need to manage your own cluster. Even up to a month ago any user could essentially get cluster admin rights by running the --admin flag...

We made a sizable commitment to Azure and we have regretted it for a few months now. If you want a managed Kubernetes service, I suggest going with GKE. We made the migration and haven't been happier.

To be fair, I haven't used AKS in 2-3 months, but looking at this repo reaffirms the decision to move to GKE.

kvpt commented 5 years ago

Same here,

We created about ten clusters; none lasted more than 3 days before encountering different kinds of errors, and some clusters were even in error at creation time. AKS also lacks basic Kubernetes functionality like node pools, which have been postponed in favor of a focus on reliability. Moreover, the support we had was slow and completely useless.

Like many others, we migrated to GKE last month and have not encountered any problems on it.

It's really a shame because the other Azure products are good and we use them without problems. We will certainly try again later next year, but at the moment several issues are showstoppers.

Slater-Victoroff commented 5 years ago

We have been using AKS for quite some time, even before GA. AKS is in no way, shape or form ready for production and it probably never will be. Azure will constantly destroy your nodes in new and increasingly inventive ways without any kind of remorse. Recently our pause container inexplicably crashed, causing Kubernetes to try spinning up new copies forever as the pause container was no longer capable of communicating status. The response we got from support? "Oh yea, we know about that, it's an issue, your pause container will just crash sometimes. Sorry". And that came only after several days of hounding them while they assured us that any AKS issues must be on our side. This is very typical Azure support. They start by gaslighting you and telling you that any issue must be on your side, then take days or even weeks to respond, only to say that they've been aware of the issue for some time and simply haven't published any information on it.

During the past 12+ months we have reported outages to Azure as frequently as they have reported them to us. We're a 15 person company, and AKS not only has critical outages every couple of days, but also has such poor instrumentation and customer communication that they are both unable to identify these issues as they happen and unwilling to communicate them to customers as they occur.

DO NOT USE AKS UNDER ANY CIRCUMSTANCES

DenisBiondic commented 5 years ago

We have been using AKS for some time now, and acs-engine before that (which is still running in a bigger project without issues so far). Lately, AKS is working fine, but we had our share of outages as well (a DNS issue, nodes losing IP connectivity, master plane connectivity problems (the INTERNAL_ERROR issue) which impacted performance, etc.). To be honest, some of these were actually Kubernetes issues (like the DNS problem), but most issues came from the way master and worker planes communicate in AKS.

Main problem I see is that support was not really all that helpful, and response times were far too long. I get we can buy different levels of support, but in case of Premier support the feeling was also not much different.

Another issue is that, irrespective of the problem you have with AKS, the Azure status page is always green. I have no idea what they really monitor there.

However, instead of just complaining about issues, I would like to propose some measures that would go in the direction of restoring faith in the system.

  1. I would really be glad to see public post mortems of the AKS team.
  2. It would really be interesting to see the internal AKS development online somewhere (like the k8s SIG meetings for example).
  3. It would be great to gain insights on how SRE around AKS works on Microsoft side, what you do for monitoring of customer clusters (what are your SLIs / SLOs?).

ghost commented 5 years ago

@DenisBiondic I agree with most of what you said except the complaining part. This "issue" is more about @contessa-zoey wanting to know if AKS is Prod ready. In most people's opinion, it is not. People are simply venting their frustrations and identifying known issues. I don't think as users we need to tell Microsoft to meet their SLAs. I also don't believe we need to tell Microsoft what they need to do to fix AKS. They need to address the GitHub issues, meet their SLAs and be open with the community. Things they should already know.

They also have stats, I'm sure, which show either an increase or a decrease in AKS uptake. This isn't some small startup, it's a billion dollar company. GKE is a major selling point for Google Cloud, and I'm sure Microsoft wanted to compete in that space. Either they have a competitive product, or they do not. Right now, they do not.

DenisBiondic commented 5 years ago

@devkws yeah, I wrote that pretty fast. I didn't mean that people should not complain, but that it should not stop at complaints (e.g. Microsoft should do something about insights into SRE / uptime etc.) :) I have been hearing the complaints for quite some time, and not only here on GitHub...

erewok commented 5 years ago

We've been using AKS since the summer of 2018 and haven't had any of the issues mentioned in this repo aside from one upgrade failure which worked the second time it was applied (the upgrades are idempotent, I've been told).

@devkws when you say "in most people's opinion", you're referring to the small number of people who have commented here, right? I'll assume that because it's suspect and irresponsible to baldly state that in general "most people" think AKS isn't production ready without surveying directly the customers for this product.

In our case, we've been happy with it and have no plans of leaving.

In addition, I have been told by someone influential and central to the product that they have experienced huge growth ever since the product went GA. If that's the case, and if it truly isn't production-ready, I think you'd see a lot more complaints both here and elsewhere (including tech media). I'm sure there are some issues but I'd be willing to bet they're not fully representative of the product. Of course, I'm also unprepared to represent my opinions as representative without access to more information.

Lastly, I agree with every complaint about Azure support: it's utterly terrible and has never felt "worth it."

ghost commented 5 years ago

@erewok I am basing my statement on anecdotes, for sure. Though my sample size was quite large, considering I didn't meet a single person at DockerCon this year (when GA was announced) who suggested continuing with AKS, and that included Docker captains and people from various walks of life. We had already made a large investment in AKS at that point. At most Azure meetups where I live, I can't say I have found a single person outside of Azure MVPs who suggests using it.

My comments were a little tongue in cheek, but most people I speak to still suggest avoiding AKS. I am sure AKS works great for some people or it wouldn't be around, and I don't doubt that they had a large uptake after GA. Where they stand now? That I am interested in.

To me, when AKS works great for most or all people, that is when it will be Prod ready. Comparing the reliability of GKE to AKS, well they aren't comparable.

Just my opinion.

DenisBiondic commented 5 years ago

I am surprised there is still no response from Microsoft on this thread.

I don't know if this is because no one wants to answer (a polite answer like "please contribute by opening issues for problems you may have" would suffice for me), or because this issue list is reviewed very rarely. I would expect from the AKS engineers that all of the issues here are tracked, reviewed, tagged properly and closed as soon as possible... A sum of 250 open issues either gives the impression that the repository is not maintained with a lot of love, or that you actually have 250 open issues (which is actually not true, many issues are simply questions around k8s usage, there are many duplicates, etc.) ;)

seanmck commented 5 years ago

Lead PM for AKS here. Thank you for all the feedback. Let me make a few comments on how we’re addressing what’s been discussed in this thread.

  1. For the last few months, the AKS team has been primarily focused on reliability to ensure that we can deal with the tremendous growth of the service. Our own telemetry and plenty of conversations with customers give us confidence that this investment is paying off. There are certainly still issues, many of which are captured in this repo, but there are many large customers being successful with AKS on a daily basis.
  2. One of the consistent pain points listed throughout this thread has been poor customer support. This has also been a major focus for us in recent months, in terms of staffing and training support personnel, improving the diagnostic tools available for them, and streamlining the communication process with the AKS engineering team. As with reliability, our internal support KPIs and conversations with customers show that this is improving, but there is still a lot more that we can do.
  3. One of the specific complaints here concerns customer communication of issues and RCAs. We are committed to making this better going forward, in a few ways. First, you will see timely updates on the Azure Status page for widespread issues with the service, along with portal notifications for more targeted awareness. In addition, we intend to do a better job of providing RCAs for customer support issues to provide insight into what happened and to offer some confidence that it won’t reoccur. Finally, we are looking to make better use of this repo for tracking known issues, with regular updates so that users can stay up to date on mitigations and resolutions to common problems. That being said, I do want to be clear that the support provided in this repo is best-effort and that if you’re having a business-impacting issue with AKS, you should file a support ticket immediately.

Hope that helps explain where we are and how we’re addressing some of these issues. The team is hard at work making the service better every day, so if you haven’t tried AKS in a few months, I’d encourage you to give it another spin.

DenisBiondic commented 5 years ago

Where criticism is due, so is praise. We got an issue solved in very short time with help from Microsoft Support (in this case the Standard Support plan). I wrote some details here: https://blog.coffeeapplied.com/azure-aks-dial-tcp-i-o-timeout-errors-and-help-from-microsoft-3f18a4a9ebcf

ghost commented 5 years ago

I’m glad they resolved your issue. However, this sounds like another AKS bug that they know about and haven’t fixed yet, which is why they were able to resolve it so quickly.

Denying outbound traffic via nsg should not stop the master from talking to workers.

DenisBiondic commented 5 years ago

Are you sure the kubelet never reports to the master plane or initiates any connection? We essentially denied all outbound traffic (not only to the internet). The issue was impacting commands such as kubectl logs / run / port-forward.

ghost commented 5 years ago

I absolutely agree with your analysis. What I meant is that this shouldn’t be allowed or happening in the first place, and it sounds like a bug.

Part of being a managed service is that it shouldn’t be that easy to break a cluster. Denying outbound traffic via NSG “should” only stop internet traffic from worker nodes and not disrupt master and worker traffic.

DenisBiondic commented 5 years ago

Well, you can actually specify more detailed NSG rules; the issue was caused because we really denied all traffic to all destinations.

We didn't try denying everything except the Azure cloud in a specific region, for example, which would also be possible with a rule that uses a service tag.
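
A rough sketch of what such a rule could look like with the Azure CLI. The resource group, NSG name, rule name, region and priority below are all hypothetical placeholders, and the `run` wrapper is only there so the command is printed rather than executed by default. Whether this rule alone is enough to keep an AKS cluster healthy was not tested in this thread, so treat it as a starting point:

```shell
# Hypothetical names: replace with your own resource group, NSG, and region.
RG="my-aks-node-rg"
NSG="my-aks-nsg"
REGION="westeurope"

# Print the command instead of running it unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Allow outbound traffic to the regional AzureCloud service tag.
# Priority 100 is evaluated before a deny-all rule with a higher number.
allow_azure_cloud() {
  run az network nsg rule create \
    --resource-group "$RG" \
    --nsg-name "$NSG" \
    --name AllowAzureCloudOutbound \
    --priority 100 \
    --direction Outbound \
    --access Allow \
    --protocol '*' \
    --destination-address-prefixes "AzureCloud.${REGION}" \
    --destination-port-ranges '*'
}

allow_azure_cloud
```

NSG rules are evaluated in ascending priority order, so the allow rule needs a lower number than the deny-all rule for it to take effect.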

arashkaffamanesh commented 5 years ago

One simple but annoying issue, which prevents importing AKS clusters into Rancher, for instance, and which also shows up if we deploy Rancher on top of AKS, is that the controller manager and scheduler are shown as not healthy:

$ kubectl get componentstatus
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
etcd-0               Healthy     {"health": "true"}

And even a simple scale command doesn't work:

az aks scale --resource-group kafka-dev-test-rg --name k8s-dev-test --node-count 3 --nodepool-name agentpool
Deployment failed. 
...
 Status=404 Code="DeploymentNotFound" Message="Deployment 'xxx' could not be found."

Is that a GA'ed production ready solution?