Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Feature Request: User initiated restart of control plane #1506

Closed gsxmax closed 1 year ago

gsxmax commented 4 years ago

The priority is always to bring the cluster back to life, and having to open a support case just to restart the API server costs too much time. Could AKS support something users can trigger themselves (a simple restart) when it is needed?

foobarnum commented 4 years ago

We continually hit situations where a short-term failure in lower-level Azure infrastructure (CPU, RAM, disk, network) puts parts of the AKS control plane in a bad state that will not recover. An example would be the controller or scheduler being unable to manage the nodes due to an Azure network or DNS issue. Azure does not detect this class of issue reliably. We repeatedly end up with a cluster in a bad, unrecoverable state when we attempt an operation that requires the control plane to function, for example kubectl logs or helm deploy. Our runbook for these failures is two steps: a) contact Azure support, Priority 1, then b) tell them to restart the control plane. This works for all of these cases. For all of our sanity, it would be better if we could simply do that ourselves. This goes into the calculus when we choose how much effort to allocate to Azure generally.
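For readers who want to catch this state before a deploy fails, here is a minimal sketch of a watchdog that polls the Kubernetes API server's /readyz endpoint and flags when it is time to open the support case. It is only an illustration under stated assumptions: the function names, the failure threshold, and the idea that /readyz is reachable with the given CA file and no credentials are all hypothetical, and it only observes the API server, not the controller or scheduler problems described above.

```python
import ssl
import urllib.error
import urllib.request
from typing import Optional, Sequence


def probe_readyz(api_server_url: str, cafile: Optional[str] = None,
                 timeout: float = 5.0) -> bool:
    """Return True if GET <api_server_url>/readyz answers HTTP 200.

    Assumption: the endpoint can be reached without credentials. Many
    clusters require a bearer token or client certificate instead.
    """
    # cafile is the cluster CA bundle (hypothetical path supplied by the caller).
    context = ssl.create_default_context(cafile=cafile)
    try:
        with urllib.request.urlopen(f"{api_server_url}/readyz",
                                    timeout=timeout, context=context) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, TLS failures and non-2xx answers
        # are all treated as "not ready".
        return False


def needs_escalation(history: Sequence[bool],
                     max_consecutive_failures: int = 3) -> bool:
    """True when the last max_consecutive_failures probes all failed."""
    if max_consecutive_failures <= 0:
        raise ValueError("max_consecutive_failures must be positive")
    recent = list(history)[-max_consecutive_failures:]
    return len(recent) == max_consecutive_failures and not any(recent)
```

A caller would append each probe_readyz result to a list on a timer and start the runbook once needs_escalation returns True. Requiring several consecutive failures is a deliberate choice so that a single dropped request does not trigger a Priority 1 case.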

esunder commented 4 years ago

I'd like to see this feature as well. I resolved two issues yesterday in two different clusters by simply having Azure support restart the Kubernetes API server. If I could just have a button or API call to do this myself, it would save everyone some time.

djsly commented 4 years ago

Subscribing... The sad part is that it's not even Azure Support that issues the restart...

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

miwithro commented 3 years ago

@gsxmax @foobarnum @esunder @djsly have you upgraded to at least K8s 1.18? We have made significant strides in the API server stability. I would be curious if you are still seeing that post 1.18.

djsly commented 3 years ago

@miwithro we have been on 1.18 for a while now, and we still often hit cases where we would need to issue a rolling restart of the apiserver.

miwithro commented 3 years ago

@juan-lee

miwithro commented 3 years ago

@djsly have you enabled Uptime SLA for your clusters? This will enable a higher level of HA/DR for the API Server which will mitigate the availability issues you have seen traditionally.

djsly commented 3 years ago

Yes. In fact, Uptime SLA is one of the reasons we need to initiate control plane restarts: having multiple API server instances can cause cache discrepancies between the instances.


miwithro commented 3 years ago

Thanks @djsly that makes sense.

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

ghost commented 1 year ago

Issue needing attention of @Azure/aks-leads

ghost commented 1 year ago

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. gsxmax, feel free to comment again in the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.