[IMPORTANT] AKS PG Requesting Customer Insights: Feedback on AKS Troubleshooting

julia-yin commented 1 month ago

The AKS Product Group is seeking customer feedback to improve the AKS troubleshooting experience. Our goal is to understand how customers troubleshoot today and to identify the common challenges and pain points in that experience. We would love to hear from you in this thread about the following:

Your current troubleshooting methods and tools, especially for complex problems (e.g., node or networking issues)
Major pain points or frustrations with the experience today
Any suggestions for improvements

Your input is crucial in making our troubleshooting offerings better for everyone. Thank you for your valuable feedback!

Best, Julia Yin Product Manager on AKS

PixelRobots commented 1 month ago

Today I was troubleshooting a potential DNS issue within a cluster. I ended up using a debug pod to test some stuff.

It would be nice if it were easier to create a debug container from the Azure portal, using an Azure-managed debug image that ships with tools for troubleshooting basic issues. Kind of like the run command.

This could mean users without much kubectl knowledge could troubleshoot straight from within Azure.
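
For readers unfamiliar with the debug pod approach mentioned above, here is a minimal sketch of what such a pod could look like. The pod name `dns-debug`, the `busybox:1.36` image, and the `default` namespace are assumptions chosen for illustration, not what was actually used here. The script only builds and prints a manifest, which could then be piped to `kubectl apply -f -`.

```python
import json


def dns_debug_pod(name="dns-debug", image="busybox:1.36", namespace="default"):
    """Build a throwaway Pod manifest that idles so you can exec into it.

    All default values are illustrative assumptions, not AKS-provided ones.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Do not restart the pod once the sleep finishes.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "debug",
                    "image": image,
                    # Keep the container alive for an hour of interactive use.
                    "command": ["sleep", "3600"],
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(dns_debug_pod(), indent=2))
```

Once the pod is running, something like `kubectl exec -it dns-debug -- nslookup kubernetes.default` would exercise cluster DNS from inside the pod.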

julia-yin commented 1 month ago

Hi @PixelRobots, thank you for sharing feedback! I have a few questions if you don't mind elaborating:

Where do you typically start the troubleshooting process, in the CLI or Portal? Do you prefer troubleshooting in one place or another and why?
What does your debug pod look like, and what steps did you take with it to debug the DNS issue in your cluster?
Are there any examples of things you find frustrating or confusing about the current CLI experience (kubectl)?

ma-ts commented 1 month ago

Hi @julia-yin, thanks so much!

We're a heavy AKS user, also working with many private and public preview features. To answer your questions:

We make heavy use of debug pods, either sharing the pod memory, or scheduled on the node itself, depending on what the issue is. We schedule these through regular Kubernetes API actions (so kubectl).
For networking issues it depends a bit on what the issue is: if the issue is within the cluster, we use Hubble to observe network traffic. If it is outside of the cluster, we can use the observability we have available from our internal Checkpoint firewalls. We don't use Azure-native network observability for this.
For DNS issues from AKS we almost always just enable debug logging for CoreDNS, and that gets us quite far (and often we combine it with Hubble for the network paths).
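
The two debug pod styles described above can be sketched roughly as follows, assuming that "sharing the pod" refers to an ephemeral container attached to a running pod and that "scheduled on the node itself" refers to `kubectl debug node/...`. The image, pod, container, and node names are hypothetical placeholders. The helpers only build the command lines; they do not run anything.

```python
import shlex


def node_debug_cmd(node, image="busybox:1.36"):
    """Command for a debug pod on the node itself.

    kubectl runs this pod in the node's host namespaces and mounts the
    node filesystem under /host.
    """
    return ["kubectl", "debug", f"node/{node}", "-it", f"--image={image}"]


def pod_debug_cmd(pod, target_container, namespace="default", image="busybox:1.36"):
    """Command for an ephemeral container attached to an existing pod.

    --target makes the debug container share the process namespace of the
    named container, so its processes are visible from the debug shell.
    """
    return [
        "kubectl", "debug", "-it", pod,
        "-n", namespace,
        f"--image={image}",
        f"--target={target_container}",
    ]


if __name__ == "__main__":
    # Hypothetical node, pod, and container names.
    print(shlex.join(node_debug_cmd("my-node-name")))
    print(shlex.join(pod_debug_cmd("my-app-pod", "app")))
```

Both variants go through the Kubernetes API, so neither helps when the API server itself is unreachable.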

There are a couple of things that we are missing and that make troubleshooting difficult right now:

When the Kubernetes API server is unavailable (which sometimes happens for the workloads that are a part of the API Server VNet Integration Public Preview), it is almost impossible to get information out of the nodes (because we cannot schedule a debug pod on the node). The only way is to use the run command functionality of the Virtual Machine Scale Set. It would be very helpful if there was a way to interactively work with the nodes from the Azure control plane (given that you have the right authorization).
We cannot see the traffic / logs of the Kubernetes API server or of etcd at all. The only way to get these is to enable the Diagnostic Logs; however, these are very delayed, which often makes it hard to debug issues when they arise.
It is impossible for us to see any logs about how the Azure resources are provisioned internally. For instance, we recently had a weird issue that resulted in us not being able to delete clusters (`"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"InternalOperationError\",\"message\":\"PrivateConnectCleanupReconciler retry timed out: %!w()\"}]}}"`). We actually needed a support engineer who had access to additional logs in order to see what was actually going on (and unblock the deletes).
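
As a side note on reading errors like the one quoted above: the `statusMessage` value is itself a JSON document encoded as a string, so it has to be decoded twice before the innermost error becomes readable. A minimal sketch, assuming the fragment is wrapped in braces to form a complete JSON object (the wrapping and the helper name `innermost_errors` are illustrative and not part of any Azure tooling):

```python
import json

# The fragment from the comment above, wrapped in {} so it parses on its own.
RAW_OPERATION = r'''{"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"InternalOperationError\",\"message\":\"PrivateConnectCleanupReconciler retry timed out: %!w()\"}]}}"}'''


def innermost_errors(raw):
    """Return (code, message) pairs from the nested error details."""
    outer = json.loads(raw)                         # first decode: the envelope
    inner = json.loads(outer["statusMessage"])      # second decode: the embedded JSON string
    error = inner.get("error", {})
    # Fall back to the top-level error when no details are attached.
    details = error.get("details") or [error]
    return [(d.get("code"), d.get("message")) for d in details]


if __name__ == "__main__":
    for code, message in innermost_errors(RAW_OPERATION):
        print(f"{code}: {message}")
```

This only surfaces what the API already returned. As the comment points out, the underlying provisioning logs are still not visible to customers.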

julia-yin commented 1 month ago

Hi @ma-ts, really appreciate the detailed explanation of your current troubleshooting methods and feedback. Some further questions for you if you don't mind elaborating:

Your team uses multiple methods for debugging various types of issues, such as node, networking, and DNS issues. Once an issue is detected within your AKS cluster, how does your team proceed to narrow down possible sources and determine the tools needed?
Does your team also conduct network monitoring, such as using the Hubble metrics and monitoring functionality to store Hubble data?
How does your team typically prefer to view and debug logs (e.g., for DNS)?

JoeyC-Dev commented 1 month ago

Presenting syslog together with the core pod logs (calico, ip-masq, etc.) would be useful, especially when facing network issues under a strict networking policy (e.g., a Calico port block, mcr.microsoft.com not being allowed, an expired API server certificate, an expired SP, etc.). Basically, all of these kinds of errors leave a trace in syslog or the related core logs. It is not possible to categorize every possible root cause for customers, but gathering these logs on one page would point them in the right direction.

Azure / AKS

[IMPORTANT] AKS PG Requesting Customer Insights: Feedback on AKS Troubleshooting #4206