Azure / AKS

Azure Kubernetes Service

[IMPORTANT] AKS PG Requesting Customer Insights: Feedback on AKS Troubleshooting #4206

Open julia-yin opened 1 month ago

julia-yin commented 1 month ago

The AKS Product Group is seeking customer feedback to improve the AKS troubleshooting experience. Our goal is to understand how customers troubleshoot today and to identify the common challenges and pain points with the current experience. We would love to hear from you in this thread regarding the following:

  1. Your current troubleshooting methods and tools, especially for complex problems (ex: node, networking issues)
  2. Major pain points or frustrations with the experience today
  3. Any suggestions for improvements

Your input is crucial in making our troubleshooting offerings better for everyone. Thank you for your valuable feedback!

Best, Julia Yin Product Manager on AKS

PixelRobots commented 1 month ago

Today I was troubleshooting a potential DNS issue within a cluster. I ended up using a debug pod to test some stuff.

It would be nice if it were easier to create a debug container from the Azure portal, using an Azure-managed debug image that ships with tools for troubleshooting basic issues. Kind of like the run command.

This could mean users without much kubectl knowledge could troubleshoot straight from within Azure.
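As an illustration of the kind of basic check a debug pod is often used for, here is a minimal Python sketch of a DNS resolution test. It is not the tooling PixelRobots used (the thread does not say what that was). The helper name `check_dns` and the default port are assumptions made for this example.

```python
import socket


def check_dns(names, port=443):
    """Resolve each name; return {name: sorted addresses, or a failure string}.

    Hypothetical helper for illustration only, not an AKS or Azure API.
    """
    results = {}
    for name in names:
        try:
            # getaddrinfo uses the resolver configured for the pod or host.
            infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the resolved address.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            results[name] = f"FAILED: {exc}"
    return results
```

Run inside a pod, a call such as `check_dns(["kubernetes.default.svc.cluster.local", "mcr.microsoft.com"])` would compare an in-cluster name with an external one, assuming the cluster uses the default `cluster.local` domain. A failure on only one of the two hints at whether the problem is with cluster DNS or with upstream resolution.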

julia-yin commented 1 month ago

Hi @PixelRobots, thank you for sharing feedback! I have a few questions if you don't mind elaborating:

  1. Where do you typically start the troubleshooting process, in the CLI or Portal? Do you prefer troubleshooting in one place or another and why?
  2. What does your debug pod look like, and what steps did you take with it to debug the DNS issue in your cluster?
  3. Are there any examples of things you find frustrating or confusing about the current CLI experience (kubectl)?

ma-ts commented 1 month ago

Hi @julia-yin, thanks so much!

We're a heavy AKS user, also working with many private and public preview features. To answer your questions:

There's a couple of things that we are missing that are difficult right now:

julia-yin commented 1 month ago

Hi @ma-ts, really appreciate the detailed explanation of your current troubleshooting methods and feedback. Some further questions for you if you don't mind elaborating:

  1. Your team uses multiple different methods for debugging various types of issues, such as node/networking/DNS. Once an issue is detected from within your AKS cluster, how does your team proceed to narrow down possible sources and determine the tools needed?
  2. Does your team also conduct network monitoring, such as using the Hubble metric & monitoring functionality to store Hubble data?
  3. How does your team typically prefer to view and debug logs (ex. for DNS)?

JoeyC-Dev commented 1 month ago

Presenting syslog together with the logs of core pods (calico, ip-masq, etc.) would be useful, especially when facing network issues under a strict network policy (e.g. a blocked calico port, mcr.microsoft.com not being allowed, an expired API server certificate, an expired SP, etc.). Basically, all of these errors leave a trace in syslog or the related core logs. It is not possible to categorize every possible root cause for customers, but gathering these logs on one page would give them a direction.
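The idea of gathering relevant log lines in one place can be sketched as a simple keyword-based grouping. This is a hypothetical illustration, not an existing AKS feature. The category names and regular expressions below are assumptions chosen to mirror the examples in the comment above, and real log messages may be worded differently.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration; tune them to the messages
# actually seen in your own syslog and core pod logs.
PATTERNS = {
    "certificate": re.compile(r"certificate.*(expired|not yet valid)|x509", re.I),
    "registry": re.compile(r"mcr\.microsoft\.com|ImagePullBackOff|ErrImagePull", re.I),
    "network_policy": re.compile(r"calico|connection refused|i/o timeout", re.I),
    "credentials": re.compile(r"service principal|AADSTS\d+|unauthorized", re.I),
}


def categorize(lines):
    """Group log lines by every category whose pattern matches them."""
    buckets = defaultdict(list)
    for line in lines:
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                buckets[category].append(line)
    return dict(buckets)
```

A line can land in more than one bucket, which is deliberate: a timeout while pulling from a registry is relevant both as a registry problem and as a possible network policy block. Lines that match nothing are dropped, so the result is a shortlist to read first, not a diagnosis.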