Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

AKS network policies with azure-npm:v1.1.8 doesn't seem to be production ready #2210

Closed HansK-p closed 3 years ago

HansK-p commented 3 years ago

What happened: There were network timeout issues affecting a namespace where Network Policies are applied. In this namespace there are only ingress network policies, but egress traffic was also affected when the incident hit us. This means that DNS lookups timed out, access to external and internal URLs timed out and access from the Nginx Ingress to the service timed out.

Removing and/or reapplying both network policies did not help. Forcing the pod in this namespace to start on another node did not help.

The incident was solved by the following command: kubectl rollout restart daemonset azure-npm -n kube-system

This incident happened directly after, or a short time after, an upgrade from AKS 1.18.10 to 1.19.7, but I suspect this isn't relevant. The issue seems to be that we simply can't use Azure AKS Network policies for critical workloads, as they aren't production ready. This is an old problem, and the situation is a lot better now than it was a year ago.

What you expected to happen: I expect AKS Azure Network policies to reach production readiness. I'm using Calico at home and Calico network policies have been stable (= production ready) for years.

How to reproduce it (as minimally and precisely as possible): To me it looks like it's sufficient to apply network policies in a namespace and wait. Within 6 months there will be at least one incident.

Anything else we need to know?: It is not really relevant, but....

We had another odd Network Policy issue in a new cluster around two weeks back. That incident was solved by doing a "kubectl rollout restart" on the ingress controller that wasn't able to reach the backend service. During that incident, deleting network policies helped, but reapplying those network policies killed the communication again. We did not try to restart the azure-npm daemonset.

The affected network policies were in the target namespace and not in the Ingress namespace.

In general I've started to really like AKS. These Network Policy issues represent the last nasty bugs I'd really like the AKS/Azure team to sort out so that we can start using network policies for critical workloads.

Environment:

ghost commented 3 years ago

Hi HansK-p, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 3 years ago

Triage required from @Azure/aks-pm

dstrebel commented 3 years ago

@HansK-p Any reason why you would not want to use Calico Network Policy with AKS?

HansK-p commented 3 years ago

That's actually a good question. I had the belief that Calico wasn't supported with Azure CNI, but I see that it is.

The reason not to use Calico now is that it is not officially supported by Azure/Microsoft as I understand it. From https://docs.microsoft.com/en-us/azure/aks/use-network-policies:

So it feels a bit scary to go for the Calico option (even though I'd guess it works a lot better). I'm also a bit worried that AKS addons will not support Calico network policies out of the box, but will support Azure networking policies (but not sure if that is actually something to worry about).

So my preferred option is that Microsoft spends some time and resources on stabilizing Azure networking policies, or deprecates Azure networking policies and starts fully supporting Calico networking policies.

HansK-p commented 3 years ago

Mostly FYI. Today I did a "kubectl rollout restart daemonset azure-npm -n kube-system" in order to solve a Network policies issue in one of our non-prod clusters.

So right now the following seems to be the universal way to solve incidents created by this problem:

kubectl rollout restart daemonset azure-npm -n kube-system

But it has not been validated.
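For anyone who wants to script this workaround, a minimal sketch follows. It is not a validated fix. It wraps the restart command from this thread and then waits for the DaemonSet to become ready with `kubectl rollout status`. The function names, the 180-second timeout and the injectable `run` parameter are illustrative assumptions, not anything provided by AKS.

```python
import subprocess

# Illustrative constants; the DaemonSet name and namespace are taken from the
# restart command quoted in this thread.
NAMESPACE = "kube-system"
DAEMONSET = "azure-npm"


def npm_restart_commands(timeout_seconds=180):
    """Return the kubectl invocations: restart azure-npm, then wait for the rollout."""
    target = f"daemonset/{DAEMONSET}"
    return [
        ["kubectl", "rollout", "restart", target, "-n", NAMESPACE],
        ["kubectl", "rollout", "status", target, "-n", NAMESPACE,
         f"--timeout={timeout_seconds}s"],
    ]


def restart_azure_npm(run=subprocess.run, timeout_seconds=180):
    """Run both commands in order.

    `run` is injectable (hypothetical design choice) so the function can be
    dry-run without a cluster; the default raises CalledProcessError if
    kubectl exits with a non-zero status.
    """
    for command in npm_restart_commands(timeout_seconds):
        run(command, check=True)
```

Calling `restart_azure_npm()` against the current kubectl context performs the restart. Passing `run=print` only prints the commands that would be executed.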

HansK-p commented 3 years ago

And I also had to restart those azure-npm pods yesterday in a prod environment. I'm not going to write an entry here each time I have to restart those pods. This is just to give an indication that Azure Network Policies is unstable in our environment, for more than one cluster, and I doubt that our setup itself is the reason it is unstable.

ghost commented 3 years ago

Action required from @Azure/aks-pm

ghost commented 3 years ago

Issue needing attention of @Azure/aks-leads

danquah commented 3 years ago

AKS version: v1.18.10 Azure-npm: v1.1.8

Just wanted to pitch in with a more or less identical experience. We quite often experience that two pods that are

will lose network connectivity independently of each other. That is, one pod can do DNS requests against the internal CoreDNS while the other can't. We can usually resolve the issue by either restarting azure-npm or restarting the node. If we have the issue with one pod, we usually have the same issue for all pods on the same node that have network policies applied.

weinong commented 3 years ago

likely related to https://github.com/Azure/azure-container-networking/issues/854

HansK-p commented 3 years ago

It might be related to the issue mentioned. I must admit that my biggest worry is that the Azure implementation of Network Policies doesn't seem production ready, and I don't know if it will ever become "production ready". And this even though it is the only option for which we can get support from Microsoft (as I understand it).

The nice YouTube video Part I of III: Configure your AKS Cluster with Confidence was released a few days ago, and I don't think they even mentioned Azure network policies as an option there. This video was created by a "group within Microsoft Solutions focused on Cloud Native technologies", and it might be telling that they seem to have ignored Azure Network Policies when they presented their opinionated view of how an AKS cluster should be configured.

I suppose I'm searching for the right person to tell me that Azure Network Policies is actively being worked on and that it will be production ready before too long, or something else.

danquah commented 3 years ago

Sounds related @weinong ! - do you know anything about how/when this fix will be rolled out?

danquah commented 3 years ago

@HansK-p with regards to the video (it looks good, I'll have to do a full walkthrough) - you're wondering why it only mentions calico for network policies and not azure-npm right?

HansK-p commented 3 years ago

I'm mostly worried. About a year ago, Azure network policies were simply hopeless (imho). Now they are a lot better and we even considered using them in production, but in our experience they are simply not production ready, and the issues are really basic. We don't do anything advanced, we don't have anything special in our clusters, we have very standard rulesets, and still Azure Network policies are painfully unreliable.

That worries me.

weinong commented 3 years ago

@vakalapa can you share the rollout plan for https://github.com/Azure/azure-container-networking/issues/854

chandanAggarwal commented 3 years ago

@HansK-p we've made significant improvements to NPM's reliability and its handling of scale/perf, and we've beefed up our investments in this space to make it first class. It has come a long way; some of the details are below.

We're in the process of rolling out 1.3.0 broadly (it has rolled out to our canary regions). It includes a bunch of fixes on top of the existing investments we're making, and some of the other improvements are planned for 1.3.1.

Improvements

Bugs fixed
- NPM does not exclude host network Pods from the network policies, resulting in blocking of traffic and disruption in system functions such as collection of kubectl logs.
- NPM now supports NameSpace label updates.
- Changing rule evaluation behavior to (INGRESS and EGRESS). Before this change, NPM would have allowed traffic if there is a single “ACCEPT” rule in either ingress or egress direction. With this change, NPM evaluates both ingress and egress rules to take a decision on the packet.

Original behavior logic:

(Ingress rule 1 OR ingress rule 2 OR ingress rule 3 OR egress rule 1 OR Egress rule 2)

Current Behavior:

(Ingress rule 1 OR ingress rule 2 OR ingress rule 3) AND (egress rule 1 OR Egress rule 2)
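The difference between the two expressions can be illustrated with a small toy model. This is only a sketch of the boolean logic quoted above, not NPM's actual iptables implementation. The function names are hypothetical, each rule is reduced to a boolean meaning "this rule would ACCEPT the packet", and it assumes every direction has at least one rule, as in the example expressions.

```python
def allowed_before(ingress_rules, egress_rules):
    """Original behavior: a single ACCEPT in either direction allows the packet."""
    return any(ingress_rules) or any(egress_rules)


def allowed_after(ingress_rules, egress_rules):
    """Current behavior: an ACCEPT is needed in the ingress AND the egress direction."""
    return any(ingress_rules) and any(egress_rules)


# Three ingress rules that all reject the packet, and one egress rule that accepts it.
ingress = [False, False, False]
egress = [True, False]

print(allowed_before(ingress, egress))  # True: the egress ACCEPT alone was enough
print(allowed_after(ingress, egress))   # False: no ingress rule accepts the packet
```

In this model a packet that matched only an egress rule passed under the original logic and is dropped under the current logic, which is the behavior change described above.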

We've also added Prometheus and Azure Monitor support for monitoring latency, iptables and ipset rules: https://docs.microsoft.com/en-us/azure/virtual-network/kubernetes-network-policies#monitor-and-visualize-network-configurations-with-azure-npm

You can check the release page for more details on commits etc.: https://github.com/Azure/azure-container-networking/releases

HansK-p commented 3 years ago

It is good to hear that there is a lot of work going on in order to make Azure Network Policies production ready.

Approximately when do you expect Azure Network Policies to reach production quality? I'm asking as we are raising our focus on security and AKS. Applying Network Policies around applications does add a lot of security (even though it is only a piece in the puzzle).

neaggarwMS commented 3 years ago

@HansK-p

Can you share the region so that we can better assist you with timelines for 1.3.0 rollout in that region?

The Azure platform is continuously improving the quality of its stack and, as @chandanAggarwal highlighted above, we are rolling out huge reliability improvements in NPM.

Along with that, we are also conformant with the Kubernetes Ginkgo E2E tests for Network policies, which are maintained by Sig-network (https://github.com/kubernetes/kubernetes/tree/master/test/e2e/network/netpol)

Currently we are also working on integrating the cyclonus test coverage with Azure NPM. https://kubernetes.io/blog/2021/04/20/defining-networkpolicy-conformance-cni-providers/#cyclonus

HansK-p commented 3 years ago

I like that you are really working to solve remaining issues with Azure Network policies :).

We have permanent AKS clusters in:

To me it sounds like it should be safe to apply Azure Network Policies in production after Summer Vacation 2021. Maybe earlier, but we need to have our things running stable in non-production for a few weeks before we start applying network policies to (critical) production infrastructure.

ghost commented 3 years ago

Triage required from @Azure/aks-pm

neaggarwMS commented 3 years ago

@HansK-p

Internally we have our own non-production regions (USeast2Euap and CentralUSEuap) and, like any other service, we ensure NPM is stabilized in those regions before we start rolling out broadly.

After our non-production regions, the NPM deployment will cover all production regions by following a safe rollout plan (deploying in a few regions at a time). We can't guarantee the sequence for your non-production and production clusters. NPM will pick up the latest build after you reconcile your cluster (which could occur on any update operation on the AKS cluster).

Which region is your non-production workload running in? Can you share region and subscription details? We can see internally if we can deploy the latest NPM in that subscription first, to give you some room to validate before it rolls out in production.

HansK-p commented 3 years ago

Hello, and thanks for the offer.

The only place we currently have production workloads is West Europe, and we do not use Network policies for production workloads, except for some minor internal utilities we use ourselves.

Our current plan is to have a few non-important applications protected by network policies, and then expand their usage when we think it is safe.

What I have gotten out of this ticket is that there should be no need to switch to other means of getting the same protection as we expect to get from Azure Network Policies (Calico, Service Mesh, possibly other means), and that Azure Network Policies is expected to reach production maturity within a few months.

If it sounds ok, I'll create a support ticket if we still have issues when we reach June.