@ludydoo We are in the process of rolling out v1.3.0 (currently rolled out in EastUS2Euap and CentralUSEuap). It should reach all production regions by ~05/07. We've made significant improvements to reliability and to handling scale/perf with NPM (some of the details are below), and we've beefed up our investment in this space to make it first class. It has come a long way.
Improvements:
Bugs fixed in the rule evaluation logic.
Original behavior: (ingress rule 1 OR ingress rule 2 OR ingress rule 3 OR egress rule 1 OR egress rule 2)
Current behavior: (ingress rule 1 OR ingress rule 2 OR ingress rule 3) AND (egress rule 1 OR egress rule 2)
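To make the evaluation change concrete, here is a hypothetical sketch; the `demo` namespace and the `client`/`frontend`/`backend` labels are made-up placeholders, not from this thread:

```bash
# A pod labeled app=client is selected by a policy with both ingress and
# egress rules (all names below are hypothetical).
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: client-policy
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: client
  policyTypes: [ Ingress, Egress ]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: frontend }   # ingress rule 1
  egress:
    - to:
        - podSelector:
            matchLabels: { app: backend }    # egress rule 1
EOF
# Old (buggy) evaluation: a packet was allowed if it matched ANY rule,
# ingress or egress, so egress from "client" to "frontend" could slip
# through via ingress rule 1.
# Fixed evaluation (upstream semantics): ingress traffic must match an
# ingress rule AND egress traffic must match an egress rule, so "client"
# may only send to "backend" and only receive from "frontend".
```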
We've also added Prometheus and Azure Monitor support for monitoring latency, iptables rules, and ipset entries: https://docs.microsoft.com/en-us/azure/virtual-network/kubernetes-network-policies#monitor-and-visualize-network-configurations-with-azure-npm
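As a quick way to eyeball those metrics from one NPM agent, something like the following may work (a sketch: the `k8s-app=azure-npm` label, port 10091, the `/node-metrics` path, and the metric names are assumptions based on the linked doc; verify them against your NPM version):

```bash
# Grab one azure-npm pod and scrape its Prometheus endpoint locally.
NPM_POD=$(kubectl -n kube-system get pods -l k8s-app=azure-npm \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system port-forward "$NPM_POD" 10091:10091 &
sleep 2
curl -s http://localhost:10091/node-metrics |
  grep -E 'npm_num_iptables_rules|npm_num_ipsets|npm_add_policy_exec_time'
kill %1  # stop the port-forward
```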
Release Notes: https://github.com/Azure/azure-container-networking/releases
Can you share more about your setup (region, subscription, AKS cluster details, and network policies) so we can assist you better?
@ludydoo I would also like to add that we are compliant with the Ginkgo E2E conformance test suite, which is maintained by sig-network. We are currently in the process of also integrating with their Cyclonus test framework (https://kubernetes.io/blog/2021/04/20/defining-networkpolicy-conformance-cni-providers/#cyclonus).
Hi @neaggarwMS
Thanks for the answer.
Our setup is as follows:
3 x AKS 1.20.2 clusters, westeurope region
When do you estimate GA of 1.3.0 for the westeurope region? Also, is there a way we can fast-track this?
We have multiple NetworkPolicies. I'll give a few examples:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argo-server
  namespace: argo
spec:
  podSelector:
    matchLabels:
      app: argo-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow ingress from VPN-IngressGateway
    - ports:
        - port: web
          protocol: TCP
      from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
          podSelector:
            matchExpressions:
              - key: istio
                operator: In
                values:
                  - vpn-ingressgateway
    # Allow ingress from Prometheus (metrics)
    - ports:
        - port: http-envoy-prom
          protocol: TCP
        - port: 15020
          protocol: TCP
      from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
          podSelector:
            matchExpressions:
              - key: prometheus
                operator: In
                values:
                  - prometheus
  egress:
    - ports:
        - port: 15012
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              name: istio-system
          podSelector:
            matchExpressions:
              - key: istio
                operator: In
                values: [ pilot ]
    - ports:
        - protocol: UDP
          port: 53
      to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
```
```yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: minio
  namespace: argo
spec:
  podSelector:
    matchLabels:
      app: minio
  ingress:
    - ports:
        - port: 9000
          protocol: TCP
      from:
        # Allow Argo Server Ingress
        - podSelector:
            matchExpressions:
              - key: app
                operator: In
                values: [ argo-server ]
          namespaceSelector:
            matchLabels:
              name: argo
        # Allow Ingress from Argo Workflows
        - podSelector:
            matchExpressions:
              - key: workflows.argoproj.io/workflow
                operator: Exists
          namespaceSelector:
            matchExpressions:
              - key: name
                operator: In
                values: [ argo ]
    - ports:
        - port: 15020
          protocol: TCP
      from:
        # Allow Prometheus Metrics Scraping
        - podSelector:
            matchExpressions:
              - key: prometheus
                operator: In
                values: [ "prometheus" ]
          namespaceSelector:
            matchLabels:
              name: monitoring
```
```yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: workflow-controller
spec:
  podSelector:
    matchLabels:
      app: workflow-controller
  policyTypes:
    - Egress
    - Ingress
  ingress:
    # Allow Prometheus Metrics Scraping
    - ports:
        - port: 15020
          protocol: TCP
        - port: 9090
          protocol: TCP
      from:
        - podSelector:
            matchExpressions:
              - key: prometheus
                operator: In
                values: [ "prometheus" ]
          namespaceSelector:
            matchLabels:
              name: monitoring
  egress:
    - ports:
        - port: 15012
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              name: istio-system
          podSelector:
            matchExpressions:
              - key: istio
                operator: In
                values: [ pilot ]
    - ports:
        - protocol: UDP
          port: 53
      to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
    - ports:
        - protocol: TCP
          port: 8081
      to:
        - namespaceSelector:
            matchLabels:
              name: argo
          podSelector:
            matchExpressions:
              - key: app
                operator: In
                values: [ repo-server ]
```
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argo-server-egress-to-kubernetes-api
  namespace: argo
spec:
  podSelector:
    matchLabels:
      app: argo-server
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: <redacted_master_api_ip>/32
      ports:
        - protocol: TCP
          port: 443
```
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argo-workflow-controller-egress-to-kubernetes-api
  namespace: argo
spec:
  podSelector:
    matchLabels:
      app: workflow-controller
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: <redacted_master_api_ip>/32
      ports:
        - protocol: TCP
          port: 443
```
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: minio-ingress-from-argo-workflow
  namespace: argo
spec:
  podSelector:
    matchLabels:
      app: minio
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 9000
      from:
        - podSelector:
            matchExpressions:
              - key: "workflows.argoproj.io/workflow"
                operator: Exists
          namespaceSelector:
            matchExpressions:
              - key: name
                operator: In
                values: [ cicd ]
```
```yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: external-dns
  namespace: external-dns
spec:
  podSelector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - external-dns
          - external-dns-private
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - ports:
        - port: 15020
          protocol: TCP
      from:
        - podSelector:
            matchExpressions:
              - key: prometheus
                operator: In
                values: [ prometheus ]
          namespaceSelector:
            matchLabels:
              name: monitoring
  egress:
    - ports:
        - port: 15012
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              name: istio-system
          podSelector:
            matchExpressions:
              - key: istio
                operator: In
                values: [ pilot ]
    - ports:
        - protocol: UDP
          port: 53
      to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
    - ports:
        - port: 443
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              name: istio-system
          podSelector:
            matchExpressions:
              - key: istio
                operator: In
                values: [ egressgateway-sni-proxy ]
```
@ludydoo Thank you for sharing the extensive list of NetworkPolicies. v1.3.1 is being rolled out as we speak. Through our safe-deployment process, we release the version to only a subset of regions on a given day; WestEurope should get this release by EOD Monday, May 3rd. Once the cluster upgrades, please test whether these issues persist. If they do, you can either ping us here or create a support ticket with Azure Support. As Neha mentioned, 1.3.1 adds reliability improvements that should remove the flakiness with iptables rules and ipset lists.
@vakalapa I think we faced this same issue today, after 1 year in production with AKS, CNI & network policies: pods were not able to connect to each other. Kubernetes 1.17.9, North Europe. Removing the network policies and rebooting the azure-npm & CoreDNS pods resolved it, but is this random issue still ongoing in this version and region?
@tiholm, can you describe one of the Azure NPM pods and share its version? Also, can you share the cluster details (region, subscription, resource group name, and cluster name)?
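For example, something like the following should surface the image version; the `k8s-app=azure-npm` label is an assumption about how AKS labels the NPM daemonset pods:

```bash
# Print each azure-npm pod alongside the image (and thus version) it runs.
kubectl -n kube-system get pods -l k8s-app=azure-npm \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
```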
@neaggarwMS I have opened a support issue with ID 2105040050002654; we can discuss the private details there.
Sounds good, thanks @tiholm. We will look into that and get back on the case. We can close this GitHub issue.
I believe we were experiencing this issue in two clusters as well. One of the namespaces using egress network policies could not connect to services anymore. We could only "fix" it by removing all network policies.
There was a restart of the azure-npm pods around the time the errors first happened in both clusters.
Support request ID: 121050425001058
@BenjaminHerbert a new release of NPM, v1.3.2, has been deployed, and the restart of NPM might be because of that. With this version, we are changing the behavior to be in line with upstream's rule evaluation. You can find details here: https://github.com/Azure/azure-container-networking/wiki/TSG:-NPM--v1.3.0-breaking-changes
If egress is allowing the traffic and the traffic still gets dropped, there is a high probability that the ingress side of the destination is not allowing it. Can you please evaluate all ingress rules being applied on the destination pod?
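A sketch of that audit (the namespace and pod name below are placeholders): list every policy in the destination pod's namespace and compare each `spec.podSelector` against the pod's labels:

```bash
# Show the destination pod's labels, then every NetworkPolicy that could
# select it. Any matching policy whose policyTypes include Ingress must
# have an ingress rule admitting the source, or the traffic is dropped.
kubectl -n argo get pod minio-0 --show-labels
kubectl -n argo get netpol \
  -o custom-columns='NAME:.metadata.name,POD-SELECTOR:.spec.podSelector,TYPES:.spec.policyTypes'
```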
Thanks for the information.
We have a NetworkPolicy that should allow all egress traffic, which used to work fine.
> kubectl get netpol egress-to-any -o yaml
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    name: egress-to-any
  managedFields:
    - manager: kubectl
      operation: Update
      time: "2021-05-07T11:55:47Z"
  name: egress-to-any
  namespace: abnahme-lowcode
spec:
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
        - ipBlock:
            cidr: ::/0
  podSelector: {}
  policyTypes:
    - Egress
```
For the past few days, we have been having problems opening connections.
NPM currently does not support IPv6 addressing, as it relies on IPv4 ipsets and iptables for its rules. That might have prevented this policy from being applied on the node, resulting in some traffic being blocked. Please try removing the IPv6 CIDR block and applying the policy again. If that does not solve the issue, can you ask the support engineer working on your support ticket to escalate it to vakr@microsoft.com, and we can debug more internally.
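Concretely, the policy above with the `::/0` block dropped would look like this (a sketch of the suggested change):

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-to-any
  namespace: abnahme-lowcode
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0   # IPv4 only; the ::/0 ipBlock is removed
EOF
```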
Thanks for the input. I checked it without the IPv6 CIDR block and it still does not work. I asked to have our case assigned to you. Thanks for your help!
@BenjaminHerbert thank you for the debugging session; as discussed, you are hitting the known issue #870.
@ludydoo Are you still facing the issue? If so, can you raise a support case and request that it be escalated to vakr@microsoft.com? We can help resolve it. Until then, I will be closing this issue.
It seems that the issue is not present anymore.
What happened:
We have 3 AKS clusters, 3 nodes each. Mix of B16 and D16 nodes.
We use NetworkPolicies to restrict traffic between pods. We define pretty strict rules (per-pod & namespace).
`azure-npm` randomly causes pod connections to fail, as it seems the iptables rules are getting corrupted. Sometimes, when I inspect the clusters in the morning, I have multiple pods failing because `azure-npm` corrupted the iptables, without anything having changed or been touched overnight.
For example, we use argocd to manage our deployments: `argocd-server` will randomly fail to contact the `argocd-repo-server`.
Killing the `azure-npm` pods solves the problem (see the sketch below), but this is not a viable solution. I sometimes see this error in `azure-npm`.
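For reference, "killing the pods" amounts to something like this (assuming the `k8s-app=azure-npm` label on the NPM daemonset pods); the daemonset recreates them and they resync all rules:

```bash
kubectl -n kube-system delete pods -l k8s-app=azure-npm
```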
What you expected to happen:
`azure-npm` should correctly define iptables rules. We would expect AKS to have a bug-free CNI, as this is such a critical component of the infrastructure! I tried upgrading azure-npm to `1.3.0`, but it seems that AKS automatically manages this and will downgrade it back to `1.1.8`.
How to reproduce it:
Very hard to say. It sometimes happens when the labels on the pods/namespaces change, but it also happens randomly. Help in debugging this would be greatly appreciated.
Orchestrator and Version (e.g. Kubernetes, Docker):
AKS 1.20.2
azure-npm: `mcr.microsoft.com/containernetworking/azure-npm:v1.1.8`
Operating System (Linux/Windows):
Linux
Kernel (e.g. `uname -a` for Linux or `$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion` for Windows):
Anything else we need to know?: