Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Network policy not applied at the very beginning of a pod lifecycle #2996

Open BzSpi opened 2 years ago

BzSpi commented 2 years ago

Hello,

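What happened:

* My pod has access to a resource outside the pod at the very beginning of its lifecycle, despite a network policy blocking egress.

What you expected to happen:

* Resource access is blocked by the network policy.

How to reproduce it (as minimally and precisely as possible): The use case is a Machine Learning workload that accesses a storage account. This is how it seems to be configured (closed-source image).

Initial configuration:

* Network policy blocking egress outside the cluster
* Pod with an init container that accesses a resource outside the cluster at the very beginning of its lifecycle (here, an Azure Storage Account)

Behavior:

* The resource outside the cluster is accessible during pod initialization (not expected)
* The resource outside the cluster is not accessible when trying to access it later (expected)

Anything else we need to know?: Answer from Microsoft support:

> From what I have been told, this seems to be an expected behavior, since the policy might not be applied immediately and takes effect after some time.

Environment:

- Kubernetes version: 1.23.5
- Size of cluster: 3 nodes
- General description of workloads in the cluster: Azure Machine Learning models
- Others: network policy mode is "azure" (not calico)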

ghost commented 2 years ago

Hi BzSpi, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

BzSpi commented 2 years ago

I went a bit further to reproduce this issue easily.

Here is the deployed configuration:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress: []
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: myapp-replicaset
  labels:
    app: myapp
    type: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: dev
    spec:
      containers:
      - name: nginx-container
        image: nginx
      initContainers:
      - name: init-myservice
        image: curlimages/curl:latest
        command: ['curl', '-s', '--fail', "http://ip.clara.net"]

I tried this configuration with both the Azure network policy engine and the Calico one. With Calico, it behaves as expected: the init container's request is blocked, the init container fails, and the pod does not start. With the Azure engine, the init container's request succeeds and the pod starts, even though egress is blocked later in the pod lifecycle.
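
To put a number on how long egress stays open after a container starts, a small poller can be run as the first thing in the pod. This is only a rough sketch, not something from the issue: the function names `http_probe` and `measure_egress_window` are hypothetical, the target URL is simply the one used by the init container above, and it assumes a container image that ships Python. The measured value is an approximation, bounded below by the polling interval and the probe timeout.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout_s: float = 1.0) -> bool:
    """Return True if the server could be reached, False on a network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so egress itself worked.
        return True
    except OSError:
        # URLError, timeouts and connection errors all derive from OSError.
        return False


def measure_egress_window(
    probe: Callable[[], bool],
    interval_s: float = 0.05,
    max_duration_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it first fails.

    Returns the seconds elapsed when the first failure was observed,
    or None if egress was never blocked within `max_duration_s`.
    """
    start = clock()
    while clock() - start <= max_duration_s:
        if not probe():
            return clock() - start
        sleep(interval_s)
    return None


if __name__ == "__main__":
    # URL taken from the init container in the configuration above.
    window = measure_egress_window(lambda: http_probe("http://ip.clara.net"))
    if window is None:
        print("egress was never blocked during the measurement")
    else:
        print(f"egress first observed blocked after ~{window:.3f}s")
```

A result close to zero would mean the policy was already in place when the container started; a larger value suggests a window during which the deny-all policy was not yet enforced.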

IMHO, this is a serious security issue that needs immediate mitigation.

ghost commented 2 years ago

Triage required from @Azure/aks-pm

ghost commented 2 years ago

@immuzz, @justindavies would you be able to assist?

Issue Details

Hello,

**What happened**:

* My pod has access to a resource outside pod at the very beginning of its lifecycle despite a network policy blocking egress

**What you expected to happen**:

* Resource access is blocked by network policy

**How to reproduce it (as minimally and precisely as possible)**: Use case is with Machine Learning that access a storage, this is how it seems to be configured (closes source image)

Initial configuration:

* Network policy blocking egress outside cluster
* pod with init container that access resource outside cluster at the very beginning of its lifecycle (here, an Azure Storage Account)

Behavior:

* Resource outside cluster is accessible at the pod initialization (not expected)
* Resource outside cluster is not accessible when trying to access it later (expected)

**Anything else we need to know?**: Answer from the Microsoft support

> From what I was been told, seems to be an expected behavior since that policy might not be taken immediately and runs after some time.

**Environment**:

- Kubernetes version: 1.23.5
- Size of cluster: 3 nodes
- General description of workloads in the cluster: Azure Machine Learning models
- Others: network policy mode is "azure" (not calico)

Author: BzSpi
Assignees: AbelHu, allyford
Labels: `bug`, `windows`
Milestone: -

AbelHu commented 2 years ago

AKS Windows still does not support network policy "azure" cc @allyford

huntergregory commented 2 years ago

We expect to release a new version of NPM next week which applies rules 10-20x faster than before.

Using the above config and your NPM version (assuming it is v1.4.21), logs show that NPM enforces the Network Policy within 19 milliseconds of receiving the pod's IP from API Server. Using the new version of NPM, policy enforcement occurs only 2 milliseconds after receiving the pod's IP. The pod still enters Running state.

NPM doesn't interfere with traffic unless an IP is blocked by Network Policies. Due to this design, we cannot block traffic for a pod the instant it comes up, i.e. during the roughly 2 milliseconds between receiving the pod's IP from API Server and enforcing the policy.
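
The race described above can be illustrated with a toy model. This is not NPM's implementation; it is a minimal sketch that assumes a default-allow dataplane where a pod's IP is only added to a block set some delay after the IP becomes known. The class `DefaultAllowDataplane`, the function `simulate_pod_start`, and the pod IP are all hypothetical. The model also ignores any time that passes between the pod getting its IP on the node and API Server reporting it.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class DefaultAllowDataplane:
    """Toy dataplane: traffic passes unless the source IP is explicitly blocked."""

    blocked_ips: Set[str] = field(default_factory=set)

    def allows(self, src_ip: str) -> bool:
        return src_ip not in self.blocked_ips


def simulate_pod_start(
    enforcement_delay_ms: float,
    request_times_ms: Iterable[float],
) -> Dict[float, bool]:
    """Map each egress attempt time (ms after the pod IP is known) to allowed/blocked."""
    dataplane = DefaultAllowDataplane()
    pod_ip = "10.0.0.5"  # hypothetical pod IP, for illustration only
    results: Dict[float, bool] = {}
    for t in sorted(request_times_ms):
        # The block rule only exists once the enforcement delay has elapsed.
        if t >= enforcement_delay_ms:
            dataplane.blocked_ips.add(pod_ip)
        results[t] = dataplane.allows(pod_ip)
    return results


if __name__ == "__main__":
    # 19 ms and 2 ms are the enforcement delays quoted in the comment above.
    for delay in (19, 2):
        print(delay, simulate_pod_start(delay, [0, 1, 10, 50]))
```

In this model, shrinking the delay narrows the window but never closes it: any request issued before the rule is installed still passes, which matches the observation that the pod still enters Running state.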

BzSpi commented 1 year ago

I tried with a fresh AKS cluster running Kubernetes version 1.25, and the issue is still present.

chzbrgr71 commented 1 year ago

@allyford checking to see if there is a resolution