Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] `NetworkPolicy` allowing DNS egress causes cilium agent crash in ACNS-enabled AKS #4525

Open felfa01 opened 1 week ago

felfa01 commented 1 week ago

Describe the bug

On an AKS cluster with Advanced Container Networking Services (ACNS) enabled, deploying a NetworkPolicy that allows DNS egress causes cilium agent pods to enter CrashLoopBackOff.

To Reproduce

  1. Create an AKS cluster configured with the following:
    networkProfile: {
      advancedNetworking: {
        observability: {
          enabled: true
        }
        security: {
          fqdnPolicy: {
            enabled: true
          }
        }
      }
      networkPlugin: 'azure'
      networkPluginMode: 'overlay'
      networkDataplane: 'cilium'
      networkPolicy: 'cilium'
    }
  2. Deploy a NetworkPolicy configured to allow egress to port 53 with protocol UDP.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: bad-netpol
    spec:
      egress:
      - to:
        - podSelector: {}
      - ports:
        - port: 53
          protocol: UDP
      policyTypes:
      - Egress
  3. Run `kubectl get pods -n kube-system` and observe that some cilium pods are in a crashing state:
    cilium-2zchs                                           1/1     Running            0             47h
    cilium-bzj85                                           0/1     CrashLoopBackOff   21 (3s ago)   47h
    cilium-kp5qt                                           0/1     CrashLoopBackOff   23 (3s ago)   47h
    cilium-operator-5db9c9657b-k6j64                       1/1     Running            0             47h
    cilium-operator-5db9c9657b-sl2ss                       1/1     Running            0             47h
    cilium-xrdfm                                           1/1     Running            0             47h


Additional context

Error log:

time="2024-09-05T12:29:14Z" level=info msg="NetworkPolicy successfully added" k8sApiVersion= k8sNetworkPolicyName=k6-enable-connection subsys=k8s-watcher
time="2024-09-05T12:29:14Z" level=info msg="Policy imported via API, recalculating..." policyAddRequest=c6eba7b9-6b84-486f-86a8-7ca94cc99486 policyRevision=23 subsys=daemon
time="2024-09-05T12:29:14Z" level=info msg="Sending Policy updates to sdp: endpoint_id:1500  port:53  rules:{selector_string:\"&LabelSelector{MatchLabels:map[string]string{any.k8s-app: kube-dns,k8s.io.kubernetes.pod.namespace: kube-system,},MatchExpressions:[]LabelSelectorRequirement{},}\"  port_rules:{match_pattern:\"*\"}  selections:39072}" subsys=fqdn/server
time="2024-09-05T12:29:14Z" level=info msg="Sending update to stream: &{0xc0029d81e0}" subsys=fqdn/server
time="2024-09-05T12:29:14Z" level=info msg="Updating the DNS rules for endpoint 1500" subsys=proxy
panic: runtime error: invalid memory address or nil pointer dereference
    panic: Trying to configure zero proxy port
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x2b3e3ec]

goroutine 1301 [running]:
github.com/cilium/cilium/pkg/proxy.(*Proxy).CreateOrUpdateRedirect.func1()
    /go/src/github.com/cilium/cilium/pkg/proxy/proxy.go:469 +0x6d
panic({0x2ff3920?, 0x5bfe340?})
    /usr/local/go/src/runtime/panic.go:920 +0x270
github.com/cilium/cilium/pkg/fqdn/service.(*FQDNDataServer).UpdateSDPAllowed(0xc001bdbf80, 0x5dc, 0x1110035, 0xc003a65770)
    /go/src/github.com/cilium/cilium/pkg/fqdn/service/service.go:61 +0x1ec
github.com/cilium/cilium/pkg/proxy.(*dnsRedirect).setRules(0xc001d11580, 0xc004a3f378?, 0xc003a65770)
    /go/src/github.com/cilium/cilium/pkg/proxy/dns.go:61 +0x217
github.com/cilium/cilium/pkg/proxy.(*dnsRedirect).UpdateRules(0xc001d11580, 0x3c933d0?)
    /go/src/github.com/cilium/cilium/pkg/proxy/dns.go:77 +0x2c
github.com/cilium/cilium/pkg/proxy.(*Proxy).CreateOrUpdateRedirect(0xc0005c7500, {0x3c896f0?, 0xc00084e6e0}, {0x3c933d0, 0xc00116c580}, {0xc003bf3d88, 0x12}, {0x3ca27a0, 0xc000b0aa80}, 0xc003fcb400)
    /go/src/github.com/cilium/cilium/pkg/proxy/proxy.go:503 +0x4d8
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirectsFromDesiredPolicy.func1(0xc004165380)
    /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:243 +0x166
github.com/cilium/cilium/pkg/policy.L4DirectionPolicy.updateRedirects({0xc003858de0?, 0x60?}, 0xc0042ca0c0, 0xc004a402f8, {0xc003a64cc0?, 0x0?, 0xc003a64cf0?})
    /go/src/github.com/cilium/cilium/pkg/policy/resolve.go:214 +0x196
github.com/cilium/cilium/pkg/policy.(*EndpointPolicy).UpdateRedirects(0x10?, 0xc0?, 0x4108c5?, {0xc003a64cc0?, 0x0?, 0xc003a64cf0?})
    /go/src/github.com/cilium/cilium/pkg/policy/resolve.go:199 +0x4d
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirectsFromDesiredPolicy(0xc000b0aa80, 0x0?, 0xc003a64a80, 0xc003fcb400)
    /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:217 +0x16d
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirects(0xc000b0aa80, 0xc003a64a50?)
    /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:419 +0x230
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runPreCompilationSteps(0xc000b0aa80, 0xc00234a800, 0xc0038586c0)
    /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:840 +0x6d6
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0xc000b0aa80, 0xc00234a800)
    /go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:544 +0x190
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc000b0aa80, 0xc00234a800)
    /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:472 +0x7b1
github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0xc000a4e080, 0xc000cb51a0?)
    /go/src/github.com/cilium/cilium/pkg/endpoint/events.go:57 +0x1de
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1()
    /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:245 +0x133
sync.(*Once).doSlow(0xc001104fd0?, 0x44591c?)
    /usr/local/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
    /usr/local/go/src/sync/once.go:65
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run(0xc001104f38?)
    /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:233 +0x3c
created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run in goroutine 1253
    /go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:229 +0x69

felfa01 commented 1 week ago

@chasewilson FYI, I have noticed this with the ACNS feature.

tamilmani1989 commented 1 week ago

@felfa01 Thanks for reporting. We are looking into this.

vipul-21 commented 6 days ago

Thanks @felfa01. We were able to reproduce the issue on our end and are working on a fix. The cause is that two NetworkPolicies are applied to the same endpoint, and one of them does not contain any DNS rules. When both policies are applied (in this case `k6-enable-connection`, with DNS rules, and `bad-netpol`, without DNS rules), the cilium agent creates the DNS redirection for `k6-enable-connection` and tries to reuse the same redirection for `bad-netpol`. Because the ACNS feature currently supports only DNS-based policies, policy recalculation then crashes the cilium agent: the DNS policy for `bad-netpol` is nil.

avo-sepp commented 3 days ago

We ran into this problem in normal operation in one of our clusters. I can confirm this bug exists, and the short-term fix is to remove the NetworkPolicy that allows DNS egress.

vipul-21 commented 3 days ago

Confirming that the short-term fix is to remove the NetworkPolicy that has a DNS egress rule specified.
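
For anyone applying the workaround across many policies, the following Go sketch shows one way to flag candidates. It rests on assumptions: the `networkPolicy`, `egressRule`, and `egressPort` types are simplified, hypothetical stand-ins for the `networking.k8s.io/v1` fields, and the check matches only egress rules that name port 53 explicitly. It does not handle named ports, and it ignores rules with no `ports` list (which allow every port).

```go
package main

import "fmt"

// egressPort, egressRule and networkPolicy are simplified, hypothetical
// stand-ins for the corresponding networking.k8s.io/v1 NetworkPolicy fields.
type egressPort struct {
	port     int
	protocol string
}

type egressRule struct {
	ports []egressPort
}

type networkPolicy struct {
	name   string
	egress []egressRule
}

// allowsDNSEgress reports whether any egress rule names port 53 explicitly.
func allowsDNSEgress(np networkPolicy) bool {
	for _, rule := range np.egress {
		for _, p := range rule.ports {
			if p.port == 53 {
				return true
			}
		}
	}
	return false
}

// dnsEgressPolicies returns the names of the policies that allow DNS egress.
func dnsEgressPolicies(policies []networkPolicy) []string {
	var names []string
	for _, np := range policies {
		if allowsDNSEgress(np) {
			names = append(names, np.name)
		}
	}
	return names
}

func main() {
	policies := []networkPolicy{
		// Mirrors the shape of bad-netpol: one rule without ports, one with UDP 53.
		{name: "bad-netpol", egress: []egressRule{
			{},
			{ports: []egressPort{{port: 53, protocol: "UDP"}}},
		}},
		// Hypothetical policy that only allows HTTPS egress.
		{name: "web-only", egress: []egressRule{
			{ports: []egressPort{{port: 443, protocol: "TCP"}}},
		}},
	}
	fmt.Println(dnsEgressPolicies(policies))
}
```

In a real cluster, the input would come from the Kubernetes API rather than from literals. Treat the flagged list as a starting point for review, not as a definitive set of policies to delete.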