cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Valid CIDR Identities are getting released upon Cilium Agent restart #27210

Closed · carnerito closed this issue 1 year ago

carnerito commented 1 year ago

Is there an existing issue for this?

What happened?

I have CiliumClusterwideNetworkPolicies deployed and the Host Firewall enabled. Among other CCNPs that allow Cilium to function normally, I have a CCNP that allows access from trusted subnets (private network, home VPN, etc.), and it looks like this:

apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "whitelisted-cidrs"
spec:
  description: "Allow access from trusted CIDRs"
  nodeSelector:
    matchLabels: {}
  ingress:
  - fromCIDR:
    - "127.0.0.0/8"
    - "10.0.0.0/8"
    - "172.16.0.0/12"
    - "192.168.0.0/16"
    - "<trusted public ip>:32"

When this policy is applied for the first time, everything works as expected. The problem occurs 10 minutes after the Cilium agent is restarted and manifests as lost connectivity to the cluster from the addresses defined in the whitelisted-cidrs CCNP.

Since this is a huge problem for me, because I rely heavily on the Cilium Host Firewall feature, I did some investigation; these are my findings:

After I figured out what was happening, I reverted the logic in releaseIdentity to look like this:

    releaseIdentity:
        if entryExists {
            // ...
            if _, ok := idsToAdd[oldID.ID]; !ok {
                previouslyAllocatedIdentities[prefix] = oldID
            }

            // ...
            if prefixInfo == nil && oldID.createdFromMetadata {
                entriesToDelete[prefix] = oldID
            }
        }

This change solved the issue, but since I'm new to the Cilium code base, I'm not sure whether it will have unwanted side effects. If this change looks good, I can submit a PR.

How to reproduce in a local Kind cluster

extraArgs:
  - '--identity-restore-grace-period'
  - '2m'
hostFirewall:
  enabled: "true"

Cilium Version

v1.14.0

Kernel Version

.

Kubernetes Version

v1.27.3

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

kimma commented 1 year ago

Chiming in to say I'm having the same problem. It looks like any use of fromCIDR or toCIDR in CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy is impossible due to this.

The following CiliumNetworkPolicy:

    - fromCIDR:
        - 10.0.0.0/8
        - 192.168.0.0/16
        - 2a0e:97c0:250::/44
        - 2a02:2c8:f000::/48
        - fd00::/8
      toPorts:
        - ports:
          - port: "8443"
            protocol: TCP
          - port: "8000"
            protocol: TCP

This should allow the following traffic (and it worked flawlessly until the upgrade), but it is now denied:

Ethernet    {Contents=[..14..] Payload=[..50..] SrcMAC=b6:6a:89:17:4f:7a DstMAC=ae:9d:38:b9:4a:77 EthernetType=IPv4 Length=0}
IPv4    {Contents=[..20..] Payload=[..28..] Version=4 IHL=5 TOS=0 Length=48 Id=0 Flags=DF FragOffset=0 TTL=61 Protocol=TCP Checksum=60901 SrcIP=10.12.123.10 DstIP=10.12.192.192 Options=[] Padding=[]}
TCP {Contents=[..28..] Payload=[] SrcPort=53009 DstPort=8443(pcsync-https) Seq=3964054339 Ack=0 DataOffset=7 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=65535 Checksum=44278 Urgent=0 Options=[TCPOption(MSS:1380 0x0564), TCPOption(SACKPermitted:), TCPOption(EndList:)] Padding=[0]}
CPU 08: MARK 0xc881a88a FROM 3257 DROP: 62 bytes, reason Policy denied, identity 16777228->136429, to endpoint 3257
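
For readers less familiar with the CRD, a rule like the one above sits inside a full policy under spec.ingress. A sketch of the surrounding structure, with a placeholder name and endpointSelector (neither is taken from the original report):

    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: allow-trusted-cidrs   # placeholder name
    spec:
      endpointSelector: {}        # placeholder; the real selector is not shown above
      ingress:
        - fromCIDR:
            - 10.0.0.0/8
            - 192.168.0.0/16
            - 2a0e:97c0:250::/44
            - 2a02:2c8:f000::/48
            - fd00::/8
          toPorts:
            - ports:
                - port: "8443"
                  protocol: TCP
                - port: "8000"
                  protocol: TCP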
jwitko commented 1 year ago

I think this is the same or very similar to https://github.com/cilium/cilium/issues/27176 https://github.com/cilium/cilium/issues/27210

I have the same issue, and at this point basically any change to policies triggers it for me. The nodes assume a new identity within Cilium that isn't properly marked as host/remote-node.

This seems to get the physical servers wedged, and I cannot talk to other nodes until I stop/start the host interfaces. When I do that, the nodes seem to assume a properly marked identity as host or remote-node as appropriate. This can become quite a burden when hostFirewall is enabled, as the identifying CIDRs that I SSH from can also lose their IDs and therefore don't get picked up under the cidr:10.0.0.0/8 label.

This also does nothing to fix all the broken/lost IDs for non-hosts. Those are typically stuck in terminating. If I force terminate, I am left with the kubelet error: Warning FailedCreatePodSandBox 28s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0869e8a2724dbc910f2ea02fadb47297a1f017663725e3e0832d5af79ff9db68": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

There is also an issue with the floating IP that is load-balanced between the control-plane nodes via kube-vip. Restarting interfaces does not reset the identity of this IP, most likely because it does not belong to any interface directly.

joestringer commented 1 year ago

Hi folks, thanks for the detailed reports and bisection. I've prepared PRs to solve these issues, and they appear to solve the problem for me locally. If you are able to test out the fix as well, that would help us confirm that the fixes resolve your reported issues without introducing other unintended breakage.

The following hotfix images are also available for testing out this fix.

v1 (https://github.com/cilium/cilium/pull/27305)

https://github.com/cilium/cilium/actions/runs/5778136016/job/15659097948

quay.io/cilium/cilium-dev:cidr-identity-refcnt-fix@sha256:0f3300334b7b4b86042cebe9271d959df8066acfd04f48bb9cdee076965bb946
quay.io/cilium/clustermesh-apiserver-dev:cidr-identity-refcnt-fix@sha256:3d76b0d8d985c1569197b7feb4aceb2062d2fe50df6811183a9b8a835fe5abc7
quay.io/cilium/docker-plugin-dev:cidr-identity-refcnt-fix@sha256:0c19480912a38943e2703cd2c1772fbc451778f260816c6a058ef3e8e5158314
quay.io/cilium/hubble-relay-dev:cidr-identity-refcnt-fix@sha256:1be4a8144c244002ac2c413b0be95e3f54d43ffe67dab2983d3f126f4938e71b
quay.io/cilium/kvstoremesh-dev:cidr-identity-refcnt-fix@sha256:8ca77f95995093e8cf0cd5176bb1bcbf32711393ab63e4b426509b2ef4c6883d
quay.io/cilium/operator-alibabacloud-dev:cidr-identity-refcnt-fix@sha256:86bd658544c51f13c105d3fdf7c2413c3fe423c07e1f1d2e858ac57206ed6692
quay.io/cilium/operator-aws-dev:cidr-identity-refcnt-fix@sha256:3dcd0686f799025673dda47665c7471c0036ae6fabe43241d3806f8bc118da50
quay.io/cilium/operator-azure-dev:cidr-identity-refcnt-fix@sha256:edc8c7bdf0913ff7c6f0d5df4a695322a3d65e9be7dc103869d3735e25e326ba
quay.io/cilium/operator-generic-dev:cidr-identity-refcnt-fix@sha256:e0506e8e70e401521ebff51e82f7dbaca625bb5c043d0bf1ed0553fb7c4ae2f7
quay.io/cilium/operator-dev:cidr-identity-refcnt-fix@sha256:6fc76b07e4a59e3bf9eae8c5e8adca9df416bb8c976dc0144cd478b22ca67eaa

v2 (https://github.com/cilium/cilium/pull/27327)

https://github.com/cilium/cilium/actions/runs/5791358233/job/15696278428

quay.io/cilium/cilium-dev:v1.14-with-27327@sha256:c23bbd046e64f997c62393b6fd32fbf09bd29359b1d3933b7b40f70c6885f1ba
quay.io/cilium/clustermesh-apiserver-dev:v1.14-with-27327@sha256:183aaff8960a7379e03a8af7d1ad94c9c847c5f0a90c2fbf3dec5214d593a967
quay.io/cilium/docker-plugin-dev:v1.14-with-27327@sha256:064e07ff96216c47f2f8aa67ff59da0d801a2a6f5c729a104f380527c2728bd4
quay.io/cilium/hubble-relay-dev:v1.14-with-27327@sha256:104d6244a5a1799ce6b5a3b6934a635f50270b8b1da648a2553ffc7f85fd58e5
quay.io/cilium/kvstoremesh-dev:v1.14-with-27327@sha256:c09dada89ddaa73840aff880b25de8d76f603d73949fedad1940bff1d42e3559
quay.io/cilium/operator-alibabacloud-dev:v1.14-with-27327@sha256:1e58c6748692793265412b4c3f2f473f709efba7179aa0e3996b4afafb1e919f
quay.io/cilium/operator-aws-dev:v1.14-with-27327@sha256:4a02d1511c02484ff11949fc83c5d196ca0114b4427cf67a7494c8a0df22faee
quay.io/cilium/operator-azure-dev:v1.14-with-27327@sha256:d89395b946746bff48162ade4272b59c1f68575ea8638ca0d57bc3f013525091
quay.io/cilium/operator-generic-dev:v1.14-with-27327@sha256:8add96d4c4fb4c9639355eaef04058b15e606fdc56a9ca7bf915b15682e11045
quay.io/cilium/operator-dev:v1.14-with-27327@sha256:8064961869571b2e9faac59480812c83efb87dc4ee1b337dc5145c9126db1877

EDIT: Updated for v2
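
For anyone trying these images, one possible way to swap them in is through Helm values. A sketch assuming the chart's image.override and operator.image.override fields, using the v2 agent and generic operator images listed above:

    image:
      # Agent image from the v2 hotfix list above.
      override: "quay.io/cilium/cilium-dev:v1.14-with-27327@sha256:c23bbd046e64f997c62393b6fd32fbf09bd29359b1d3933b7b40f70c6885f1ba"
    operator:
      image:
        # Generic operator image from the v2 hotfix list above.
        override: "quay.io/cilium/operator-generic-dev:v1.14-with-27327@sha256:8add96d4c4fb4c9639355eaef04058b15e606fdc56a9ca7bf915b15682e11045"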

kimma commented 1 year ago

Has been running now for over 30 minutes without any issues with toCIDR/fromCIDR, so looking good from my end at least!

cilium-gtqsf                              1/1     Running   0              36m
cilium-s8fg7                              1/1     Running   0              35m
cilium-ztkgd                              1/1     Running   0              36m
carnerito commented 1 year ago

I'm confirming what @kimma said. Looking good now. I've set identity-restore-grace-period to 2m; all toCIDR/fromCIDR identities survive agent restarts, and the cluster remains operational after the grace period expires.

jwitko commented 1 year ago

Additionally confirmed. Set identity-restore-grace-period to 5m. Pods have been up 20+ minutes and no issues. Deny logs do not show anything unusual, pod logs all look happy.

joestringer commented 1 year ago

I've prepared fresh images with my v2 via https://github.com/cilium/cilium/pull/27327, see above.