cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Cilium identity not correct when trying connect from outside cluster with nodeport, GENEVE + DSR #33906

Closed zzbmmbzz closed 1 month ago

zzbmmbzz commented 2 months ago

What happened?

I applied a Cilium network policy and ran into a problem when calling a service inside the cluster from outside via NodePort.

Details: I have deployed the service netpol2-nginx on node 10.30.80.140 and exposed it via NodePort 443:32044. The Cilium network policy allows ingress from 10.164.33.200/32 to netpol2-nginx on port 443.

Service: netpol2-nginx, deployed on node 10.30.80.140, exposing node port 32044/TCP
---
#Policy:
endpointSelector:
  matchLabels:
    workload-selector: netpol2-nginx
ingress:
  - fromCIDRSet:
      - cidr: 10.164.33.200/32
    toPorts:
      - ports:
          - port: "443"
  - fromEntities:
      - cluster

What happens:

# case 1: Connect from 10.164.33.200 to 10.30.80.140 port 32044 -> success

# cilium monitor -vvv | grep 10.164.33.200 on node 10.30.80.140
Policy verdict log: flow 0x0 local EP ID 158, remote ID 16777221, proto 6, ingress, action allow, auth: disabled, match L3-L4, 10.164.33.200:57293 -> 172.16.17.60:443 tcp SYN
-> endpoint 158 flow 0x0 , identity 16777221->70150 state new ifindex lxc7f4967ddf9c2 orig-ip 10.164.33.200: 10.164.33.200:57293 -> 172.16.17.60:443 tcp SYN
-> network flow 0x8233f23e , identity 70150->16777221 state reply ifindex 0 orig-ip 0.0.0.0: 172.16.17.60:443 -> 10.164.33.200:57293 tcp SYN, ACK
-> endpoint 158 flow 0x0 , identity 16777221->70150 state established ifindex lxc7f4967ddf9c2 orig-ip 10.164.33.200: 10.164.33.200:57293 -> 172.16.17.60:443 tcp ACK
# case 2: Connect from 10.164.33.200 to another k8s cluster node (eg. 10.30.80.157) port 32044 -> timeout

# cilium monitor -vvv | grep 10.164.33.200 on node 10.30.80.157
-> overlay flow 0x0 , identity world->70150 state new ifindex 0 orig-ip 0.0.0.0: 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN
-> overlay flow 0x0 , identity world->70150 state new ifindex 0 orig-ip 0.0.0.0: 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN
-> overlay flow 0x0 , identity world->70150 state new ifindex 0 orig-ip 0.0.0.0: 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN

# cilium monitor -vvv | grep 10.164.33.200 on node 10.30.80.140
Policy verdict log: flow 0x0 local EP ID 158, remote ID world, proto 6, ingress, action deny, auth: disabled, match none, 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 158, ifindex 5, file bpf_lxc.c:1972, , identity world->70150: 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN
Policy verdict log: flow 0x0 local EP ID 158, remote ID world, proto 6, ingress, action deny, auth: disabled, match none, 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 158, ifindex 5, file bpf_lxc.c:1972, , identity world->70150: 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN
Policy verdict log: flow 0x0 local EP ID 158, remote ID world, proto 6, ingress, action deny, auth: disabled, match none, 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 158, ifindex 5, file bpf_lxc.c:1972, , identity world->70150: 10.164.33.200:45377 -> 172.16.17.60:443 tcp SYN

In case 2 the source identity looks incorrect (identity 2, world); the expected identity is 16777221.
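
For reference, identities at or above 16777216 are locally-allocated CIDR identities. What the agent resolved for the client IP can be inspected from inside the Cilium pod on 10.30.80.140 (a sketch; output formats vary by version):

# cilium identity get 16777221
# cilium bpf ipcache get 10.164.33.200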

Is there any solution for this issue?

Thanks

How can we reproduce the issue?

  1. Apply the Cilium network policy
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: ingress
  namespace: netpol-2
spec:
  endpointSelector:
    matchLabels:
      workload-selector: netpol2-nginx
  ingress:
    - fromCIDRSet:
        - cidr: 10.164.33.200/32
      toPorts:
        - ports:
            - port: "443"
    - fromEntities:
        - cluster
  2. Deploy the service netpol2-nginx and expose the node ports
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        workload-selector: netpol2-nginx
      name: netpol2-nginx
      namespace: netpol-2
    spec:
      containers:
        - image: nginx:alpine
          name: nginx
          ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: netpol2-nginx
      namespace: netpol-2
    spec:
      ports:
        - name: http
          port: 80
          protocol: TCP
          targetPort: 80
          nodePort: 32081
        - name: https
          port: 443
          protocol: TCP
          targetPort: 443
          nodePort: 32044
      selector:
        workload-selector: netpol2-nginx
      type: NodePort
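
As a quick sanity check before testing from outside, the NodePort allocations can be confirmed with standard kubectl:

# kubectl -n netpol-2 get svc netpol2-nginx

The PORT(S) column should show 80:32081/TCP and 443:32044/TCP.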

  3. SSH to 10.164.33.200 and try to connect

# telnet 10.30.80.140 32044
Trying 10.30.80.140...
Connected to 10.30.80.140.
Escape character is '^]'.
^CConnection closed by foreign host

# telnet 10.30.80.157 32044
Trying 10.30.80.157...
telnet: connect to address 10.30.80.157: Connection timed out

Cilium helm values

bpf:
  hostLegacyRouting: false
  masquerade: true
cluster:
  id: 1
  name: dev-cluster
cni:
  chainingMode: none
  exclusive: false
devices: k8s0
extraConfig:
  api-rate-limit: endpoint-create=rate-limit:5/s
global:
  clusterCIDR: 172.16.0.0/18
  clusterCIDRv4: 172.16.0.0/18
  clusterDNS: 172.16.64.10
  clusterDomain: cluster.local
  rke2DataDir: /data/rancher/rke2
  serviceCIDR: 172.16.64.0/18
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
k8sServiceHost: 10.30.80.250
k8sServicePort: "6443"
kubeProxyReplacement: strict
loadBalancer:
  acceleration: disabled
  dsrDispatch: geneve
  mode: dsr
nodePort:
  enabled: true
  range: 30000,33000
tunnelProtocol: geneve
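
For completeness, the rendered agent configuration can be cross-checked with cilium-cli (a sketch; key names such as bpf-lb-mode and bpf-lb-dsr-dispatch can vary between versions):

# cilium config view | grep -E 'bpf-lb|tunnel'

On this setup that should report DSR mode with geneve dispatch and the geneve tunnel protocol.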

Cilium Version

Client: 1.14.2 a6748946 2023-09-09T20:59:33+00:00 go version go1.20.8 linux/amd64
Daemon: 1.14.2 a6748946 2023-09-09T20:59:33+00:00 go version go1.20.8 linux/amd64

Kernel Version

Linux zl-dev-k8s-worker-10-30-80-157 5.14.0-162.6.1.el9_1.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 18 02:06:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.10+rke2r2", GitCommit:"b8609d4dd75c5d6fba4a5eaa63a5507cb39a6e99", GitTreeState:"clean", BuildDate:"2023-11-02T16:18:02Z", GoVersion:"go1.20.10 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

No response

squeed commented 1 month ago

So, you've run into one of the rough edges of NetworkPolicy and services (not just Cilium, BTW). The behavior you're experiencing is probably correct, if totally unexpected.

You are connecting from pod A to pod B via a NodePort service. This means you do not connect to pod B, but to node 1's IP address:

graph LR

    A[pod A]
    B[pod B]
    1[node 1]

    A --src A, dst 1 --> 1 -- src 1, dst B --> B

So, node 1 is doing the service translation in this case, and because it defers the routing decision until after service translation, it can treat this as a Pod-to-Pod flow and preserve the source identity. This is the same as if you were to connect to a ClusterIP service.
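
One way to see the difference directly (a sketch using the standard monitor filter; node IPs from this report) is to watch only policy verdicts on the backend node while connecting through the local and then a remote NodePort:

# on 10.30.80.140, the node hosting the pod
# cilium monitor --type policy-verdict

The local-node NodePort connection should log remote ID 16777221 (the CIDR identity) with action allow; the remote-node NodePort connection logs remote ID world with action deny, as in the report above.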

However, when you connect to the NodePort service on another node, the flow is different:

graph LR

    A[pod A]
    B[pod B]
    1[node 1]
    2[node 2]

    A --src A, dst 2 --> 1 -- nat! src 1, dst 2 --> 2 -- src 2, dst B -->B

Because you are connecting to node 2, the traffic needs to exit node 1, which means it is NATted. That means the source IP is that of node 1.

The fix

There are two potential fixes:

  1. Allow access from the host and remote-node entities in your policy (see the sketch after this list).
  2. Connect to a ClusterIP, not a NodePort.
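
A minimal sketch of option 1, extending the ingress rules from the report (host and remote-node are standard Cilium entities; everything else is unchanged):

endpointSelector:
  matchLabels:
    workload-selector: netpol2-nginx
ingress:
  - fromCIDRSet:
      - cidr: 10.164.33.200/32
    toPorts:
      - ports:
          - port: "443"
  - fromEntities:
      - cluster
      - host
      - remote-node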

Can you try these and see if they fix your problem?

zzbmmbzz commented 1 month ago

> [...] There are two potential fixes:
>
>   1. Allow access from the host and remote-node entities in your policy.
>   2. Connect to a ClusterIP, not a NodePort.

I connect from outside the cluster, not inside, so connecting to a ClusterIP is not possible. As for allowing access from the host and remote-node entities: the cluster entity already includes host and remote-node. From the Cilium docs:

> Cluster is the logical group of all network endpoints inside of the local cluster. This includes all Cilium-managed endpoints of the local cluster, unmanaged endpoints in the local cluster, as well as the host, remote-node, and init identities.

squeed commented 1 month ago

Oh, my apologies, I didn't realize the connection was external to the cluster (even though you said it in the title). That does make it a bit more interesting.

I suspect we need to look up the source identity again for policy, rather than trusting the identity from the GENEVE headers. I'll ask for a bit more info.

squeed commented 1 month ago

Aha, after chatting with the magnificent @networkop, he observed that this was fixed in v1.15 by #29155.

squeed commented 1 month ago

I note you are on quite an old Cilium version; please consider upgrading. The fix was backported to v1.14.
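
For reference, an upgrade via the standard Helm chart could look like this (a sketch; substitute whichever v1.14/v1.15 patch release contains the backport for 1.15.0 below):

# helm repo update
# helm -n kube-system upgrade cilium cilium/cilium --reuse-values --version 1.15.0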

zzbmmbzz commented 1 month ago

Many thanks, @squeed. My issue was fixed after upgrading to v1.15.