
NodePort Services do not work with new Node addresses #24481

Open aojea opened 1 year ago

aojea commented 1 year ago

### Is there an existing issue for this?

### What happened?

A Service of type NodePort exposes the backend Pods via a NodePort on the node addresses, forwarding any traffic that arrives at NodeAddresses:NodePort to the backends.

The key here is the concept of node addresses. After a quick skim of the code, it seems that Cilium only considers the addresses associated with the Service (ExternalIP, LoadBalancerIP) or with the Node object (Node.Status InternalIP or ExternalIP), but there are situations where a node acquires new IPs that are not reflected in Kubernetes. Advanced Kubernetes networking users tend to have more complex setups with multiple NICs, and they need to receive NodePort traffic on all of the NICs, or just on a subset of them.

Kube-proxy has a flag that allows users to define the ranges of addresses to use for NodePorts. It defaults to all local addresses, and every time a new address is added to the host it is considered for NodePort as long as it matches the configured range:

--nodeport-addresses strings
    A string slice of values which specify the addresses to use for NodePorts. Values may be valid IP blocks (e.g. 1.2.3.0/24, 1.2.3.4/32). The default empty string slice ([]) means to use all local addresses. This parameter is ignored if a config file is specified by --config.
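
For comparison, restricting NodePorts to specific ranges in kube-proxy might look roughly like this (the CIDRs here are only illustrative):

    kube-proxy --nodeport-addresses=192.168.8.0/24,10.0.0.0/8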

I would expect Cilium to behave the same as kube-proxy, so that new addresses on the nodes can be used to receive NodePort traffic, and to offer the ability to define which ranges of local addresses should be allowed.

### How to repro

You can reproduce this on GKE: create a NodePort Service and manually create a GCE load balancer. GCE load balancers add the LB frontend IP to the loopback interface of all the nodes and then route traffic to that IP through the nodes. The observation is that all health checks directed at the GCE LB IP and the NodePort are dropped; if my intuition is correct, this new IP is not considered by Cilium, hence it does not forward the traffic sent to it.

However, it is really easy to reproduce with kind:

  1. Create a kind cluster with cilium

    $ KUBEPROXY_MODE=none contrib/scripts/kind.sh
    $ make kind-image
    $ cilium install --wait --chart-directory=install/kubernetes/cilium \
            --helm-set=image.useDigest=false \
            --helm-set=clustermesh.apiserver.image.useDigest=false \
            --helm-set-string=kubeProxyReplacement=strict \
            --helm-set=sessionAffinity=true \
            --rollback=false \
            --config monitor-aggregation=none \
            --version=
  2. Create a nodeport service with working backends

    cat > nodeport.yaml << 'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nodeport-deployment
      labels:
        app: MyApp
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: MyApp
      template:
        metadata:
          labels:
            app: MyApp
        spec:
          containers:
          - name: agnhost
            image: k8s.gcr.io/e2e-test-images/agnhost:2.40
            args:
              - netexec
              - --http-port=8080
            ports:
            - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nodeport-service
    spec:
      type: NodePort
      selector:
        app: MyApp
      ports:
      - protocol: TCP
        port: 80
        targetPort: 8080
    EOF
    kubectl apply -f nodeport.yaml
  3. Verify that the Service is working

    $ kubectl get pods -o wide
    NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
    nodeport-deployment-5fb57b8c47-5b9m9   1/1     Running   0          94s   10.244.0.38   kind-control-plane   <none>           <none>
    $ kubectl get service
    NAME               TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
    kubernetes         ClusterIP   10.96.0.1     <none>        443/TCP        10m
    nodeport-service   NodePort    10.96.99.30   <none>        80:31666/TCP   99s
    $ kubectl get nodes -o wide
    NAME                 STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION           CONTAINER-RUNTIME
    kind-control-plane   Ready    control-plane   10m   v1.26.0   192.168.8.7   <none>        Ubuntu 22.04.1 LTS   5.19.11-1rodete1-amd64   containerd://1.6.12
    kind-worker          Ready    <none>          10m   v1.26.0   192.168.8.2   <none>        Ubuntu 22.04.1 LTS   5.19.11-1rodete1-amd64   containerd://1.6.12
    $ curl  192.168.8.2:31666
    NOW: 2023-03-20 23:20:31.030389188 +0000 UTC m=+46.956877549
  4. Add a new IP to the eth0 interface; curl to the new address now fails (a debugging sketch follows this list)

    docker exec kind-control-plane ip addr add 192.168.8.99/24 dev eth0

    $ curl 192.168.8.99:31666/hostname
    curl: (7) Failed to connect to 192.168.8.99 port 31666 after 0 ms: Couldn't connect to server
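
One way to confirm that the new address was never turned into a NodePort frontend is to list the services Cilium has programmed into the datapath (the agent pod name below is a placeholder):

    $ kubectl -n kube-system exec <cilium-agent-pod> -- cilium service list
    # expected: NodePort frontends for 192.168.8.2:31666 and 192.168.8.7:31666,
    # but no entry for the newly added 192.168.8.99:31666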


### Cilium Version

crictl exec -it 87162a61184cb cilium version
Client: 1.13.90 67134ed20d 2023-03-18T17:21:44+00:00 go version go1.20.2 linux/amd64
Daemon: 1.13.90 67134ed20d 2023-03-18T17:21:44+00:00 go version go1.20.2 linux/amd64

### Kernel Version

Linux kind-control-plane 5.19.11-1

### Kubernetes Version

"v1.26.0"

### Sysdump

_No response_

### Relevant log output

_No response_

### Anything else?

Happy to send a PR with some guidance, or to review if someone wants to take it :)

### Code of Conduct

- [X] I agree to follow this project's Code of Conduct
brb commented 1 year ago

Yep, we don't support --nodeport-addresses - https://github.com/cilium/cilium/issues/12508.

Anyway, I'd expect the device runtime detection to pick up the new IP addr, but I think it can react only to new ifaces (cc @joamaki)?

One fix would be to extend the NodePort service lookup by adding an additional lookup in a map with node local addrs, but I think there were some performance concerns (cc @borkmann @aspsk).

aojea commented 1 year ago

> One fix would be to extend the NodePort service lookup by adding an additional lookup in a map with node local addrs, but I think there were some performance concerns

Yeah, getting the addresses can be done à la kube-proxy, see https://github.com/kubernetes/kubernetes/pull/103116#pullrequestreview-694419686. As you may know, kube-proxy reconciles every X seconds, and that is when it picks up the new addresses ... or there is the netlink notify mechanism https://github.com/vishvananda/netlink/blob/v1.1.0/addr_linux.go#L301 ... I don't know exactly how Cilium would need to cascade this information to the Service.

I don't think the time between an address being added to the system and the Service being updated is critical; the issue is that new addresses are never added at all ... and adding addresses to loopback is a common practice.
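
A minimal sketch of the netlink notification approach linked above, using the vishvananda/netlink package (the log statements stand in for whatever Cilium would actually do to recompute its NodePort frontends):

    package main

    import (
        "log"

        "github.com/vishvananda/netlink"
    )

    func main() {
        // Subscribe to address add/delete events for all interfaces on the host.
        updates := make(chan netlink.AddrUpdate)
        done := make(chan struct{})
        defer close(done)

        if err := netlink.AddrSubscribe(updates, done); err != nil {
            log.Fatalf("subscribing to address updates: %v", err)
        }

        for u := range updates {
            if u.NewAddr {
                // A new address appeared; this is where NodePort frontends could
                // be recomputed (subject to a --nodeport-addresses style filter).
                log.Printf("address added on ifindex %d: %s", u.LinkIndex, u.LinkAddress.IP)
            } else {
                log.Printf("address removed on ifindex %d: %s", u.LinkIndex, u.LinkAddress.IP)
            }
        }
    }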

aojea commented 1 year ago

> Anyway, I'd expect the device runtime detection to pick up the new IP addr, but I think it can react only to new ifaces (cc @joamaki)?

Agree, especially since this works after restarting the cilium agent; it would be a very bad experience if users had to restart the cilium-agent every time they want to add a new IP to the nodes.

joamaki commented 1 year ago

Yep, currently we only recompute the NodePort frontends when new devices are added or removed (when --enable-runtime-device-detection is on). We're working on refactoring the service load-balancing control-plane, and supporting changing IP addresses will be fairly easy to do as part of this work. I can't yet give good estimates on when this will land, as it's a fairly long list of changes we need to churn through.

aojea commented 1 year ago

> I can't yet give good estimates on when this would land as it's a fairly long list of changes we need to churn through.

Knowing that it is on your radar is a good thing; if you have workable items to speed this up, please let me know.

aojea commented 1 year ago

@joamaki can we close this as fixed by https://github.com/cilium/cilium/pull/24677?

joamaki commented 1 year ago

> @joamaki can we close this as fixed by #24677?

Oops, forgot to reply to this. No, we can't close it yet. #24677 added the facilities to react to these changes, and in addition to that we'll need to change a bunch of places to deal with 1) changing devices and addresses, 2) allowing more than one IP per device for NodePort (which might break some use-cases). Working on these changes right now.

zzjin commented 9 months ago

Same issue here. Any updates?

We want to use a GCE load balancer to expose NodePort apps (e.g. MySQL/PG TCP ports) for external access, but since GCE injects a local route instead of adding a new iface/address, we cannot access any NodePort ports via the load balancer's IP.

joamaki commented 7 months ago

@zzjin We are aiming to have this solved in v1.16. For earlier releases you can use the experimental "enableRuntimeDeviceDetection" option. Be aware that not all features support it (for example bandwidth management).
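
For reference, enabling that option via Helm might look roughly like this (release name, chart reference, and namespace are assumptions; adjust to your installation):

    helm upgrade cilium cilium/cilium -n kube-system \
      --reuse-values \
      --set enableRuntimeDeviceDetection=true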

zzjin commented 7 months ago

@joamaki Hi, tested with cilium 1.16.0-pre.1 and enableRuntimeDeviceDetection: true, but it still does not work.

It seems that GCE assigns the LB IP via the local IP route table, which means there is still one iface with only one IP, plus one more local route bound to it.

ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:92:00:04 brd ff:ff:ff:ff:ff:ff
    altname enp0s4
    inet 10.146.0.4/32 metric 100 scope global dynamic ens4
       valid_lft 3580sec preferred_lft 3580sec
    inet6 fe80::4001:aff:fe92:4/64 scope link
       valid_lft forever preferred_lft forever
3: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default qlen 1000
    link/ether ba:16:e6:26:59:89 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b816:e6ff:fe26:5989/64 scope link
       valid_lft forever preferred_lft forever
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default qlen 1000
    link/ether 12:e5:47:3f:62:ad brd ff:ff:ff:ff:ff:ff
    inet 100.64.18.136/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::10e5:47ff:fe3f:62ad/64 scope link
       valid_lft forever preferred_lft forever
5: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ea:02:80:3f:74:59 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e802:80ff:fe3f:7459/64 scope link
       valid_lft forever preferred_lft forever

ip route show table local

local 10.146.0.4 dev ens4 proto kernel scope host src 10.146.0.4
local 34.84.193.26 dev ens4 proto 66 scope host
local 100.64.18.136 dev cilium_host proto kernel scope host src 100.64.18.136

I can access the NodePort via 10.146.0.4 but not via 34.84.193.26.
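
For anyone trying to reproduce this outside GCE (for example in the kind setup from the issue description), the same condition can presumably be emulated by adding the IP only as a host-scoped route in the local table rather than as a device address (the address is just an example):

    docker exec kind-control-plane ip route add local 192.168.8.99/32 dev eth0 table local
    curl 192.168.8.99:31666    # expected to fail, since no device actually carries the address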

joamaki commented 7 months ago

@zzjin thanks a lot for testing! We currently choose the NodePort frontend IP addresses from the IPs assigned to the devices, and by default only one IPv4 and one IPv6 address per device is used (private addresses preferred). This logic can now be changed with the --nodeport-addresses option (soon also a Helm value: https://github.com/cilium/cilium/pull/31672). If the address only appears in the routing table, then it is not seen by Cilium.

@aojea mentioned in the description of this issue that the LB addresses should be assigned to the loopback device, in which case Cilium can be made to include them by specifying the devices explicitly (it needs this fix though: https://github.com/cilium/cilium/pull/31200). In the ip addr listing above, however, lo doesn't seem to have the address assigned. Is this unexpected?

If the IP is only added as a route then we perhaps need to make pkg/datapath/tables/node_address.go use the routing table as the source for node addresses instead of the device addresses...
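
A rough sketch of what deriving those addresses from the local routing table could look like, using the vishvananda/netlink package (this only illustrates the data source, not the actual node_address.go change):

    package main

    import (
        "fmt"
        "log"

        "github.com/vishvananda/netlink"
        "golang.org/x/sys/unix"
    )

    func main() {
        // Host-scoped local routes (ip route show table local) are where GCE
        // places the load-balancer IP, even though no device carries the address.
        filter := &netlink.Route{
            Table: unix.RT_TABLE_LOCAL,
            Type:  unix.RTN_LOCAL,
        }
        routes, err := netlink.RouteListFiltered(netlink.FAMILY_V4, filter,
            netlink.RT_FILTER_TABLE|netlink.RT_FILTER_TYPE)
        if err != nil {
            log.Fatalf("listing local routes: %v", err)
        }
        for _, r := range routes {
            if r.Dst != nil {
                // Each destination is a candidate NodePort frontend address,
                // e.g. 34.84.193.26 from the listing above.
                fmt.Printf("candidate node address: %s (ifindex %d)\n", r.Dst.IP, r.LinkIndex)
            }
        }
    }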

zzjin commented 7 months ago

@joamaki Manually setting --nodeport-addresses in the configmap and restarting cilium still does not work.

joamaki commented 6 months ago

Verified this myself now: the GCE load-balancer IP is indeed added only as a route in the local table. The solution, I think, is as I said above: we need to derive the addresses for NodePort use from the routing table and not from the device addresses. Effectively the change would be to switch pkg/datapath/tables/node_address.go to look at both Table[Device] (for selecting which devices we care about) and Table[Route] for the local host-scoped routes. We'll see if we can also land this change for v1.16, but can't promise it. (EDIT: It did not make it.)

I am a bit puzzled by the kube-proxy code, which does net.InterfaceAddrs() and therefore only interrogates the addresses. Do I understand correctly that kube-proxy doesn't support this either? Or does it work regardless of its nodeport address logic not picking up on it?