CC @brb
@networkop Thanks for the issue. Two things: could you run cilium bpf ipcache get 10.0.255.7 from the node which runs the client pod, and could you provide a sysdump (https://docs.cilium.io/en/v1.8/troubleshooting/#automatic-log-state-collection)?
sure, here's the bpf ipcache output first
$ cilium bpf ipcache get 10.0.255.7
10.0.255.7 maps to identity 6 0 0.0.0.0
@brb where can I upload the sysdump?
oh wow, didn't know about GH's drag-and-drop cilium-sysdump-20200827-160052.zip
@networkop Could you try without tunneling? You can do that by setting auto-direct-node-routes: true and tunnel: disabled in the ConfigMap and then restarting all cilium-agent pods. Also, you need to make sure that the underlying Azure network won't drop packets from/to a subnet it doesn't know about (i.e. the Pod CIDR).
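For reference, that corresponds roughly to the following entries in the cilium-config ConfigMap (just a sketch, assuming the stock Cilium 1.8 ConfigMap layout; all values are strings):

data:
  # switch from vxlan tunneling to native routing
  tunnel: "disabled"
  # have each agent install routes to the other nodes' Pod CIDRs
  auto-direct-node-routes: "true"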
ok, I'll try. do you want me to collect any logs or just verify that it works?
Thanks. Just to verify that it works.
I've done the test; however, I still don't have e2e connectivity. In addition to changing the above flags, I also had to specify native-routing-cidr to stop the agent from crashing; I set it to the same value as the Pod CIDR (in my case 10.0.0.0/18).
Additionally, in Azure I enabled "IP forwarding" on the NIC of the worker node I was testing, to stop it from dropping unknown packets.
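For reference, the cilium-config entries I ended up with look roughly like this (a sketch; the CIDR below is just my environment's Pod CIDR):

data:
  tunnel: "disabled"
  auto-direct-node-routes: "true"
  # needed to stop the agent from crashing once tunneling is disabled
  native-routing-cidr: "10.0.0.0/18"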
The only new thing I've noticed is that the tcpdump now looks a bit more like what you expected in the beginning:
15:14:57.811789 IP 10.0.6.173.50440 > 10.0.255.10.6443: Flags [S], seq 3237958959, win 65280, options [mss 1360,sackOK,TS val 2709627184 ecr 0,nop,wscale 7], length 0
15:14:59.162251 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [S], seq 2787558301, win 64240, options [mss 1460,sackOK,TS val 560436112 ecr 0,nop,wscale 7], length 0
15:14:59.163323 IP 10.0.255.10.6443 > 10.0.255.4.54334: Flags [S.], seq 4113293485, ack 2787558302, win 65160, options [mss 1418,sackOK,TS val 2268648707 ecr 560436112,nop,wscale 7], length 0
15:14:59.163377 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [.], ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 0
15:14:59.163417 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 517
So it looks like there is a SNAT'ed packet; however, I can't understand why the translated session keeps sending packets when I don't even see a SYN-ACK inside the pod. So maybe it's a red herring.
Hmm. Would it be possible to get access to your Azure cluster? (I'm martynas on Cilium's public Slack.)
Same bug here; marking it to follow.
@Mengkzhaoyun What is your setup and configuration?
I deployed cilium-dev:v1.9.0-rc0 with Kubernetes v1.18.8 in kube-proxy-free mode (3-master HA with kube-vip). When testing the deployment on three Ubuntu 20.04 Hyper-V VMs on Windows 10, I found that my kubernetes-dashboard pod cannot access https://kube-apiserver:6443 (the apiserver service endpoint); the connection times out.
From a shell inside the dashboard pod, I cannot curl the apiserver on the other host, but from a shell on the pod's VM host, curl https://kube-apiserver.othervm.local:6443/version succeeds.
After changing enable-bpf-masquerade to false, my kubernetes-dashboard pod works correctly. Here is my config:
{
"auto-direct-node-routes": "false",
"bpf-lb-map-max": "65536",
"bpf-map-dynamic-size-ratio": "0.0025",
"bpf-policy-map-max": "16384",
"cluster-name": "default",
"cluster-pool-ipv4-cidr": "10.0.0.0/11",
"cluster-pool-ipv4-mask-size": "24",
"debug": "false",
"disable-cnp-status-updates": "true",
"disable-envoy-version-check": "true",
"enable-auto-protect-node-port-range": "true",
"enable-bpf-clock-probe": "true",
"enable-bpf-masquerade": "false",
"enable-endpoint-health-checking": "true",
"enable-external-ips": "true",
"enable-health-check-nodeport": "true",
"enable-host-port": "true",
"enable-host-reachable-services": "true",
"enable-hubble": "true",
"enable-ipv4": "true",
"enable-ipv6": "false",
"enable-node-port": "true",
"enable-remote-node-identity": "true",
"enable-session-affinity": "true",
"enable-well-known-identities": "false",
"enable-xt-socket-fallback": "true",
"hubble-listen-address": ":4244",
"hubble-socket-path": "/var/run/cilium/hubble.sock",
"identity-allocation-mode": "crd",
"install-iptables-rules": "true",
"ipam": "cluster-pool",
"k8s-require-ipv4-pod-cidr": "true",
"k8s-require-ipv6-pod-cidr": "false",
"kube-proxy-replacement": "partial",
"masquerade": "true",
"monitor-aggregation": "medium",
"monitor-aggregation-flags": "all",
"monitor-aggregation-interval": "5s",
"node-port-bind-protection": "true",
"node-port-range": "20,65000",
"operator-api-serve-addr": "127.0.0.1:9234",
"preallocate-bpf-maps": "false",
"sidecar-istio-proxy-image": "cilium/istio_proxy",
"tunnel": "vxlan",
"wait-bpf-mount": "false"
}
@Mengkzhaoyun Thanks. Do you run cilium on the master nodes?
Yes, all 3 master nodes run the cilium agent.
@Mengkzhaoyun Can you please ping me (martynas) on Cilium's public Slack?
@brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.
@networkop @Mengkzhaoyun OK, so what happens in your case is that the wrong IPv4 addr is picked for SNAT, because the devices selected for NodePort have more than one IPv4 addr. Currently, cilium-agent can support only one per bpf_host.o instance, and it does not check which IP addr is used to communicate between k8s nodes (aka the k8s Node IP). The fix would be to prefer the k8s Node IP when selecting that single IP addr.
If any of you wants to work on the fix (a good opportunity to familiarize with Cilium internals), I'm happy to guide you.
I'm happy to pick it up @brb
The fix might be simple: in the function https://github.com/cilium/cilium/blob/master/pkg/node/address.go#L128 you need to pass the k8s Node IPv{4,6} addr to the invocations of firstGlobalV{4,6}Addr(...) as the second parameter.
Adding the CILIUM_IPV4_NODE env var to the cilium-agent container can temporarily work around the bug:
env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: CILIUM_IPV4_NODE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
@Mengkzhaoyun that makes my deployment crash, as the nodeName is a hostname... are you aware of another property that yields the IP definitively?
level=fatal msg="Invalid IPv4 node address" ipAddr=<node hostname> subsys=daemon
Here's a better way, I suppose:
- name: CILIUM_IPV4_NODE
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.podIP
Didn't help my scenario so I must be hitting a different bug altogether.
JFYI: CILIUM_IPV4_NODE doesn't have anything to do with fixing the issue.
> @brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.

As per the offline discussion, it's not related.
Here's a brief summary of the issues I was hitting:
The main issue was caused by the fact that one of the k8s nodes had multiple outgoing interfaces. Specifically, the interface used to communicate with the affected nodes was a WireGuard interface, which doesn't have an L2 (MAC) address and won't be supported until #12317 is resolved.
The secondary issue was that the cilium agent picks the wrong SNAT IP for traffic leaving the primary (non-wg) interface. This interface is a bond with one public and one private IP, and Cilium was picking the public IP, ignoring the fact that the k8s node had its InternalIP set to the private IP. The following patch fixed this issue for me:
diff --git a/pkg/node/address.go b/pkg/node/address.go
index a5fa3dadd..7c438a9d8 100644
--- a/pkg/node/address.go
+++ b/pkg/node/address.go
@@ -129,10 +129,14 @@ func InitNodePortAddrs(devices []string) error {
     if option.Config.EnableIPv4 {
         ipv4NodePortAddrs = make(map[string]net.IP, len(devices))
         for _, device := range devices {
-            ip, err := firstGlobalV4Addr(device, nil)
+            ip, err := firstGlobalV4Addr(device, GetK8sNodeIP())
             if err != nil {
                 return fmt.Errorf("Failed to determine IPv4 of %s for NodePort", device)
             }
+            log.WithFields(logrus.Fields{
+                "device": device,
+                "ip":     ip,
+            }).Info("Pinning node IPs")
             ipv4NodePortAddrs[device] = ip
         }
     }
@@ -140,7 +144,7 @@
     if option.Config.EnableIPv6 {
         ipv6NodePortAddrs = make(map[string]net.IP, len(devices))
         for _, device := range devices {
-            ip, err := firstGlobalV6Addr(device, nil)
+            ip, err := firstGlobalV6Addr(device, GetK8sNodeIP())
             if err != nil {
                 return fmt.Errorf("Failed to determine IPv6 of %s for NodePort", device)
             }
diff --git a/pkg/node/address_linux.go b/pkg/node/address_linux.go
index a05a3f38d..bfa16fc73 100644
--- a/pkg/node/address_linux.go
+++ b/pkg/node/address_linux.go
@@ -86,7 +86,7 @@ retryScope:
     }
     if len(ipsPublic) != 0 {
-        if hasPreferred && ip.IsPublicAddr(preferredIP) {
+        if hasPreferred {
             return preferredIP, nil
         }
cilium monitor -t drop helped identify that. Thanks @brb for the support; let me know if you want me to open a PR with the above fix, or if it's OK to just close this issue for now.
@networkop, good job, it works.
@networkop can you push a PR with that fix? that would be great!
PR is ready for review :+1:
Bug report
When Cilium is configured as a kube-proxy replacement, it fails to masquerade the source IP of pods when the target pod is in the hostNetwork of one of the k8s nodes.

General Information
How to reproduce the issue
Additional details
I have a test pod (10.0.6.223) trying to connect to the API server on 10.0.128.1:443. When doing a tcpdump on the underlying node, I cannot see SNAT'ed packets. Since the source IP of the pod is not masqueraded, it gets dropped by Azure's networking stack.
According to @brb's comment, it should have been translated at TC egress; however, I don't see that happening (see the tcpdump above).
As soon as I change enable-bpf-masquerade to "false" and restart the agent on the node, connectivity gets restored. I've got the environment up and running for a while, so I'm happy to collect any additional logs/outputs.
Here's the cilium configmap.
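For quick reference, the workaround described above boils down to this fragment of that ConfigMap (a sketch; only the relevant keys are shown), followed by restarting the cilium-agent pods:

data:
  # fall back to iptables-based masquerading
  enable-bpf-masquerade: "false"
  # keep masquerading itself enabled
  masquerade: "true"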