canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

microk8s cross node communication not working #3133

Open RobinJespersen opened 2 years ago

RobinJespersen commented 2 years ago

My service/pod is only reachable from the node it is running on.


my setup

I have three fresh and identical Ubuntu 20.04.4 LTS servers, each with its own public IP address.

I installed microk8s on all nodes by running: sudo snap install microk8s --classic

On the master node I executed microk8s add-node and joined the two other servers by executing microk8s join XXX.XXX.X.XXX:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05

After that, kubectl get no shows all three nodes with status Ready, and kubectl get all --all-namespaces shows

NAMESPACE     NAME                                          READY   STATUS    RESTARTS      AGE
kube-system   pod/calico-node-hwsvj                         1/1     Running   1 (63m ago)   72m
kube-system   pod/calico-node-zd6rc                         1/1     Running   1 (62m ago)   71m
kube-system   pod/calico-node-djkmk                         1/1     Running   1 (62m ago)   72m
kube-system   pod/calico-kube-controllers-dc44f6cdf-flj54   1/1     Running   2 (62m ago)   74m

NAMESPACE   NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
default     service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   75m

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   3         3         3       3            3           kubernetes.io/os=linux   75m

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           75m

NAMESPACE     NAME                                                DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/calico-kube-controllers-dc44f6cdf   1         1         1       74m

wget --no-check-certificate https://10.152.183.1/ executed on any of the nodes always returns

WARNING: cannot verify 10.152.183.1's certificate, issued by ‘CN=10.152.183.1’:
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.

So far everything works as expected.


problem 1

I get the IP of calico-kube-controllers by calling kubectl describe -n=kube-system pod/calico-kube-controllers-dc44f6cdf-flj54

And executing wget https://10.1.50.194/ on the "master" node returns

Connecting to 10.1.50.194:443... failed: Connection refused.

and on the two other nodes

Connecting to 10.1.50.194:80... failed: Connection timed out.

To my understanding, the pod's IP should be reachable from all nodes. Is that correct?
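
The two address ranges in play here can be told apart by CIDR; a minimal sketch, assuming the MicroK8s defaults of 10.1.0.0/16 for pods and 10.152.183.0/24 for services (both visible in the outputs above):

```python
import ipaddress

# Assumed MicroK8s defaults; verify against your own cluster configuration.
POD_CIDR = ipaddress.ip_network("10.1.0.0/16")
SERVICE_CIDR = ipaddress.ip_network("10.152.183.0/24")

def classify(ip: str) -> str:
    """Return whether an address is a pod IP, a service ClusterIP, or neither."""
    addr = ipaddress.ip_address(ip)
    if addr in SERVICE_CIDR:
        return "service"
    if addr in POD_CIDR:
        return "pod"
    return "other"

print(classify("10.152.183.1"))  # the kubernetes ClusterIP -> "service"
print(classify("10.1.50.194"))   # the calico-kube-controllers pod -> "pod"
```

Service ClusterIPs are virtual (answered by NAT rules on each node), while pod IPs must be routed across nodes by the CNI, so the two ranges can fail independently.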


problem 2

I created the following deployment and service by running

kubectl apply -f ./deployment.yaml
kubectl apply -f ./service.yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test-deployment
  name: test-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-deployment
  template:
    metadata:
      labels:
        app: test-deployment
    spec:
      containers:
      - image: dontrebootme/microbot:v1
        imagePullPolicy: IfNotPresent
        name: microbot
        resources: {}
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
# service.yaml
apiVersion: v1 
kind: Service 
metadata:
  name: test-service 
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: test-deployment
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80

kubectl get all --all-namespaces

NAMESPACE     NAME                                          READY   STATUS    RESTARTS      AGE
kube-system   pod/calico-node-hwsvj                         1/1     Running   1 (91m ago)   101m
kube-system   pod/calico-node-zd6rc                         1/1     Running   1 (91m ago)   100m
kube-system   pod/calico-node-djkmk                         1/1     Running   1 (91m ago)   101m
kube-system   pod/calico-kube-controllers-dc44f6cdf-flj54   1/1     Running   2 (91m ago)   103m
default       pod/test-deployment-5899c5ff7d-d442g          1/1     Running   0             59s

NAMESPACE   NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
default     service/kubernetes     ClusterIP   10.152.183.1     <none>        443/TCP   103m
default     service/test-service   ClusterIP   10.152.183.247   <none>        80/TCP    31s

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   3         3         3       3            3           kubernetes.io/os=linux   103m

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           103m
default       deployment.apps/test-deployment           1/1     1            1           59s

NAMESPACE     NAME                                                DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/calico-kube-controllers-dc44f6cdf   1         1         1       103m
default       replicaset.apps/test-deployment-5899c5ff7d          1         1         1       59s

Calling wget http://10.152.183.247/ on all three nodes returns twice

--2022-05-06 10:34:04--  http://10.152.183.247/
Connecting to 10.152.183.247:80... failed: Connection timed out.
Retrying.

and once

<!DOCTYPE html>
<html>
  <style type="text/css">
    .centered
      {
      text-align:center;
      margin-top:0px;
      margin-bottom:0px;
      padding:0px;
      }
  </style>
  <body>
    <p class="centered"><img src="microbot.png" alt="microbot"/></p>
    <p class="centered">Container hostname: test-deployment-5899c5ff7d-d442g</p>
  </body>
</html>

To my understanding, the service should be reachable from all nodes. Calling wget on the IP of the pod itself shows exactly the same behavior.


workaround

Adding hostNetwork: true to the deployment makes the service reachable from all nodes, but that seems to be the wrong way of doing it.
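
For reference, the workaround amounts to one extra field in the deployment's pod template; hostNetwork makes the pod share the node's network namespace, bypassing the Calico overlay entirely, which is why it masks the problem rather than fixing it:

```yaml
# deployment.yaml (fragment); hostNetwork bypasses the CNI overlay
spec:
  template:
    spec:
      hostNetwork: true
      containers:
      - image: dontrebootme/microbot:v1
        name: microbot
```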


Does anyone have an idea how I can debug this? I am out of ideas.

RobinJespersen commented 2 years ago

sudo iptables -t nat -nL |grep "10\.152\.183\." returns on all nodes

KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.152.183.1         /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-B62C23KNXVA7TMZN  tcp  --  0.0.0.0/0            10.152.183.247       /* default/test-service:http cluster IP */ tcp dpt:80
KUBE-MARK-MASQ  tcp  -- !10.1.0.0/16          10.152.183.247       /* default/test-service:http cluster IP */ tcp dpt:80
KUBE-MARK-MASQ  tcp  -- !10.1.0.0/16          10.152.183.1         /* default/kubernetes:https cluster IP */ tcp dpt:443
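
The rules above indicate that kube-proxy has programmed the service NAT entries on every node, so the ClusterIP translation itself looks healthy. As an illustration only (a regex sketch, not a robust iptables parser), the ClusterIP-to-service mapping can be pulled out of output like this:

```python
import re

# Sample lines copied from the iptables -t nat output above.
SAMPLE = """\
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.152.183.1         /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-B62C23KNXVA7TMZN  tcp  --  0.0.0.0/0            10.152.183.247       /* default/test-service:http cluster IP */ tcp dpt:80
"""

# Capture the destination ClusterIP, the service name in the comment, and the port.
RULE_RE = re.compile(
    r"^KUBE-SVC-\S+\s+tcp\s+--\s+\S+\s+(?P<ip>[\d.]+)\s+"
    r"/\* (?P<svc>\S+) cluster IP \*/ tcp dpt:(?P<port>\d+)"
)

def cluster_ips(iptables_output: str) -> dict:
    """Map 'namespace/service:port-name' -> (ClusterIP, port) from KUBE-SVC rules."""
    services = {}
    for line in iptables_output.splitlines():
        m = RULE_RE.match(line)
        if m:
            services[m.group("svc")] = (m.group("ip"), int(m.group("port")))
    return services

print(cluster_ips(SAMPLE))
```

Since these rules exist on all nodes, the DNAT to the pod IP happens locally; what fails here is the subsequent hop, delivering the packet to a pod IP on another node.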
RobinJespersen commented 2 years ago

microk8s status returns

microk8s is running
high-availability: yes
  datastore master nodes: XXX.XXX.XXX.XXX:19001 XXX.XXX.XXX.XXX:19001 XXX.XXX.XXX.XXX:19001
  datastore standby nodes: none
addons:
  enabled:
    ha-cluster           # Configure high availability on the current node
  disabled:
    ambassador           # Ambassador API Gateway and Ingress
    cilium               # SDN, fast with full network policy
    dashboard            # The Kubernetes dashboard
    dashboard-ingress    # Ingress definition for Kubernetes dashboard
    dns                  # CoreDNS
    fluentd              # Elasticsearch-Fluentd-Kibana logging and monitoring
    gpu                  # Automatic enablement of Nvidia CUDA
    helm                 # Helm 2 - the package manager for Kubernetes
    helm3                # Helm 3 - Kubernetes package manager
    host-access          # Allow Pods connecting to Host services smoothly
    inaccel              # Simplifying FPGA management in Kubernetes
    ingress              # Ingress controller for external access
    istio                # Core Istio service mesh services
    jaeger               # Kubernetes Jaeger operator with its simple config
    kata                 # Kata Containers is a secure runtime with lightweight VMS
    keda                 # Kubernetes-based Event Driven Autoscaling
    knative              # The Knative framework on Kubernetes.
    kubeflow             # Kubeflow for easy ML deployments
    linkerd              # Linkerd is a service mesh for Kubernetes and other frameworks
    metallb              # Loadbalancer for your Kubernetes cluster
    metrics-server       # K8s Metrics Server for API access to service metrics
    multus               # Multus CNI enables attaching multiple network interfaces to pods
    openebs              # OpenEBS is the open-source storage solution for Kubernetes
    openfaas             # OpenFaaS serverless framework
    portainer            # Portainer UI for your Kubernetes cluster
    prometheus           # Prometheus operator for monitoring and logging
    rbac                 # Role-Based Access Control for authorisation
    registry             # Private image registry exposed on localhost:32000
    storage              # Storage class; allocates storage from host directory
    traefik              # traefik Ingress controller for external access
balchua commented 2 years ago

There was a recent fix related to netfilter and Calico.

It is recommended to use a more specific channel, for example --channel=1.24/stable

RobinJespersen commented 2 years ago

Thx for the hint. I removed everything with sudo snap remove microk8s --purge and installed it with sudo snap install microk8s --classic --channel=1.24/stable and tried everything again, but still the same problem.

Last week I also tried 1.18/stable and had the same problem.

balchua commented 2 years ago

Did you add the nodes' hostnames to /etc/hosts on each node? I remember that I had to do that.

RobinJespersen commented 2 years ago

That should not be necessary, as the hostnames are publicly reachable DNS names.

I already tried adding the nodes like this microk8s join host.name:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05 but no success either.

balchua commented 2 years ago

Do you by any chance have 2 network interfaces?

RobinJespersen commented 2 years ago

ifconfig returns on the first node

cali8f14469af57: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 51413  bytes 4344562 (4.3 MB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 46028  bytes 34095652 (34.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

caliedc83d82522: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1440
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 11973  bytes 990374 (990.3 KB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 11068  bytes 5574873 (5.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
# public address 

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 7013747  bytes 2005794268 (2.0 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 7013747  bytes 2005794268 (2.0 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.1.50.192  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::640b:86ff:fea7:d83b  prefixlen 64  scopeid 0x20<link>
        ether 66:0b:86:a7:d8:3b  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 34  bytes 2040 (2.0 KB)
        TX errors 0  dropped 7 overruns 0  carrier 0  collisions 0

on the second

cali5d301cb26b2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 42  bytes 3562 (3.5 KB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 16  bytes 1440 (1.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cali80701c0b6cd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1440
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 53  bytes 4860 (4.8 KB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 25  bytes 2132 (2.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
# public address 

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 2050408  bytes 545622554 (545.6 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2050408  bytes 545622554 (545.6 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.1.230.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::6479:6ff:feb0:51a1  prefixlen 64  scopeid 0x20<link>
        ether 66:79:06:b0:51:a1  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 32  bytes 1920 (1.9 KB)
        TX errors 0  dropped 7 overruns 0  carrier 0  collisions 0

and on the third

ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
# public address 

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1442966  bytes 345062323 (345.0 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1442966  bytes 345062323 (345.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.1.179.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::640a:7ff:fe61:8a54  prefixlen 64  scopeid 0x20<link>
        ether 66:0a:07:61:8a:54  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 64  bytes 3840 (3.8 KB)
        TX errors 0  dropped 7 overruns 0  carrier 0  collisions 0
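
One detail worth noting in the ifconfig output: the vxlan.calico interfaces report MTU 1450, which is the physical 1500 minus the standard 50-byte VXLAN-over-IPv4 encapsulation overhead. Calico's VXLAN traffic travels as UDP on port 4789, so a host firewall or provider filter dropping that UDP port between the node IPs could produce exactly these cross-node symptoms. The arithmetic, with the standard header sizes:

```python
# VXLAN encapsulation overhead for an IPv4 underlay (standard header sizes).
OUTER_ETHERNET = 14  # outer Ethernet header
OUTER_IPV4 = 20      # outer IPv4 header
OUTER_UDP = 8        # outer UDP header (VXLAN runs over UDP, IANA port 4789)
VXLAN_HEADER = 8     # VXLAN header

OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER

def vxlan_mtu(underlay_mtu: int) -> int:
    """MTU available to the overlay interface on a given underlay link."""
    return underlay_mtu - OVERHEAD

print(vxlan_mtu(1500))  # -> 1450, matching the vxlan.calico interfaces above
```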
RobinJespersen commented 2 years ago

I purged the installation on node three and reinstalled microk8s and joined the cluster again. Now ifconfig shows

calie8f9a2c7112: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 210  bytes 23882 (23.8 KB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 213  bytes 104455 (104.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
# public address 

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1479365  bytes 371918667 (371.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1479365  bytes 371918667 (371.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.1.179.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::640a:7ff:fe61:8a54  prefixlen 64  scopeid 0x20<link>
        ether 66:0a:07:61:8a:54  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 64  bytes 3840 (3.8 KB)
        TX errors 0  dropped 7 overruns 0  carrier 0  collisions 0

But the problem still exists.


I just ran ifconfig again, and now the calie8f9a2c7112 interface is gone.

balchua commented 2 years ago

Excluding network interface lo, cali* and vxlan, you only have ens?

Does your hostname contain capital letters? In short, does it follow the rules here? https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names

All calico pods are stable? I don't know if this can help https://github.com/canonical/microk8s/issues/1554#issuecomment-691426908

RobinJespersen commented 2 years ago

Excluding network interface lo, cali* and vxlan, you only have ens?

yes

In short does it follow the rules here? https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names

yes

All calico pods are stable? I don't know if this can help https://github.com/canonical/microk8s/issues/1554#issuecomment-691426908

As far as I can tell, yes. At least they don't have any restarts.

In /var/snap/microk8s/current/args/cni-network/cni.yaml I have the entry

- name: IP_AUTODETECTION_METHOD
  value: "can-reach=XXX.XXX.XXX.XXX"

with XXX.XXX.XXX.XXX being the IP of node two on the first node and the IP of node one for the other two.
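
Calico's can-reach autodetection selects the local address the kernel would use when routing toward the given target. A rough illustration of that lookup (this is a sketch, not Calico's actual code; connect() on a UDP socket sends no packets and only fixes the source address):

```python
import socket

def detect_source_ip(target: str, port: int = 53) -> str:
    """Ask the kernel which local address it would use to reach `target`.

    Mimics a can-reach style autodetection: no traffic is sent, because
    connect() on a UDP datagram socket only performs the route lookup.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((target, port))
        return s.getsockname()[0]

# Against a loopback target the kernel picks the loopback address:
print(detect_source_ip("127.0.0.1"))  # -> 127.0.0.1
```

Running this with a peer node's IP as the target on each host shows which interface address Calico would announce for the overlay.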

usersina commented 2 years ago

I had a setup where all of my nodes (1 controller and 2 workers) were on the same private network. However, kubectl get nodes -o wide showed the public IP addresses in the Internal-IP column after the join operation. So I had to monkey-patch it, which solved my issue.

RobinJespersen commented 2 years ago

@usersina thanks for the hint, but it does not help :-(

My nodes are only on a public network, so I entered in both files the public IPs. Before and after kubectl get nodes -o wide shows the public IPs in the Internal-IP column and <none> in the External-IP column.

RobinJespersen commented 2 years ago

Meanwhile I have also replaced one node with a Debian 11 server, but the behavior is still exactly the same.

usersina commented 2 years ago

Do you by any chance have 2 network interfaces?

What should one do with two network interfaces? This still does not work for me, so I still have to patch the cluster after joining. The patching, however, almost always fails due to timeouts if DNS is enabled. Also note that patching before joining is not possible.

IDevJoe commented 2 years ago

Just reproduced this issue with Ubuntu 20.04 on arm64 on a clean install. It seems to affect just ClusterIP services; I was able to get LoadBalancers working. Retrying again tomorrow.

IDevJoe commented 2 years ago

Continuing to investigate: I did a packet capture on eth0 (my primary interface) to make sure that packets were getting sent. This was the result:

11:44:55.145179 IP 10.1.10.193.50184 > 10.1.187.66.domain: 36185+ A? ports.ubuntu.com.default.svc.cluster.local. (60)
11:44:55.145287 IP 10.1.10.193.50184 > 10.1.187.66.domain: 56907+ AAAA? ports.ubuntu.com.default.svc.cluster.local. (60)

The packets were never seen at the destination node.

IDevJoe commented 2 years ago

The route to get to the other node never gets added. Manually adding the route through ip route enables temporary communication. @balchua, any chance you could look into this further?

This is what the routing table looks like by default:

ubuntu@k81:~$ ip route
default via 10.0.0.1 dev eth0 proto static
10.0.0.0/27 dev eth0 proto kernel scope link src 10.0.0.6
blackhole 10.1.10.192/26 proto 80
10.1.10.193 dev califb3eb82ef50 scope link
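
The failure mode described here (no route for the remote node's pod block, so traffic falls through to the default route) can be made concrete with a longest-prefix-match lookup over the table above. A minimal sketch; the remote pod address 10.1.187.66 is taken from the earlier packet capture, and the remote block's route is deliberately absent, as observed:

```python
import ipaddress

# Routes from the `ip route` output above (next-hop detail omitted).
ROUTES = [
    "0.0.0.0/0",       # default via 10.0.0.1 dev eth0
    "10.0.0.0/27",     # dev eth0
    "10.1.10.192/26",  # blackhole (local pod block)
    "10.1.10.193/32",  # dev califb3eb82ef50
]

def best_route(dest: str) -> str:
    """Longest-prefix match: the most specific route containing `dest`."""
    addr = ipaddress.ip_address(dest)
    matches = [ipaddress.ip_network(r) for r in ROUTES
               if addr in ipaddress.ip_network(r)]
    return str(max(matches, key=lambda n: n.prefixlen))

print(best_route("10.1.10.193"))  # local pod -> its cali interface (/32)
print(best_route("10.1.187.66"))  # remote pod -> falls through to default
```

A packet for 10.1.187.66 therefore leaves via the default gateway instead of the VXLAN interface, which matches the packets never arriving at the destination node.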
briantilburgs commented 1 year ago

Looks like I have the same issue, but my routing table looks fully populated:

home-kube01:~$ ip route
default via 192.168.200.1 dev eth0 proto static 
blackhole 10.1.158.128/26 proto 80 
10.1.158.159 dev cali463d9a511a6 scope link 
10.1.158.161 dev cali2019a39bf40 scope link 
10.1.158.162 dev cali675ab5b64e3 scope link 
10.1.158.163 dev calic88c1e0b9f9 scope link 
10.1.158.164 dev cali8a7384016d7 scope link 
10.1.158.183 dev cali42a5ceceaa4 scope link 
10.1.158.184 dev calia223862cd7d scope link 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.200.0/24 dev eth0 proto kernel scope link src 192.168.200.231 
mel-florance commented 10 months ago

Excluding network interface lo, cali* and vxlan, you only have ens?

yes

In short does it follow the rules here? https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names

yes

All calico pods are stable? I don't know if this can help #1554 (comment)

As far as I can tell, yes. At least they don't have any restarts.

In /var/snap/microk8s/current/args/cni-network/cni.yaml I have the entry

- name: IP_AUTODETECTION_METHOD
  value: "can-reach=XXX.XXX.XXX.XXX"

with XXX.XXX.XXX.XXX being the IP of node two on the first node and the IP of node one for the other two.

Bro wtf ?

yukiman76 commented 3 months ago

We are also seeing the same problem.