canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Sudden slow down (after `microk8s enable dashboard`) #2333

Closed: romanchyla closed this issue 1 year ago

romanchyla commented 3 years ago

Apologies for not knowing how to debug this properly; it might be just a coincidence --

I have a 4-node cluster. microk8s.kubectl answers swiftly, but suddenly, after `microk8s enable dashboard`, there is a noticeable delay of ~10s.

If I remove the (previous) master node with `microk8s leave`, the cluster answers quickly again.

When I rejoin the node to the cluster, after what seems like an initial ~5 minutes the slow responses return.
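
(An easy way to put a number on the delay is simply to time a call, e.g.:)

time microk8s.kubectl get po -A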

inspection-report-20210611_203559.tar.gz

`journalctl -f -u snap.microk8s.daemon-apiserver` is full of these log entries:

Jun 11 20:32:25 rbox0 microk8s.daemon-apiserver[2659527]: I0611 20:32:25.700285 2659527 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
Jun 11 20:32:30 rbox0 microk8s.daemon-apiserver[2659527]: E0611 20:32:30.310344 2659527 available_controller.go:508] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Jun 11 20:32:35 rbox0 microk8s.daemon-apiserver[2659527]: E0611 20:32:35.329679 2659527 available_controller.go:508] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Jun 11 20:32:40 rbox0 microk8s.daemon-apiserver[2659527]: E0611 20:32:40.344786 2659527 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.152.183.231:443/apis/metrics.k8s.io/v1beta1: Get "https://10.152.183.231:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.152.183.231:443: i/o timeout
Jun 11 20:32:45 rbox0 microk8s.daemon-apiserver[2659527]: E0611 20:32:45.345776 2659527 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.152.183.231:443/apis/metrics.k8s.io/v1beta1: Get "https://10.152.183.231:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.152.183.231:443: i/o timeout
Jun 11 20:32:50 rbox0 microk8s.daemon-apiserver[2659527]: E0611 20:32:50.354684 2659527 available_controller.go:508] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Jun 11 20:32:51 rbox0 microk8s.daemon-apiserver[2659527]: I0611 20:32:51.768057 2659527 client.go:360] parsed scheme: "passthrough"
balchua commented 3 years ago

hi @romanchyla

I think both the kube-controller-manager and the kube-scheduler are crashing when they lose leader election.

Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.052546 3159596 shared_informer.go:247] Caches are synced for garbage collector
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.053023 3159596 garbagecollector.go:151] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.057285 3159596 shared_informer.go:247] Caches are synced for garbage collector
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.888294 3159596 shared_informer.go:240] Waiting for caches to sync for resource quota
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.888373 3159596 shared_informer.go:247] Caches are synced for resource quota
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: E0604 19:09:43.686552 3159596 leaderelection.go:361] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": context deadline exceeded
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:09:43.686737 3159596 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: F0604 19:09:43.686870 3159596 controllermanager.go:294] leaderelection lost
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: goroutine 1 [running]:
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc000ee7e00, 0x4c, 0x1e9)
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]:         /build/microk8s/parts/k8s-binaries/build/go/src/github.com/kubernetes/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9

The same goes for the kube-scheduler.

Jun 04 18:46:17 rbox0 microk8s.daemon-scheduler[3059029]: I0604 18:46:17.660780 3059029 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: E0604 19:05:27.956209 3059029 leaderelection.go:361] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: I0604 19:05:27.956291 3059029 leaderelection.go:278] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: F0604 19:05:27.956320 3059029 server.go:205] leaderelection lost
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: goroutine 1 [running]:

Want to give this a try? https://github.com/ubuntu/microk8s/issues/2205#issuecomment-827208050
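
Roughly, the idea there is to relax the leader-election timings in the MicroK8s args files on each node and then restart; a sketch of what that looks like (the 60s/30s values below are just an example):

# on each node: relax leader election for the controller-manager and the scheduler, then restart
echo "--leader-elect-lease-duration=60s" | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo "--leader-elect-renew-deadline=30s" | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo "--leader-elect-lease-duration=60s" | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
echo "--leader-elect-renew-deadline=30s" | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
sudo microk8s.stop && sudo microk8s.start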

romanchyla commented 3 years ago

thanks @balchua - I tried it: applied the change on the nodes and restarted them (they were restarting one after another, but quickly); it hasn't changed anything

With `microk8s.kubectl delete apiservice v1beta1.metrics.k8s.io`, however, the messages disappeared and the kubectl response time improved dramatically. Again, it might be a coincidence, but (to me at least) it seems related to something that was left over after `microk8s enable dashboard`.
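
(For reference, the registered API services and their availability can be listed with:)

# the AVAILABLE column shows whether the aggregated API (here the metrics server) is reachable
microk8s.kubectl get apiservice
microk8s.kubectl get apiservice v1beta1.metrics.k8s.io -o yaml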

This was the object's definition before I removed it (I deleted the timestamp and uid, in an effort to force the network edits to pass - unsuccessfully). inspection-report-20210612_071710.tar.gz


apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{},"name":"v1beta1.metrics.k8s.io"},"spec":{"group":"metrics.k8s.io","groupPriorityMinimum":100,"insecureSkipTLSVerify":true,"service":{"name":"metrics-server","namespace":"kube-system"},"version":"v1beta1","versionPriority":100}}
  name: v1beta1.metrics.k8s.io
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100
status:
  conditions:
  - lastTransitionTime: "2021-06-12T07:08:34Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available

balchua commented 3 years ago

This may have come from the metrics server that came with the dashboard addon. It seems to slow down the datastore (dqlite in this case). We've seen issues like this when the storage is slow. In the docs there are steps to identify the performance of the rpi storage:
https://microk8s.io/docs/troubleshooting#heading--common-issues
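
As a rough illustration (not necessarily the exact procedure from the docs), a synchronous-write test with dd against the disk holding the MicroK8s data gives a quick feel for write latency; the target path below is just an example:

# write 4k blocks with a sync on every write; very low throughput here points at a slow datastore disk
sudo dd if=/dev/zero of=/var/snap/microk8s/common/dd-latency-test bs=4k count=1000 oflag=dsync
sudo rm /var/snap/microk8s/common/dd-latency-test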

May I know if these are the steps you did?

  1. Create 4 node microk8s cluster on rpi4.
  2. Enable dashboard addon.
  3. Any other workload you run?

Wanted to see if this can be replicated. @ktsakalozos in case you have some idea. 😊

Update: I also noticed that the metrics server couldn't connect to the kubelet.

E0612 07:16:31.050856       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:rbox1: unable to fetch metrics from Kubelet rbox1 (192.168.3.71): Get https://192.168.3.71:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.3.71:10250: i/o timeout, unable to fully scrape metrics from source kubelet_summary:rbox2: unable to fetch metrics from Kubelet rbox2 (192.168.3.62): Get https://192.168.3.62:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.3.62:10250: i/o timeout, unable to fully scrape metrics from source kubelet_summary:rbox0: unable to fetch metrics from Kubelet rbox0 (192.168.3.60): Get https://192.168.3.60:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.3.60:10250: i/o timeout]

I don't see anything suspicious in the apiserver.
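
A quick way to check from one node whether the kubelet port on another node is reachable at all (IP and port taken from the error above; nc and curl are assumed to be available) would be:

# TCP-level check of the kubelet port
nc -vz 192.168.3.71 10250
# same endpoint the metrics server scrapes; even an authentication error would prove basic connectivity
curl -k "https://192.168.3.71:10250/stats/summary?only_cpu_and_memory=true"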

romanchyla commented 3 years ago

Slow dqlite is unlikely in my case; the cluster has NVMe drives -- they are in fact very fast. There is one machine with a SATA SSD (much slower than NVMe, but still capable of writing 600 MB/s).

The DNS inside the cluster is, however, not working correctly, and since yesterday I have been trying to figure out why (it never worked). As a learning experiment I can easily rebuild the cluster, so I'm happy to try any suggestions you might have.

Here is how the cluster was built:

On every machine:

# pyinfra deploy: install MicroK8s and prepare the user's ~/.kube directory
from pyinfra import host
from pyinfra.operations import server

server.shell(
    name='Install Micro k8s',
    commands=[
        'snap install microk8s --classic',
        'usermod -a -G microk8s {}'.format(host.data.username),
        'mkdir /home/{}/.kube || true'.format(host.data.username),
        'chown -f -R {} /home/{}/.kube'.format(host.data.username, host.data.username)
    ],
    sudo=True
)

3 machines (rbox1-3) have a fiber NIC in addition to the ethernet NIC. Early in the cluster's existence I removed the slow node (rbox0), which was also the master -- that's probably why you saw the leader election issues. I was also debugging network issues, so I unplugged one machine, and temporarily 2 machines had 3 NICs.

I also completely reinstalled another machine (rbox1) -- it had trouble joining the cluster; microk8s.status would report that microk8s is not running, but it was running. This was fixed by:

microk8s.leave  //(on rbox1) 
microk8s.remove-node rbox1 //(on master node)
microk8s.add-node //(on master node)
microk8s.join .... //(on rbox1)

The cluster was very responsive; DNS, however, was not working.

I did this exact sequence of commands:

microk8s enable dns

I poked around, trying to see if DNS was working... it was not, but I had other, more important tasks to do -- so one day later, I executed:

microk8s enable dns dashboard

and after the second one the API responses became sluggish (waiting ~20s for `kubectl get po` to complete).

I'm not saying it caused the problem, but I only noticed the API rate-limited messages after that.

So since then I did:

  1. Applied your recommendations (no effect):

     files.line(
         '/var/snap/microk8s/current/args/kube-controller-manager',
         "--leader-elect-lease-duration=60s\n--leader-elect-renew-deadline=30s",
     )

     files.line(
         '/var/snap/microk8s/current/args/kube-scheduler',
         "--leader-elect-lease-duration=60s\n--leader-elect-renew-deadline=30s",
     )

     server.shell(
         name='Restart MicroK8s',
         commands=[
             'microk8s.stop',
             'microk8s.start'
         ],
         sudo=True
     )

  2. `microk8s.kubectl delete apiservice v1beta1.metrics.k8s.io` -- this has solved the problem, but it is likely just a symptom.

  3. `microk8s disable/enable dashboard` -- I did it twice; `apiservice v1beta1.metrics.k8s.io` is never added back -- but `kubectl` continues to respond fast.

My uneducated guess is that this cluster has problems accessing other machines (using DNS) -- but I was only able to get this far:

1. `microk8s disable dns` + `microk8s enable dns:192.168.3.1`  (pods can now ping machines inside the cluster by the hostname and `ping google.com` also works)
2. when a pod `foo` is placed on the same host as `coredns` pod, then from inside the pod `nslookup foo.cluster.local` succeeds, but nslookup of other pods (on other hosts) does not
3. I believe the traffic is not being forwarded by/from the calico bridge -- but I struggle to see through all the `iptables-legacy` rules, and I need to figure out how to easily debug `iptables` chains in general (an iptables check is sketched after the ifconfig output below). It is worth mentioning that the machines also have docker installed on them, even though it is not used. Here is `ifconfig` from one worker:

calib3c61c3cba9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1440
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 23  bytes 1706 (1.7 KB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 7  bytes 746 (746.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:1e:b8:6d:c1  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp16s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.3.72  netmask 255.255.255.0  broadcast 192.168.3.255
        inet6 fe80::20f:53ff:fe37:2b90  prefixlen 64  scopeid 0x20
        ether 00:0f:53:37:2b:90  txqueuelen 1000  (Ethernet)
        RX packets 50240259  bytes 67709907428 (67.7 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11247215  bytes 14248509224 (14.2 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 42

enp39s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.3.62  netmask 255.255.255.0  broadcast 192.168.3.255
        inet6 fe80::2ef0:5dff:fe74:2929  prefixlen 64  scopeid 0x20
        ether 2c:f0:5d:74:29:29  txqueuelen 1000  (Ethernet)
        RX packets 47449115  bytes 17909463984 (17.9 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 70394226  bytes 59496572197 (59.4 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 29170913  bytes 14473751480 (14.4 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 29170913  bytes 14473751480 (14.4 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1410
        inet 10.1.80.128  netmask 255.255.255.255  broadcast 10.1.80.128
        inet6 fe80::643d:d0ff:fe26:2351  prefixlen 64  scopeid 0x20
        ether 66:3d:d0:26:23:51  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16346  bytes 980760 (980.7 KB)
        TX errors 0  dropped 10  overruns 0  carrier 0  collisions 0
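
As mentioned above, here is the kind of iptables check I have in mind for the forwarding suspicion (plain commands, run for both backends since the machines have iptables and iptables-legacy):

# default policy and per-rule packet counters of the FORWARD chain
sudo iptables -L FORWARD -v -n
sudo iptables-legacy -L FORWARD -v -n

# watch the counters while a pod on this node pings a pod on another node
sudo watch -n1 'iptables -L FORWARD -v -n'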


Btw, the (previous) master machine does not run exactly the same OS as the other workers; it has `gdm` and `NetworkManager`.

rchyla@rbox1:~$ microk8s.kubectl get node -o wide
NAME    STATUS   ROLES    AGE     VERSION                     INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
rbox1   Ready    <none>   2d18h   v1.20.7-34+df7df22a741dbc   192.168.3.71   <none>        Ubuntu 20.10         5.8.0-55-generic   containerd://1.3.7
rbox0   Ready    <none>   32d     v1.20.7-34+df7df22a741dbc   192.168.3.60   <none>        Ubuntu 20.04.2 LTS   5.8.0-53-generic   containerd://1.3.7
rbox3   Ready    <none>   32d     v1.20.7-34+df7df22a741dbc   192.168.3.73   <none>        Ubuntu 20.10         5.8.0-53-generic   containerd://1.3.7
rbox2   Ready    <none>   32d     v1.20.7-34+df7df22a741dbc   192.168.3.62   <none>        Ubuntu 20.10         5.8.0-53-generic   containerd://1.3.7


The calico documentation says that `NetworkManager` can interfere with the firewall rules, so I did this (without it having any impact):

if host.fact.directory('/etc/NetworkManager/conf.d'):
    files.line(
        '/etc/NetworkManager/conf.d/calico.conf',
        "[keyfile]\nunmanaged-devices=interface-name:cali;interface-name:tunl;interface-name:vxlan.calico",
        sudo=True,
        present=True,
        name="Adding NetworkManager config (to prevent interfering with calico)"
    )
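
One way to confirm the config is actually picked up (assuming NetworkManager re-reads conf.d on reload) is to check that the Calico interfaces are listed as unmanaged:

# reload NetworkManager, then check that the cali*/tunl*/vxlan.calico devices show up as "unmanaged"
sudo systemctl reload NetworkManager
nmcli device status | grep -E 'cali|tunl|vxlan.calico'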



So yeah, none of this might be relevant -- but I can easily rebuild the cluster, so I can try things (though then we would not know what caused the issues in the first place).

balchua commented 3 years ago

Hi @romanchyla, thanks for the information provided. First, it looks like the sluggish behavior can be attributed to the network issue you are experiencing. IIRC, when the metrics server is installed one can run `kubectl top node`, which makes the apiserver retrieve the info from the metrics server. Even though you are not issuing this `top` command, the apiserver may still be trying to reach the metrics server.

I went back to the last tarball you posted. The iptables rules seem to be incomplete. As an example, here is the iptables output from my own inspect tarball: https://pastebin.ubuntu.com/p/6g4fs2HZg4/

Maybe if we fix the network-related issue we will find out whether it is what is causing your kubectl commands to slow down. I am just guessing here though.

romanchyla commented 3 years ago

Hi @balchua, thanks for the example. I'll try to sift through them looking for differences, and keep experimenting...

romanchyla commented 3 years ago

Ok, sorry for the delay - I was called away by unrelated issues. When I got back to the investigation, I discovered that pods were not able to contact each other unless they were on the same host. (There is both iptables and iptables-legacy on these machines.)

The issue is fixed by: `sudo iptables --policy FORWARD ACCEPT`

Pods can then reach other (remote) pods. CoreDNS is running and I can verify that nslookup is contacting the service.

rchyla@rbox0:~$ k logs -f --namespace=kube-system -l k8s-app=kube-dns
[INFO] 10.1.59.205:58081 - 7 "PTR IN 1.8.26.10.in-addr.arpa. udp 40 false 512" NXDOMAIN qr,rd,ra 40 0.017278138s
[INFO] 10.1.19.13:33968 - 6665 "PTR IN 1.3.168.192.in-addr.arpa. udp 42 false 512" NXDOMAIN qr,rd,ra 42 0.017339273s
[INFO] 10.1.19.13:56354 - 19473 "PTR IN 1.0.168.192.in-addr.arpa. udp 42 false 512" NXDOMAIN qr,rd,ra 42 0.023707314s
[INFO] 10.1.19.13:60706 - 2432 "PTR IN 1.8.26.10.in-addr.arpa. udp 40 false 512" NXDOMAIN qr,rd,ra 40 0.033363787s
[INFO] 10.1.19.13:42882 - 31195 "PTR IN 1.1.254.169.in-addr.arpa. udp 42 false 512" NXDOMAIN qr,rd,ra 42 0.018348662s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3dd5baecb1e49c4798bbe3b90922728b
[INFO] Reloading complete
[INFO] 127.0.0.1:57510 - 55260 "HINFO IN 2407861620103233522.4423466110065054833. udp 57 false 512" NXDOMAIN qr,rd,ra 132 0.025354985s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.default.svc.cluster.local. udp 53 false 512" NOERROR qr,aa,rd 146 0.000175328s
[INFO] 10.1.59.207:39344 - 1536 "A IN hostnames.default.svc.cluster.local. udp 53 false 512" NOERROR qr,aa,rd 104 0.000090419s
[INFO] 10.1.59.207:39344 - 1536 "A IN hostnames.svc.cluster.local. udp 45 false 512" NXDOMAIN qr,aa,rd 138 0.000080811s
[INFO] 10.1.59.207:39344 - 1536 "A IN hostnames.cluster.local. udp 41 false 512" NXDOMAIN qr,aa,rd 134 0.000074118s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.svc.cluster.local. udp 45 false 512" NXDOMAIN qr,aa,rd 138 0.000236181s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.cluster.local. udp 41 false 512" NXDOMAIN qr,aa,rd 134 0.00025151s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.lan. udp 31 false 512" NXDOMAIN qr,rd,ra 106 0.024329677s

but it gets no answer

[rcluster] rchyla@tp1:/dvt/workspace/projects$ k run -i --tty --rm debug3 --image=busybox --restart=Never --overrides='{"apiVersion":"v1","spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchFields":[{"key":"metadata.name","operator":"In","values":["rbox1"]}]}]}}}}}' -- sh
If you don't see a command prompt, try pressing enter.
/ # nslookup hostnames
Server:     10.152.183.10
Address:    10.152.183.10:53

*** Can't find hostnames.default.svc.cluster.local: No answer

*** Can't find hostnames.svc.cluster.local: No answer
*** Can't find hostnames.cluster.local: No answer
*** Can't find hostnames.lan: No answer
*** Can't find hostnames.default.svc.cluster.local: No answer
*** Can't find hostnames.svc.cluster.local: No answer
*** Can't find hostnames.cluster.local: No answer
*** Can't find hostnames.lan: No answer
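
(A related check, following the usual Service-debugging steps, is whether the `hostnames` Service and kube-dns itself actually have endpoints:)

microk8s.kubectl get svc hostnames
microk8s.kubectl get endpoints hostnames
microk8s.kubectl -n kube-system get endpoints kube-dns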

I'm suspicious of how DNS records are being (not) updated. Also, is it expected that kube-proxy listens on localhost?

rchyla@rbox1:~$ ps aux | grep kube-proxy
root     1036300  0.0  0.6 744148 41920 ?        Ssl  Jun16   0:54 /snap/microk8s/2264/kube-proxy --kubeconfig=/var/snap/microk8s/2264/credentials/proxy.config --cluster-cidr=10.1.0.0/16 --healthz-bind-address=127.0.0.1
rchyla   3661712  0.0  0.0   6416   884 pts/0    S+   02:52   0:00 grep --color=auto kube-proxy
rchyla@rbox1:~$ cat /var/snap/microk8s/2264/credentials/proxy.config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
    server: https://127.0.0.1:16443
  name: microk8s-cluster
contexts:
- context:
    cluster: microk8s-cluster
    user: kubeproxy
  name: microk8s
current-context: microk8s
kind: Config
preferences: {}
users:
- name: kubeproxy
  user:
    token: ...

and searching `journalctl | grep kube-proxy` turns up this message, which leads me to believe that DNS is not being updated properly:

Jun 16 11:37:10 rbox1 microk8s.daemon-proxy[1036300]: E0616 11:37:10.290870 1036300 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rbox1.16890cf48ae043d4", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rbox1", UID:"rbox1", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"rbox1"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://127.0.0.1:16443/api/v1/namespaces/default/events": dial tcp 127.0.0.1:16443: connect: connection refused'(may retry after sleeping)
Jun 16 11:37:20 rbox1 microk8s.daemon-proxy[1036300]: E0616 11:37:20.291480 1036300 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rbox1.16890cf48ae043d4", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rbox1", UID:"rbox1", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"rbox1"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://127.0.0.1:16443/api/v1/namespaces/default/events": dial tcp 127.0.0.1:16443: connect: connection refused'(may retry after sleeping)
Jun 16 11:37:30 rbox1 microk8s.daemon-proxy[1036300]: E0616 11:37:30.292152 1036300 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rbox1.16890cf48ae043d4", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rbox1", UID:"rbox1", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"rbox1"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://127.0.0.1:16443/api/v1/namespaces/default/events": dial tcp 127.0.0.1:16443: connect: connection refused'(may retry after sleeping)

Any hint on how to debug DNS updates, and on whether the Kubernetes master should be accessible on 127.0.0.1:16443, would be much appreciated.

romanchyla commented 3 years ago

@balchua and one more detail: kube-apiserver is listening using the tcp/tcp6 protocols and the 127.0.0.1 / `:::` interfaces - could that be relevant? ssh, for example, listens on both.

rchyla@rbox1:~$ sudo netstat -tulpn
[sudo] password for rchyla: 
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:42723         0.0.0.0:*               LISTEN      1036280/containerd  
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      3548243/kubelet     
tcp        0      0 0.0.0.0:25000           0.0.0.0:*               LISTEN      1036572/python3     
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      1036300/kube-proxy  
tcp        0      0 127.0.0.1:10251         0.0.0.0:*               LISTEN      1302762/kube-schedu 
tcp        0      0 127.0.0.1:9099          0.0.0.0:*               LISTEN      1040514/calico-node 
tcp        0      0 127.0.0.1:10252         0.0.0.0:*               LISTEN      1302750/kube-contro 
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/init              
tcp        0      0 127.0.0.1:10256         0.0.0.0:*               LISTEN      1036300/kube-proxy  
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      774/systemd-resolve 
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      898/sshd: /usr/sbin 
tcp        0      0 192.168.3.71:19001      0.0.0.0:*               LISTEN      1036282/kube-apiser 
tcp        0      0 127.0.0.1:6010          0.0.0.0:*               LISTEN      3012144/sshd: rchyl 
tcp        0      0 127.0.0.1:1338          0.0.0.0:*               LISTEN      1036280/containerd  
tcp6       0      0 :::10250                :::*                    LISTEN      3548243/kubelet     
tcp6       0      0 :::10255                :::*                    LISTEN      3548243/kubelet     
tcp6       0      0 :::111                  :::*                    LISTEN      1/init              
tcp6       0      0 :::10257                :::*                    LISTEN      1302750/kube-contro 
tcp6       0      0 :::10259                :::*                    LISTEN      1302762/kube-schedu 
tcp6       0      0 :::22                   :::*                    LISTEN      898/sshd: /usr/sbin 
tcp6       0      0 ::1:6010                :::*                    LISTEN      3012144/sshd: rchyl 
tcp6       0      0 :::16443                :::*                    LISTEN      1036282/kube-apiser 
udp        0      0 127.0.0.53:53           0.0.0.0:*                           774/systemd-resolve 
udp        0      0 192.168.3.71:68         0.0.0.0:*                           772/systemd-network 
udp        0      0 192.168.3.61:68         0.0.0.0:*                           772/systemd-network 
udp        0      0 0.0.0.0:111             0.0.0.0:*                           1/init              
udp        0      0 0.0.0.0:4789            0.0.0.0:*                           -                   
udp6       0      0 :::111                  :::*                                1/init             
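
(A simple check for that last question, assuming curl is available on the node: poke the local port directly; even an authorization error in the response means the listener on 127.0.0.1:16443 is reachable.)

curl -vk https://127.0.0.1:16443/version
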
balchua commented 3 years ago

@romanchyla if you can upload all the inspect tarballs, hopefully we can spot something. So the issue right now is that pods cannot access other pods via a Service, right?

Wrt the apiserver listening on IPv6, I remember disabling that when I built my cluster on DigitalOcean.

romanchyla commented 3 years ago

Thank you @balchua for your kind help. Yes, the pods can be accessed via their IPs, but the Service IPs won't allow any communication. The tarball is attached: inspection-report-20210621_165116.tar.gz
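
(For reference, whether kube-proxy has programmed the Service rules on a node at all can be seen in the KUBE-SERVICES chain of the nat table; checking both backends, something like:)

sudo iptables -t nat -L KUBE-SERVICES -n | head -20
sudo iptables-legacy -t nat -L KUBE-SERVICES -n | head -20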

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.