hi @romanchyla
I think both the kube-controller-manager and the kube-scheduler are crashing when they lose their leader election.
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.052546 3159596 shared_informer.go:247] Caches are synced for garbage collector
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.053023 3159596 garbagecollector.go:151] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.057285 3159596 shared_informer.go:247] Caches are synced for garbage collector
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.888294 3159596 shared_informer.go:240] Waiting for caches to sync for resource quota
Jun 04 19:08:16 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:08:16.888373 3159596 shared_informer.go:247] Caches are synced for resource quota
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: E0604 19:09:43.686552 3159596 leaderelection.go:361] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=10s": context deadline exceeded
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: I0604 19:09:43.686737 3159596 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: F0604 19:09:43.686870 3159596 controllermanager.go:294] leaderelection lost
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: goroutine 1 [running]:
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc000ee7e00, 0x4c, 0x1e9)
Jun 04 19:09:43 rbox0 microk8s.daemon-controller-manager[3159596]: /build/microk8s/parts/k8s-binaries/build/go/src/github.com/kubernetes/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
Same for the kube-scheduler:
Jun 04 18:46:17 rbox0 microk8s.daemon-scheduler[3059029]: I0604 18:46:17.660780 3059029 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: E0604 19:05:27.956209 3059029 leaderelection.go:361] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s": context deadline exceeded
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: I0604 19:05:27.956291 3059029 leaderelection.go:278] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: F0604 19:05:27.956320 3059029 server.go:205] leaderelection lost
Jun 04 19:05:27 rbox0 microk8s.daemon-scheduler[3059029]: goroutine 1 [running]:
Want to give this a try? https://github.com/ubuntu/microk8s/issues/2205#issuecomment-827208050
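In the meantime, a quick way to see whether the lease renewals are actually stalling is to watch the Lease objects themselves. This is just a sketch and assumes the default lease names used by upstream Kubernetes:

```bash
# renewTime should advance roughly every few seconds; long gaps right before the
# crash would suggest the apiserver/datastore was too slow to renew in time.
microk8s.kubectl -n kube-system get lease kube-controller-manager kube-scheduler -o yaml | grep -E 'name:|renewTime:'
```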
Thanks @balchua - I tried it: applied the change on the nodes and restarted them (they restarted one after another, but quickly), and it hasn't changed anything.
After `microk8s.kubectl delete apiservice v1beta1.metrics.k8s.io`, however, the messages disappeared and the kubectl response time improved dramatically. Again, it might be a coincidence, but (to me at least) it seems related to something that was left over after `microk8s enable dashboard`.
This was the object's definition before I removed it (I deleted the timestamp and uuid, in an effort to force the network edits to pass - unsuccessfully): inspection-report-20210612_071710.tar.gz
```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{},"name":"v1beta1.metrics.k8s.io"},"spec":{"group":"metrics.k8s.io","groupPriorityMinimum":100,"insecureSkipTLSVerify":true,"service":{"name":"metrics-server","namespace":"kube-system"},"version":"v1beta1","versionPriority":100}}
  name: v1beta1.metrics.k8s.io
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100
status:
  conditions:
  - lastTransitionTime: "2021-06-12T07:08:34Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available
```
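(For reference, this kind of leftover shows up when listing the aggregated API services; an aggregated API that is not Available is a well-known cause of slow kubectl calls, since client discovery keeps waiting on its timeout. A minimal check, assuming the default MicroK8s kubectl wrapper:)

```bash
# Any APIService whose AVAILABLE column is not "True" will slow down client discovery.
microk8s.kubectl get apiservice
microk8s.kubectl get apiservice | grep -v True
```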
This may have come from the metrics server that came with the dashboard addon.
This seems to slow down the datastore (dqlite in this case).
We've seen issues like this when the storage is slow. In the docs there are steps to identify the performance of the rpi storage.
https://microk8s.io/docs/troubleshooting#heading--common-issues
May I know if these are the steps you did?
Wanted to see if this can be replicated. @ktsakalozos in case you have some idea. 😊
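If it helps, a rough way to gauge datastore disk latency (a sketch only, assuming `dd` is available and that dqlite lives under the default `/var/snap/microk8s/current` path) is a small synchronous write test:

```bash
# oflag=dsync forces each 1MB block to hit the disk, which roughly approximates
# the fsync-heavy write pattern of dqlite; a very low number here points at slow storage.
sudo dd if=/dev/zero of=/var/snap/microk8s/current/dd-test bs=1M count=256 oflag=dsync
sudo rm /var/snap/microk8s/current/dd-test
```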
Update: I also noticed that the metrics server couldn't connect to the kubelet.
E0612 07:16:31.050856 1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:rbox1: unable to fetch metrics from Kubelet rbox1 (192.168.3.71): Get https://192.168.3.71:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.3.71:10250: i/o timeout, unable to fully scrape metrics from source kubelet_summary:rbox2: unable to fetch metrics from Kubelet rbox2 (192.168.3.62): Get https://192.168.3.62:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.3.62:10250: i/o timeout, unable to fully scrape metrics from source kubelet_summary:rbox0: unable to fetch metrics from Kubelet rbox0 (192.168.3.60): Get https://192.168.3.60:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.3.60:10250: i/o timeout]
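Since those are plain TCP timeouts rather than auth errors, a quick reachability check of the kubelet port from another node might narrow it down (a sketch; an immediate `401 Unauthorized` would mean the port is reachable and only auth is missing, while a hang points at a network or firewall problem):

```bash
# Run from a different node (or from inside the metrics-server pod's network namespace).
curl -k -m 5 "https://192.168.3.71:10250/stats/summary?only_cpu_and_memory=true"
```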
I don't see anything suspicious in the apiserver.
Slow dqlite is unlikely in my case; the cluster has NVMe drives -- they are in fact very fast. There is one machine with a SATA SSD (much slower than NVMe, but still capable of writing 600 MB/s).
DNS inside the cluster is, however, not working correctly, and since yesterday I've been trying to figure out why (it never worked). This is a learning experiment and I can easily rebuild the cluster, so I'm happy to try any suggestions you might have.
Here is how the cluster was built:
on every machine:
```python
server.shell(
    name='Install Micro k8s',
    commands=[
        'snap install microk8s --classic',
        'usermod -a -G microk8s {}'.format(host.data.username),
        'mkdir /home/{}/.kube || true'.format(host.data.username),
        'chown -f -R {} /home/{}/.kube'.format(host.data.username, host.data.username)
    ],
    sudo=True
)
```
3 machines (rbox1-3) have a fiber NIC in addition to the ethernet NIC. Early in the cluster's existence I removed the slow node (rbox0), which was also the master -- that's probably why you saw the leader election issues. I was also debugging network issues, so I unplugged one machine, and temporarily 2 machines had 3 NICs.
I also completely reinstalled another machine (rbox1) -- it had trouble joining the cluster: `microk8s.status` would report that microk8s is not running, even though it was running. This was fixed by:

```
microk8s.leave              # on rbox1
microk8s.remove-node rbox1  # on the master node
microk8s.add-node           # on the master node
microk8s.join ....          # on rbox1
```
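For anyone hitting the same thing, the basic post-rejoin sanity checks would be something along these lines (just a sketch):

```bash
microk8s.status --wait-ready       # blocks until MicroK8s reports it is running
microk8s.kubectl get nodes -o wide
```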
The cluster was very responsive; DNS, however, was not working. I did this exact sequence of commands:

`microk8s enable dns`

Then I poked around, trying to see if DNS was working ..... it was not, but I had other more important tasks to do -- so one day later, I executed:

`microk8s enable dns dashboard`

and after the second command the API responses became sluggish (waiting ~20s for `kubectl get po` to return). I'm not saying it caused it, but I only noticed the `api rate limited` messages from then on.
So since then I did:
```python
files.line(
    '/var/snap/microk8s/current/args/kube-controller-manager',
    "--leader-elect-lease-duration=60s\n--leader-elect-renew-deadline=30s",
)

files.line(
    '/var/snap/microk8s/current/args/kube-scheduler',
    "--leader-elect-lease-duration=60s\n--leader-elect-renew-deadline=30s",
)

server.shell(
    name='Restart MicroK8s',
    commands=[
        'microk8s.stop',
        'microk8s.start'
    ],
    sudo=True
)
```
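To double-check that the flag change actually took effect (a sketch, assuming the default MicroK8s paths and the service names visible in the logs above):

```bash
# The flags should show up in the args files; the journal shows whether
# leader election is still timing out after the restart.
grep -H leader-elect /var/snap/microk8s/current/args/kube-controller-manager \
                     /var/snap/microk8s/current/args/kube-scheduler
journalctl -u snap.microk8s.daemon-controller-manager --since "15 min ago" | grep -i leaderelection
```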
1. `microk8s.kubectl delete apiservice v1beta1.metrics.k8s.io` -- this has solved the problem; but it likely is just a symptom
2. `microk8s disable/enable dashboard` -- I did it twice; `apiservice v1beta1.metrics.k8s.io` is never added back -- but `kubectl` continues to respond fast
My uneducated guess is that this cluster has problems accessing other machines (using DNS) -- but I was only able to get this far:
1. `microk8s disable dns` + `microk8s enable dns:192.168.3.1` (pods can now ping machines inside the cluster by the hostname and `ping google.com` also works)
2. when a pod `foo` is placed on the same host as `coredns` pod, then from inside the pod `nslookup foo.cluster.local` succeeds, but nslookup of other pods (on other hosts) does not
3. I believe the traffic is not being forwarded by/from the calico bridge -- but I struggle to see through all the `iptables-legacy` rules, and I need to figure out how to easily debug `iptables` chains in general (see the sketch after the `ifconfig` output below). It is worth mentioning that the machines also have Docker installed on them, even though it is not used. Here is `ifconfig` from one worker:
```
calib3c61c3cba9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1440
        inet6 fe80::ecee:eeff:feee:eeee  prefixlen 64  scopeid 0x20
        ether ee:ee:ee:ee:ee:ee  txqueuelen 0  (Ethernet)
        RX packets 23  bytes 1706 (1.7 KB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 7  bytes 746 (746.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:1e:b8:6d:c1  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp16s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.3.72  netmask 255.255.255.0  broadcast 192.168.3.255
        inet6 fe80::20f:53ff:fe37:2b90  prefixlen 64  scopeid 0x20
        ether 00:0f:53:37:2b:90  txqueuelen 1000  (Ethernet)
        RX packets 50240259  bytes 67709907428 (67.7 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11247215  bytes 14248509224 (14.2 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 42

enp39s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.3.62  netmask 255.255.255.0  broadcast 192.168.3.255
        inet6 fe80::2ef0:5dff:fe74:2929  prefixlen 64  scopeid 0x20
        ether 2c:f0:5d:74:29:29  txqueuelen 1000  (Ethernet)
        RX packets 47449115  bytes 17909463984 (17.9 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 70394226  bytes 59496572197 (59.4 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10

vxlan.calico: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1410
        inet 10.1.80.128  netmask 255.255.255.255  broadcast 10.1.80.128
        inet6 fe80::643d:d0ff:fe26:2351  prefixlen 64  scopeid 0x20
        ether 66:3d:d0:26:23:51  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16346  bytes 980760 (980.7 KB)
        TX errors 0  dropped 10 overruns 0  carrier 0  collisions 0
```
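As mentioned in the list above, here is a rough starting point for looking at the FORWARD path (a sketch; `-v -n` shows per-rule packet counters, which makes it easier to see where cross-node pod traffic is being dropped):

```bash
sudo iptables -S FORWARD | head -n 5            # default policy (-P) plus the first rules
sudo iptables -L FORWARD -v -n --line-numbers   # packet/byte counters per rule
sudo iptables-legacy -L FORWARD -v -n           # rules written through the legacy backend, if any
```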
Btw, the (previous) master machine is not exactly the same OS as the other workers; it has `gdm` and `NetworkManager`
rchyla@rbox1:~$ microk8s.kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
rbox1 Ready
The Calico documentation says that `NetworkManager` can interfere with the firewall rules, so I did this (without it having any impact):

```python
if host.fact.directory('/etc/NetworkManager/conf.d'):
    files.line(
        '/etc/NetworkManager/conf.d/calico.conf',
        "[keyfile]\nunmanaged-devices=interface-name:cali;interface-name:tunl;interface-name:vxlan.calico",
        sudo=True,
        present=True,
        name="Adding NetworkManager config (to prevent interfering with calico)"
    )
```
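A possible follow-up check (a sketch, assuming `nmcli` is present) is to confirm that the calico interfaces really end up unmanaged after the config change -- NetworkManager only picks up the new conf.d file after a reload/restart:

```bash
sudo systemctl reload NetworkManager
nmcli device status | grep -E 'cali|tunl|vxlan'   # expect "unmanaged" in the STATE column
```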
so yeah, none of this might be relevant -- but I can easily rebuild the cluster, so I can try things (though then we would not know what caused the issues in the first place)
Hi @romanchyla
Thanks for the information provided. First, it looks like the sluggish behavior can be attributed to the network issue you are experiencing. IIRC, when the metrics server is installed, one can do `kubectl top node`, which makes the apiserver retrieve the info from the metrics server. Even though you are not issuing this `top` command, the apiserver may still be trying to reach the metrics server.
I went back to the last tarball you posted. The iptables rules seem to be incomplete. As an example, here is the iptables output from my own inspect tarball: https://pastebin.ubuntu.com/p/6g4fs2HZg4/
Maybe if we fix the network-related issue, we will find out whether that is what's causing your kubectl commands to slow down. I am just guessing here though.
Hi @balchua, thanks for the example. I'll be sifting through the rules looking for differences, and experimenting...
OK, sorry for the delay - I was called away by unrelated issues. When I got back to the investigation, I discovered that pods were not able to contact each other unless they were on the same host. There is `iptables` and `iptables-legacy`.
The issue is fixed by: `sudo iptables --policy FORWARD ACCEPT`
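One thing worth noting about this fix (a guess, not something I have confirmed on this cluster): since Docker is installed on the nodes, its daemon sets the FORWARD chain policy to DROP when it starts, so the manual policy change may get reverted on the next reboot or Docker restart. A quick way to keep an eye on it:

```bash
sudo iptables -S FORWARD | head -n 1       # prints "-P FORWARD ACCEPT" or "-P FORWARD DROP"
systemctl is-enabled docker 2>/dev/null || true
```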
Pods can then get to other (remote) pods. CoreDNS is running and I can verify that nslookup is contacting the service:
rchyla@rbox0:~$ k logs -f --namespace=kube-system -l k8s-app=kube-dns
[INFO] 10.1.59.205:58081 - 7 "PTR IN 1.8.26.10.in-addr.arpa. udp 40 false 512" NXDOMAIN qr,rd,ra 40 0.017278138s
[INFO] 10.1.19.13:33968 - 6665 "PTR IN 1.3.168.192.in-addr.arpa. udp 42 false 512" NXDOMAIN qr,rd,ra 42 0.017339273s
[INFO] 10.1.19.13:56354 - 19473 "PTR IN 1.0.168.192.in-addr.arpa. udp 42 false 512" NXDOMAIN qr,rd,ra 42 0.023707314s
[INFO] 10.1.19.13:60706 - 2432 "PTR IN 1.8.26.10.in-addr.arpa. udp 40 false 512" NXDOMAIN qr,rd,ra 40 0.033363787s
[INFO] 10.1.19.13:42882 - 31195 "PTR IN 1.1.254.169.in-addr.arpa. udp 42 false 512" NXDOMAIN qr,rd,ra 42 0.018348662s
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3dd5baecb1e49c4798bbe3b90922728b
[INFO] Reloading complete
[INFO] 127.0.0.1:57510 - 55260 "HINFO IN 2407861620103233522.4423466110065054833. udp 57 false 512" NXDOMAIN qr,rd,ra 132 0.025354985s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.default.svc.cluster.local. udp 53 false 512" NOERROR qr,aa,rd 146 0.000175328s
[INFO] 10.1.59.207:39344 - 1536 "A IN hostnames.default.svc.cluster.local. udp 53 false 512" NOERROR qr,aa,rd 104 0.000090419s
[INFO] 10.1.59.207:39344 - 1536 "A IN hostnames.svc.cluster.local. udp 45 false 512" NXDOMAIN qr,aa,rd 138 0.000080811s
[INFO] 10.1.59.207:39344 - 1536 "A IN hostnames.cluster.local. udp 41 false 512" NXDOMAIN qr,aa,rd 134 0.000074118s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.svc.cluster.local. udp 45 false 512" NXDOMAIN qr,aa,rd 138 0.000236181s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.cluster.local. udp 41 false 512" NXDOMAIN qr,aa,rd 134 0.00025151s
[INFO] 10.1.59.207:39344 - 1536 "AAAA IN hostnames.lan. udp 31 false 512" NXDOMAIN qr,rd,ra 106 0.024329677s
But it gets no answer:
[rcluster] rchyla@tp1:/dvt/workspace/projects$ k run -i --tty --rm debug3 --image=busybox --restart=Never --overrides='{"apiVersion":"v1","spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchFields":[{"key":"metadata.name","operator":"In","values":["rbox1"]}]}]}}}}}' -- sh
If you don't see a command prompt, try pressing enter.
/ # nslookup hostnames
Server: 10.152.183.10
Address: 10.152.183.10:53
*** Can't find hostnames.default.svc.cluster.local: No answer
*** Can't find hostnames.svc.cluster.local: No answer
*** Can't find hostnames.cluster.local: No answer
*** Can't find hostnames.lan: No answer
*** Can't find hostnames.default.svc.cluster.local: No answer
*** Can't find hostnames.svc.cluster.local: No answer
*** Can't find hostnames.cluster.local: No answer
*** Can't find hostnames.lan: No answer
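In case it helps, some checks that might narrow this down (a sketch; `hostnames` is the Service from the logs above):

```bash
# Does the Service have endpoints at all, and where are the CoreDNS pods running?
microk8s.kubectl get svc,endpoints hostnames -o wide
microk8s.kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# From inside the busybox pod: query a CoreDNS pod IP directly, bypassing the
# 10.152.183.10 Service IP, to separate "DNS records are wrong" from "Service IPs don't work":
#   nslookup hostnames.default.svc.cluster.local <coredns-pod-ip>
```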
I'm suspicious of how DNS records are being (not) updated. Also, is it expected that kube-proxy listens on localhost?
```
rchyla@rbox1:~$ ps aux | grep kube-proxy
root     1036300  0.0  0.6 744148 41920 ?  Ssl  Jun16  0:54 /snap/microk8s/2264/kube-proxy --kubeconfig=/var/snap/microk8s/2264/credentials/proxy.config --cluster-cidr=10.1.0.0/16 --healthz-bind-address=127.0.0.1
rchyla   3661712  0.0  0.0   6416   884 pts/0  S+  02:52  0:00 grep --color=auto kube-proxy
rchyla@rbox1:~$ cat /var/snap/microk8s/2264/credentials/proxy.config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
    server: https://127.0.0.1:16443
  name: microk8s-cluster
contexts:
- context:
    cluster: microk8s-cluster
    user: kubeproxy
  name: microk8s
current-context: microk8s
kind: Config
preferences: {}
users:
- name: kubeproxy
  user:
    token: ...
```
And searching `journalctl | grep kube-proxy` turns up these messages, which lead me to believe that DNS is not being updated properly:
Jun 16 11:37:10 rbox1 microk8s.daemon-proxy[1036300]: E0616 11:37:10.290870 1036300 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rbox1.16890cf48ae043d4", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rbox1", UID:"rbox1", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"rbox1"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://127.0.0.1:16443/api/v1/namespaces/default/events": dial tcp 127.0.0.1:16443: connect: connection refused'(may retry after sleeping)
Jun 16 11:37:20 rbox1 microk8s.daemon-proxy[1036300]: E0616 11:37:20.291480 1036300 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rbox1.16890cf48ae043d4", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rbox1", UID:"rbox1", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"rbox1"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://127.0.0.1:16443/api/v1/namespaces/default/events": dial tcp 127.0.0.1:16443: connect: connection refused'(may retry after sleeping)
Jun 16 11:37:30 rbox1 microk8s.daemon-proxy[1036300]: E0616 11:37:30.292152 1036300 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rbox1.16890cf48ae043d4", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"rbox1", UID:"rbox1", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"rbox1"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc02a966e451549d4, ext:183900161554, loc:(*time.Location)(0x29c71a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://127.0.0.1:16443/api/v1/namespaces/default/events": dial tcp 127.0.0.1:16443: connect: connection refused'(may retry after sleeping)
Any hint on how to debug DNS updates, and on whether the kubernetes master should be accessible on `127.0.0.1:16443`, would be much appreciated.
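(For what it's worth, the proxy.config shown above points kube-proxy at `https://127.0.0.1:16443`, so the apiserver being reachable on localhost from each node does look intended. A simple probe, just as a sanity check:)

```bash
# Any HTTP response (even 401/403) proves the apiserver answers on this address;
# "connection refused", as in the journal above, means it was down or not listening at the time.
curl -k -m 5 https://127.0.0.1:16443/healthz
sudo ss -tlnp | grep 16443
```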
@balchua and one more detail: kube-apiserver is listening using the tcp/tcp6 protocols and on the `127.0.0.1`/`:::` interfaces -- could that be relevant? `ssh`, for example, listens on both:
rchyla@rbox1:~$ sudo netstat -tulpn
[sudo] password for rchyla:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:42723 0.0.0.0:* LISTEN 1036280/containerd
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 3548243/kubelet
tcp 0 0 0.0.0.0:25000 0.0.0.0:* LISTEN 1036572/python3
tcp 0 0 127.0.0.1:10249 0.0.0.0:* LISTEN 1036300/kube-proxy
tcp 0 0 127.0.0.1:10251 0.0.0.0:* LISTEN 1302762/kube-schedu
tcp 0 0 127.0.0.1:9099 0.0.0.0:* LISTEN 1040514/calico-node
tcp 0 0 127.0.0.1:10252 0.0.0.0:* LISTEN 1302750/kube-contro
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/init
tcp 0 0 127.0.0.1:10256 0.0.0.0:* LISTEN 1036300/kube-proxy
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 774/systemd-resolve
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 898/sshd: /usr/sbin
tcp 0 0 192.168.3.71:19001 0.0.0.0:* LISTEN 1036282/kube-apiser
tcp 0 0 127.0.0.1:6010 0.0.0.0:* LISTEN 3012144/sshd: rchyl
tcp 0 0 127.0.0.1:1338 0.0.0.0:* LISTEN 1036280/containerd
tcp6 0 0 :::10250 :::* LISTEN 3548243/kubelet
tcp6 0 0 :::10255 :::* LISTEN 3548243/kubelet
tcp6 0 0 :::111 :::* LISTEN 1/init
tcp6 0 0 :::10257 :::* LISTEN 1302750/kube-contro
tcp6 0 0 :::10259 :::* LISTEN 1302762/kube-schedu
tcp6 0 0 :::22 :::* LISTEN 898/sshd: /usr/sbin
tcp6 0 0 ::1:6010 :::* LISTEN 3012144/sshd: rchyl
tcp6 0 0 :::16443 :::* LISTEN 1036282/kube-apiser
udp 0 0 127.0.0.53:53 0.0.0.0:* 774/systemd-resolve
udp 0 0 192.168.3.71:68 0.0.0.0:* 772/systemd-network
udp 0 0 192.168.3.61:68 0.0.0.0:* 772/systemd-network
udp 0 0 0.0.0.0:111 0.0.0.0:* 1/init
udp 0 0 0.0.0.0:4789 0.0.0.0:* -
udp6 0 0 :::111 :::* 1/init
@romanchyla if you can upload the full inspect tarball, hopefully we can spot something.
So the issue as of right now is that pods cannot access other pods via the `Service`, right?
Regarding the apiserver listening on IPv6, I remember disabling that when I built my cluster in DigitalOcean.
Thank you @balchua for your kind help. Yes, pods can be accessed via their IPs, but the Service IPs won't allow any communication. The tarball is attached: inspection-report-20210621_165116.tar.gz
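In the meantime, a couple of things worth checking on the Service path, since kube-proxy programs Service IPs via iptables NAT rules (a sketch; use `iptables-legacy` instead if that is the backend in use on these nodes):

```bash
# Are there KUBE-SERVICES NAT rules for the Service IPs at all?
sudo iptables -t nat -L KUBE-SERVICES -n | head -n 20
# And do the Services have healthy endpoints behind them?
microk8s.kubectl get endpoints -A
```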
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Apologies for not knowing how to debug this properly; it might be just a coincidence --

I have a 4-node cluster. `microk8s.kubectl` is answering swiftly, but suddenly after `microk8s enable dashboard` there is a noticeable delay of ~10s. If I remove the (previous) master node with `microk8s leave`, the cluster answers quickly again. When I rejoin the cluster, after what seems like an initial 5 minutes, the slow responses return.

inspection-report-20210611_203559.tar.gz

`journalctl -f -u snap.microk8s.daemon-apiserver` is full of these log entries: