I think this is normal, as explained in the Calico documentation: https://docs.projectcalico.org/networking/mtu. The Calico setup in MicroK8s uses VXLAN.
You are right. So I am quite lost trying to track down the problem I am having with these random connection errors. They sometimes happen and other times they don't; most of the time it happens in a Docker-in-Docker container that I run for the CI.
Not much of a clue here, but did you notice whether the Calico pods are restarting?
I have no pod, and no resource with any Calico name. The kubectl get all --all-namespaces command doesn't return anything that appears to be Calico.
The pod that I restarted and updated a few days ago is CoreDNS, but now it seems to work fine, since it doesn't show any strange logs.
I attach an error that also happens sometimes and is logged in the coredns pod:
.:53
[INFO] plugin/reload: Running configuration MD5 = be0f52d3c13480652e0c73672f2fa263
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae
W0629 16:30:35.815179 1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: watch of *v1.Namespace ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
E0629 16:30:37.241482 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.152.183.1:443/api/v1/namespaces?resourceVersion=80868361": dial tcp 10.152.183.1:443: connect: connection refused
E0629 16:30:39.044949 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.152.183.1:443/api/v1/namespaces?resourceVersion=80868361": dial tcp 10.152.183.1:443: connect: connection refused
E0629 16:30:42.507358 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.152.183.1:443/api/v1/namespaces?resourceVersion=80868361": dial tcp 10.152.183.1:443: connect: connection refused
This happens randomly when trying to run a job in my CI environment (GitLab):
Hi @BartoGabriel, thanks for the response. Hmmm, no Calico pods indicates that you are using a non-HA MicroK8s, which uses flannel. Are you OK with uploading the inspect tarball?
@balchua thank you very much for your help, I am attaching the file to see if you can help me:
I think kubelite panics because the controller manager or the scheduler couldn't elect a leader.
51 microk8s microk8s.daemon-kubelite[2819550]: E0629 16:22:49.089003 2819550 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded" containerID="0dcd00da89764231df75292e6cd11c53afc954e80562bd2465a7ff229033a3ea" cmd=[/bin/bash /configmaps/check-live]
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: I0629 16:22:52.105602 2819550 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: E0629 16:22:52.230069 2819550 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: E0629 16:22:52.230209 2819550 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: E0629 16:22:52.255422 2819550 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: I0629 16:22:52.255528 2819550 trace.go:205] Trace[840929659]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager,user-agent:kubelite/v1.21.1 (linux/amd64) kubernetes/1f02fea/leader-election,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (29-Jun-2021 16:22:08.737) (total time: 43518ms):
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: Trace[840929659]: [43.518429213s] [43.518429213s] END
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: I0629 16:22:52.521022 2819550 leaderelection.go:278] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: E0629 16:22:52.566172 2819550 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: E0629 16:22:52.566223 2819550 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
Jun 29 16:22:52 microk8s microk8s.daemon-kubelite[2819550]: E0629 16:22:52.582463 2819550 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
Jun 29 16:22:53 microk8s microk8s.daemon-kubelite[2819550]: I0629 16:22:53.186303 2819550 client.go:360] parsed scheme: "passthrough"
Jun 29 16:22:53 microk8s microk8s.daemon-kubelite[2819550]: I0629 16:22:53.231893 2819550 trace.go:205] Trace[2014340278]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler,user-agent:kubelite/v1.21.1 (linux/amd64) kubernetes/1f02fea/leader-election,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf, */*,protocol:HTTP/2.0 (29-Jun-2021 16:22:08.833) (total time: 44398ms):
Jun 29 16:22:53 microk8s microk8s.daemon-kubelite[2819550]: Trace[2014340278]: [44.39841045s] [44.39841045s] END
Jun 29 16:22:54 microk8s microk8s.daemon-kubelite[2819550]: I0629 16:22:53.886969 2819550 trace.go:205] Trace[524070731]: "iptables ChainExists" (29-Jun-2021 16:22:50.625) (total time: 3261ms):
Jun 29 16:22:54 microk8s microk8s.daemon-kubelite[2819550]: Trace[524070731]: [3.261729803s] [3.261729803s] END
Jun 29 16:22:54 microk8s microk8s.daemon-kubelite[2819550]: F0629 16:22:53.973013 2819550 server.go:205] leaderelection lost
The last line in the logs is a fatal error. It's from about the same time as the log you showed above.
You can try increasing the following:
--leader-elect-lease-duration=60s
--leader-elect-renew-deadline=30s
in these two files:
/var/snap/microk8s/current/args/kube-controller-manager
/var/snap/microk8s/current/args/kube-scheduler
Then restart MicroK8s. I am also curious whether you happen to have a slow disk. Etcd (which you are using) is kind of sensitive to slow storage.
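A minimal sketch of applying that change from the shell, assuming the default snap paths above and that the flags are not already present in those files:

# Append the longer leader-election timeouts to both components.
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager /var/snap/microk8s/current/args/kube-scheduler
echo '--leader-elect-renew-deadline=30s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager /var/snap/microk8s/current/args/kube-scheduler
# Restart so kubelite picks up the new arguments.
microk8s stop && microk8s start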
Thank you again!
I'm going to try these new configuration parameters and watch out for errors.
I don't think it's a disk speed problem. The physical machine is an IBM server with the disks in RAID, the virtualization software we use is XCP-ng, and the virtual machine runs Ubuntu Server 20.04.2 LTS.
I also show you the output of the following commands so you can see the disk read and write speeds:
dev@microk8s:~$ sudo hdparm -Tt /dev/xvda
/dev/xvda:
Timing cached reads: 15494 MB in 1.99 seconds = 7795.34 MB/sec
HDIO_DRIVE_CMD(identify) failed: Invalid argument
Timing buffered disk reads: 928 MB in 3.00 seconds = 308.88 MB/sec
dev@microk8s:~$ dd if=/dev/zero of=/tmp/output bs=8k count=10k; rm -f /tmp/output
10240+0 records in
10240+0 records out
83886080 bytes (84 MB, 80 MiB) copied, 0.115295 s, 728 MB/s
I think they are acceptable speeds.
You're right, the disk speeds look good. Did you get the chance to test out the configuration? Also, do you have any firewall running?
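One thing the hdparm/dd runs above don't show is fsync latency, which is what the datastore is sensitive to. A hedged sketch of a sync-write test, assuming fio is installed and using an illustrative target directory:

# Small writes each followed by fdatasync, roughly mimicking a write-ahead log;
# the interesting output is the fsync/fdatasync latency percentiles.
sudo fio --name=datastore-bench --directory=/var/snap/microk8s/common --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300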
I had no luck. Random errors keep happening... :(
During the minutes when it fails, all CI processes that create pods start to fail.
Hi @BartoGabriel. Has this cluster been running for some time, or is it a new installation? I can also see that this is a single-node cluster. Can you upload the inspect tarball again? Let's hope something else is revealed.
@BartoGabriel I went back to your logs. I can see that you are running with swap enabled.
total used free shared buff/cache available
Mem: 11946 3428 2183 1 6333 8523
Swap: 3899 576 3323
This is not recommended by Kubernetes. Can you try turning swap off? May I also know how many CPUs this node has?
The leadership loss is usually caused by the backing store being unable to keep up with Kubernetes.
You may want to bounce the server after changing the swap. Be sure to stop MicroK8s before restarting the node.
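A minimal sketch of that swap change, assuming the swap entry lives in /etc/fstab:

microk8s stop                                  # stop MicroK8s before touching the node
sudo swapoff -a                                # disable swap for the running system
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab     # comment out swap entries so swap stays off after boot
sudo reboot                                    # bounce the server, then run: microk8s start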
I've been googling, and you're right: swap affects performance and can cause key components to fail.
It is only one node; all of MicroK8s is running on a single PC. In a few months we are going to bring up a new node, but first I want to get this main node working properly.
Swap was active because of the way the machine was created:
An important fact: when I updated from version 19 to 21, I noticed that the CoreDNS image had not been updated, so I updated it manually (by changing the deployment) from 1.2 to 1.8. I don't know if this has something to do with it.
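(A quick hedged way to confirm which image the deployment is actually running:)

microk8s kubectl -n kube-system get deploy coredns -o jsonpath='{.spec.template.spec.containers[0].image}'   # prints the CoreDNS image currently in use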
Queries:
I am also attaching the new inspect tarball (after disabling swap): inspection-report-20210705_125138.tar.gz
PC information:
The physical server has 24 cores; 10 cores are assigned to MicroK8s.
Virtual PC information:
Disk:
Usage of /: 11.7% of 392.48GB
CPU
dev@microk8s:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5200.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 0
siblings : 2
core id : 2
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5200.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 1
siblings : 2
core id : 0
cpu cores : 2
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5210.70
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 1
siblings : 2
core id : 2
cpu cores : 2
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5210.70
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 2
siblings : 2
core id : 0
cpu cores : 2
apicid : 8
initial apicid : 8
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5211.09
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 2
siblings : 2
core id : 2
cpu cores : 2
apicid : 10
initial apicid : 10
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5211.09
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 2
apicid : 12
initial apicid : 12
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5210.64
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 3
siblings : 2
core id : 2
cpu cores : 2
apicid : 14
initial apicid : 14
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5210.64
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 8
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 4
siblings : 2
core id : 0
cpu cores : 2
apicid : 16
initial apicid : 16
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5209.65
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 9
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2599.833
cache size : 15360 KB
physical id : 4
siblings : 2
core id : 2
cpu cores : 2
apicid : 18
initial apicid : 18
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good nopl cpuid pni ssse3 cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5209.65
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Memory
dev@microk8s:~$ cat /proc/meminfo
MemTotal: 12232764 kB
MemFree: 6985864 kB
MemAvailable: 9136252 kB
Buffers: 244852 kB
Cached: 1816388 kB
SwapCached: 0 kB
Active: 2637500 kB
Inactive: 1690668 kB
Active(anon): 2086636 kB
Inactive(anon): 920 kB
Active(file): 550864 kB
Inactive(file): 1689748 kB
Unevictable: 18484 kB
Mlocked: 18484 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 1092 kB
Writeback: 0 kB
AnonPages: 2285452 kB
Mapped: 882524 kB
Shmem: 4096 kB
KReclaimable: 228476 kB
Slab: 604704 kB
SReclaimable: 228476 kB
SUnreclaim: 376228 kB
KernelStack: 31628 kB
PageTables: 21616 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6116380 kB
Committed_AS: 13951764 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 44544 kB
VmallocChunk: 0 kB
Percpu: 202240 kB
HardwareCorrupted: 0 kB
AnonHugePages: 106496 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 333824 kB
DirectMap2M: 12240896 kB
Did turning off swap alleviate the issue you are facing?
Do you think 12 GB of RAM and 10 CPUs are enough for a single node? (I know it depends on the pod workload, but it is only for a CI that runs builds and unit tests for a small development team.)
I find that Kubernetes consumes the most system resources when you create and destroy Kubernetes resources very often, like creating several pods and terminating them after a short while. How active will the CI be? Before bumping up your node's memory or CPU, I would first monitor the cluster. So 12 GB is a good start, IMHO.
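One hedged way to keep an eye on headroom before resizing, assuming the metrics-server addon is enabled:

microk8s enable metrics-server     # one-time: enable the metrics pipeline
microk8s kubectl top nodes         # node-level CPU and memory usage
microk8s kubectl top pods -A       # per-pod usage across all namespaces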
Do you think that the installation was dirty, and a new installation should be created?
I wouldn't call it dirty, but I would start with a clean slate this time if possible. Also remember to pin it to a specific channel, like 1.21/stable for example.
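For example, a reinstall pinned to a channel (channel name illustrative) could look like:

sudo snap remove microk8s --purge                              # remove the old installation and its data
sudo snap install microk8s --classic --channel=1.21/stable     # reinstall, pinned to a channel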
So far it is working very well. There was no failure of any kind.
I think this (swap memory) is a very important point, and it should be clarified in the documentation, or the installation process should warn the user.
@balchua Thank you very much for all your help
Good to hear that.
There is an option in the kubelet to fail when swap is on, but in MicroK8s it is set to false (--fail-swap-on=false). I don't have any recollection of why it is set to false.
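If anyone wants to check or change it, the kubelet arguments live next to the files edited earlier; a sketch, assuming the default snap path:

grep fail-swap-on /var/snap/microk8s/current/args/kubelet   # shows --fail-swap-on=false on a stock install
# Set it to true (or remove the line) and restart for the kubelet to refuse to run while swap is enabled:
microk8s stop && microk8s start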
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm running some pods (from CI) that run:
And random errors are happening to me. Sometimes it works well; other times it throws the following error:
At first I thought it was a DNS problem, and I ran multiple tests and couldn't fix it.
The weirdest thing about all this is that it is very random, and I can't always reproduce it.
Investigating a bit, I could see that it happens to other users, and the cause is that the MTU of the Docker interfaces does not match that of the host.
So, I checked it on my server. And indeed the MTU is not the same:
How could I change the MTU of all the interfaces that MicroK8s creates and uses, so that they are all 1500 like eth0?
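A hedged sketch for inspecting the MTUs first, plus a temporary per-interface override (the interface name is illustrative, and a value set with ip link does not survive a restart):

ip -o link show | awk '{print $2, $4, $5}'   # prints every interface name followed by "mtu <value>"
sudo ip link set dev cni0 mtu 1500           # example only: temporarily override one interface's MTU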