Closed: sebastienle14 closed this issue 9 months ago.
Hi @sebastienle14, thank you for filing the issue.
What does the routing table on your host look like? Also, I assume you can ping/curl the addresses of the VMs directly, e.g. 192.168.100.71?
greetings @neoaggelos
Indeed, I can:
root@host:~# ip route show
default via 192.168.100.1 dev net100 proto static
192.168.100.0/24 dev net100 proto kernel scope link src 192.168.100.56
192.168.102.0/24 dev net102 proto kernel scope link src 192.168.102.4
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown
Here is the VM launch command:
libvirt+ 15597 21.8 1.1 2784320 2209108 ? Sl 11:37 77:34 /usr/bin/qemu-system-x86_64 -name guest=eck-master,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-14-eck-master/master-key.aes"} -machine pc-i440fx-impish,accel=kvm,usb=off,vmport=off,dump-guest-core=off,memory-backend=pc.ram -cpu EPYC-Rome,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,cmp-legacy=on,ibrs=on,amd-ssbd=on,virt-ssbd=on,svme-addr-chk=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on -m 2048 -object {"qom-type":"memory-backend-ram","id":"pc.ram","size":2147483648} -overcommit mem-lock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 139fe967-bab8-4063-b631-c8d8511d9b39 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=43,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -blockdev {"driver":"file","filename":"/srv/vms/eck-master.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":false,"driver":"qcow2","file":"libvirt-2-storage","backing":null} -device ide-hd,bus=ide.0,unit=0,drive=libvirt-2-format,id=ide0-0-0,bootindex=1 -device ide-cd,bus=ide.0,unit=1,id=ide0-0-1 -netdev tap,fd=48,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:9f:b0:4d,bus=pci.0,addr=0x3 -netdev tap,fd=49,id=hostnet1 -device e1000,netdev=hostnet1,id=net1,mac=52:54:00:ce:4e:bc,bus=pci.0,addr=0x4 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -audiodev id=audio1,driver=spice -spice port=5903,addr=127.0.0.1,disable-ticketing=on,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x5 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0,audiodev=audio1 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
I believe this output will be necessary too; sorry, I forgot to include it in the OP:
root@eck-master:~# kubectl get svc -o wide -n eck
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
elasticsearch-es-transport ClusterIP None <none> 9300/TCP 26h common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=elasticsearch
elasticsearch-es-http ClusterIP 10.152.183.14 <none> 9200/TCP 26h common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=elasticsearch
elasticsearch-es-internal-http ClusterIP 10.152.183.87 <none> 9200/TCP 26h common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=elasticsearch
kibana-kb-http ClusterIP 10.152.183.108 <none> 5601/TCP 26h common.k8s.elastic.co/type=kibana,kibana.k8s.elastic.co/name=kibana
elasticsearch-es-default ClusterIP None <none> 9200/TCP 26h common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=elasticsearch,elasticsearch.k8s.elastic.co/statefulset-name=elasticsearch-es-default
ingress-nginx-controller-admission ClusterIP 10.152.183.98 <none> 443/TCP 26h app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
cert-manager ClusterIP 10.152.183.196 <none> 9402/TCP 26h app.kubernetes.io/component=controller,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=cert-manager
cert-manager-webhook ClusterIP 10.152.183.30 <none> 443/TCP 26h app.kubernetes.io/component=webhook,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=webhook
nginx-elk ClusterIP 10.152.183.123 <none> 443/TCP 26h app=nginx-elk,release=nginx-elk
fleet-server-agent-http ClusterIP 10.152.183.130 <none> 8220/TCP 26h agent.k8s.elastic.co/name=fleet-server,common.k8s.elastic.co/type=agent
logstash-prod-logstash-headless ClusterIP None <none> 9600/TCP 5h7m app=logstash-prod-logstash
logstash-prod-logstash LoadBalancer 10.152.183.15 192.168.100.101 5044:31489/TCP 5h7m app=logstash-prod-logstash,chart=logstash,release=logstash-prod
ingress-nginx-controller LoadBalancer 10.152.183.92 192.168.100.100 80:32092/TCP,443:30932/TCP 26h app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Unfortunately, I am unable to reproduce this; with a similar setup (libvirt hosts, bridged networking), the LB IP is reachable as it should be.
If you use a NodePort service, can you reach your service from the host?
What is the networking equipment? Is this a cloud environment? What does the routing table look like in the VMs? Can you use tcpdump and see whether traffic reaches any of the VMs (though this seems unlikely)? Can you double-check the firewall rules on the host?
Do you see any warning logs in the metallb pods?
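One possible way to run the NodePort check above, assuming the ingress-nginx-controller service from the outputs earlier in the thread (the exact nodePort is whatever gets allocated):
root@eck-master:~# kubectl -n eck patch svc ingress-nginx-controller -p '{"spec":{"type":"NodePort"}}'
root@eck-master:~# kubectl -n eck get svc ingress-nginx-controller   # note the allocated nodePort
root@host:~# curl -kv https://192.168.100.70:<nodePort>/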
Our hosting team confirms that all the networks the host is plugged into are handled by a Cisco Nexus 3064-T (48 x 10GBase) appliance. It should be manageable, but they do not know it well yet (they are waiting for training on it). At the moment this appliance is managed by one of our partners, not by us.
I will set up tcpdump on both sides (host & VMs) today and will get back to you.
We tried with a NodePort service, and the same issue occurred.
root@eck-master:~# kubectl -n eck get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
<snipped>
ingress-nginx-controller NodePort 10.152.183.92 <none> 80:32092/TCP,443:30932/TCP 21h
root@host ~# nmap -p1-65535 192.168.100.70
Starting Nmap 7.80 ( https://nmap.org ) at 2022-04-06 10:31 UTC
Nmap scan report for eck-master (192.168.100.70)
Host is up (0.0000070s latency).
Not shown: 65523 closed ports
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
10250/tcp open unknown
10255/tcp open unknown
10257/tcp open unknown
10259/tcp open unknown
16443/tcp open unknown
19001/tcp open unknown
25000/tcp open icl-twobase1
30932/tcp filtered unknown
32092/tcp filtered unknown
32381/tcp filtered unknown
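(For reference, "filtered" in nmap means no usable reply came back for those ports, unlike "closed" where a RST is received, so the probes towards the NodePorts appear to be dropped somewhere. A quick way to see whether they even reach the VM is to capture on the node while re-running the scan or a curl from the host, e.g.:)
root@eck-master:~# tcpdump -ni ens3 'tcp port 30932 or tcp port 32092'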
Here are the routing tables for the VMs:
root@eck-master:~# ip route show
default via 192.168.100.1 dev ens3 proto static
10.1.97.128/26 via 10.1.97.128 dev vxlan.calico onlink
10.1.102.0/26 via 10.1.102.0 dev vxlan.calico onlink
10.1.118.208 dev cali4790718892a scope link
10.1.118.209 dev calie0d8cbdd008 scope link
10.1.194.128/26 via 10.1.194.128 dev vxlan.calico onlink
192.168.100.0/24 dev ens3 proto kernel scope link src 192.168.100.70
192.168.102.0/24 dev ens4 proto kernel scope link src 192.168.102.70
root@eck-worker01:~# ip route show
default via 192.168.100.1 dev ens3 proto static
10.1.97.128/26 via 10.1.97.128 dev vxlan.calico onlink
10.1.102.0/26 via 10.1.102.0 dev vxlan.calico onlink
10.1.118.192/26 via 10.1.118.192 dev vxlan.calico onlink
10.1.194.129 dev cali736072e4537 scope link
10.1.194.131 dev calia28b2c14776 scope link
10.1.194.132 dev cali34e6a0f39f4 scope link
10.1.194.134 dev cali0ebfa45a26d scope link
10.1.194.135 dev cali5153b943d31 scope link
10.1.194.136 dev calic4101f61c36 scope link
10.1.194.137 dev calic9ba359cc71 scope link
10.1.194.138 dev cali79a8d91cda9 scope link
10.1.194.142 dev calibb1f5dc25d1 scope link
192.168.100.0/24 dev ens3 proto kernel scope link src 192.168.100.71
192.168.102.0/24 dev ens4 proto kernel scope link src 192.168.102.71
root@eck-worker02:~# ip route show
default via 192.168.100.1 dev ens3 proto static
10.1.97.145 dev cali24e82f0ac32 scope link
10.1.97.147 dev cali042c4949e87 scope link
10.1.97.148 dev cali69c374ce9a6 scope link
10.1.97.154 dev cali8b67281a788 scope link
10.1.97.155 dev cali3ab98e7ef56 scope link
10.1.97.156 dev cali0d440ae4e03 scope link
10.1.97.157 dev cali9bbe8a269ad scope link
10.1.97.162 dev cali9382a4fd853 scope link
10.1.97.164 dev cali392ae983540 scope link
10.1.102.0/26 via 10.1.102.0 dev vxlan.calico onlink
10.1.118.192/26 via 10.1.118.192 dev vxlan.calico onlink
10.1.194.128/26 via 10.1.194.128 dev vxlan.calico onlink
192.168.100.0/24 dev ens3 proto kernel scope link src 192.168.100.72
192.168.102.0/24 dev ens4 proto kernel scope link src 192.168.102.72
root@eck-worker03:~# ip route show
default via 192.168.100.1 dev ens3 proto static
10.1.97.128/26 via 10.1.97.128 dev vxlan.calico onlink
10.1.102.3 dev cali88e0e4ee76f scope link
10.1.102.8 dev cali52f2043b50e scope link
10.1.102.9 dev caliac9d8debdd8 scope link
10.1.102.10 dev cali63b96077834 scope link
10.1.118.192/26 via 10.1.118.192 dev vxlan.calico onlink
10.1.194.128/26 via 10.1.194.128 dev vxlan.calico onlink
192.168.100.0/24 dev ens3 proto kernel scope link src 192.168.100.73
192.168.102.0/24 dev ens4 proto kernel scope link src 192.168.102.73
And this is the full iptables-save from host:
# Generated by iptables-save v1.8.7 on Thu Apr 7 09:30:08 2022
*mangle
:PREROUTING ACCEPT [433177972:332515709504]
:INPUT ACCEPT [433176194:332515609320]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [285353385:132902329149]
:POSTROUTING ACCEPT [285353385:132902329149]
:LIBVIRT_PRT - [0:0]
-A POSTROUTING -j LIBVIRT_PRT
-A LIBVIRT_PRT -o virbr0 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill
COMMIT
# Completed on Thu Apr 7 09:30:08 2022
# Generated by iptables-save v1.8.7 on Thu Apr 7 09:30:08 2022
*raw
:PREROUTING ACCEPT [433177972:332515709504]
:OUTPUT ACCEPT [285353385:132902329149]
COMMIT
# Completed on Thu Apr 7 09:30:08 2022
# Generated by iptables-save v1.8.7 on Thu Apr 7 09:30:08 2022
*filter
:INPUT ACCEPT [433176194:332515609320]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [285353385:132902329149]
:LIBVIRT_FWI - [0:0]
:LIBVIRT_FWO - [0:0]
:LIBVIRT_FWX - [0:0]
:LIBVIRT_INP - [0:0]
:LIBVIRT_OUT - [0:0]
-A INPUT -j LIBVIRT_INP
-A FORWARD -j LIBVIRT_FWX
-A FORWARD -j LIBVIRT_FWI
-A FORWARD -j LIBVIRT_FWO
-A OUTPUT -j LIBVIRT_OUT
-A LIBVIRT_FWI -d 192.168.122.0/24 -o virbr0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A LIBVIRT_FWI -o virbr0 -j REJECT --reject-with icmp-port-unreachable
-A LIBVIRT_FWO -s 192.168.122.0/24 -i virbr0 -j ACCEPT
-A LIBVIRT_FWO -i virbr0 -j REJECT --reject-with icmp-port-unreachable
-A LIBVIRT_FWX -i virbr0 -o virbr0 -j ACCEPT
-A LIBVIRT_INP -i virbr0 -p udp -m udp --dport 53 -j ACCEPT
-A LIBVIRT_INP -i virbr0 -p tcp -m tcp --dport 53 -j ACCEPT
-A LIBVIRT_INP -i virbr0 -p udp -m udp --dport 67 -j ACCEPT
-A LIBVIRT_INP -i virbr0 -p tcp -m tcp --dport 67 -j ACCEPT
-A LIBVIRT_OUT -o virbr0 -p udp -m udp --dport 53 -j ACCEPT
-A LIBVIRT_OUT -o virbr0 -p tcp -m tcp --dport 53 -j ACCEPT
-A LIBVIRT_OUT -o virbr0 -p udp -m udp --dport 68 -j ACCEPT
-A LIBVIRT_OUT -o virbr0 -p tcp -m tcp --dport 68 -j ACCEPT
COMMIT
# Completed on Thu Apr 7 09:30:08 2022
# Generated by iptables-save v1.8.7 on Thu Apr 7 09:30:08 2022
*nat
:PREROUTING ACCEPT [1270779:76548008]
:INPUT ACCEPT [1269001:76447824]
:OUTPUT ACCEPT [274161:12540602]
:POSTROUTING ACCEPT [274161:12540602]
:LIBVIRT_PRT - [0:0]
-A POSTROUTING -j LIBVIRT_PRT
-A LIBVIRT_PRT -s 192.168.122.0/24 -d 224.0.0.0/24 -j RETURN
-A LIBVIRT_PRT -s 192.168.122.0/24 -d 255.255.255.255/32 -j RETURN
-A LIBVIRT_PRT -s 192.168.122.0/24 ! -d 192.168.122.0/24 -p tcp -j MASQUERADE --to-ports 1024-65535
-A LIBVIRT_PRT -s 192.168.122.0/24 ! -d 192.168.122.0/24 -p udp -j MASQUERADE --to-ports 1024-65535
-A LIBVIRT_PRT -s 192.168.122.0/24 ! -d 192.168.122.0/24 -j MASQUERADE
-A LIBVIRT_PRT -s 192.168.122.0/24 -d 224.0.0.0/24 -j RETURN
-A LIBVIRT_PRT -s 192.168.122.0/24 -d 255.255.255.255/32 -j RETURN
-A LIBVIRT_PRT -s 192.168.122.0/24 ! -d 192.168.122.0/24 -p tcp -j MASQUERADE --to-ports 1024-65535
-A LIBVIRT_PRT -s 192.168.122.0/24 ! -d 192.168.122.0/24 -p udp -j MASQUERADE --to-ports 1024-65535
-A LIBVIRT_PRT -s 192.168.122.0/24 ! -d 192.168.122.0/24 -j MASQUERADE
COMMIT
# Completed on Thu Apr 7 09:30:08 2022
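(For completeness, it may also be worth confirming that no other firewall layer, e.g. ufw or an nftables ruleset, is active on the host; a couple of quick checks, assuming the tools are installed:)
root@host:~# ufw status verbose
root@host:~# nft list ruleset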
The only recurrent log I see on the metallb-controller is the following line:
W0406 21:22:16.792009 1 reflector.go:302] pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98: watch of *v1.ConfigMap ended with: too old resource version: 10848161 (10849179)
But there are no notable warnings about the issue we're discussing, it seems.
Best Regards,
Yes, the routing tables look good. At this point I think tcpdump is needed to pinpoint what may be going wrong.
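It may also be worth looking at the MetalLB speaker pods, not just the controller, since in layer-2 mode the speakers are what answer ARP for the LoadBalancer IP; for example:
root@eck-master:~# kubectl -n metallb-system logs ds/speaker --tail=100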
I see the nmap, can you also try a curl?
Here it is:
root@host:~# curl https://192.168.100.100:30932/ --insecure
curl: (28) Failed to connect to 192.168.100.100 port 30932: Connection timed out
The captures are too big to attach here, so I created this WeTransfer link: https://we.tl/t-p8xP4Zkyfy
On host, I used:
tcpdump -i any -n not tcp port 22 -w /tmp/net100_lb_all_traffic2.pcap
On VM, I used:
tcpdump -i any -n 'not tcp port 22' -w /tmp/vm_incoming3.pcap
My Wireshark filter is as follows:
not ip.dst_host==192.168.102.4 and not ip.src_host==192.168.102.4 and (ip.dst_host==192.168.100.100 or ip.src_host==192.168.100.100)
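(Roughly the same filter can be applied to the capture file directly with tcpdump, which may be easier than sharing the full pcap:)
root@host:~# tcpdump -nr /tmp/net100_lb_all_traffic2.pcap 'host 192.168.100.100 and not host 192.168.102.4'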
It seems I'm getting ICMP unreachable responses. I also see some STP traces from the Cisco appliance.
I have to admit my networking skills are a little rusty and I cannot read these traces as easily as my younger self could.
How about curl https://192.168.100.70:30932 (on the IPs of the nodes directly)?
From the node itself:
root@eck-master:~# curl https://192.168.100.70:30932 --insecure
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
From the host, it stalls indefinitely (so far; I'll let it time out):
root@host:~# curl https://192.168.100.70:30932 --insecure -vvv
* Trying 192.168.100.70:30932...
* connect to 192.168.100.70 port 30932 failed: Connection timed out
* Failed to connect to 192.168.100.70 port 30932: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to 192.168.100.70 port 30932: Connection timed out
(in another terminal, at the same time:)
root@host:~# nmap -p1-65535 192.168.100.70 -Pn
Starting Nmap 7.80 ( https://nmap.org ) at 2022-04-07 15:08 CEST
Nmap scan report for 192.168.100.70
Host is up (0.00052s latency).
Not shown: 65520 closed ports
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
7472/tcp open unknown
7946/tcp open unknown
10250/tcp open unknown
10255/tcp open unknown
10257/tcp open unknown
10259/tcp open unknown
16443/tcp open unknown
19001/tcp open unknown
25000/tcp open icl-twobase1
30418/tcp open unknown
30932/tcp filtered unknown
31489/tcp filtered unknown
32092/tcp filtered unknown
MAC Address: 52:54:00:9F:B0:4D (QEMU virtual NIC)
Nmap done: 1 IP address (1 host up) scanned in 3.24 seconds
root@host:~#
From the other nodes, it also stalls:
root@eck-worker01:~# curl https://192.168.100.70:30932 --insecure
curl: (28) Failed to connect to 192.168.100.70 port 30932: Connection timed out
root@eck-worker01:~# curl https://192.168.100.100:443 --insecure
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
EDIT: the tcpdump captures are in my comment above.
As a follow-up:
Cross-node, the curl on 30932 does not work, but it works when each node queries itself:
root@eck-master:~# kubectl -n eck get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
<snipped>
logstash-prod-logstash LoadBalancer 10.152.183.15 192.168.100.101 5044:31489/TCP 26h
ingress-nginx-controller LoadBalancer 10.152.183.92 192.168.100.100 80:32092/TCP,443:30932/TCP 2d
root@eck-master:~# curl https://192.168.100.70:30932 --insecure
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
root@eck-worker01:~# curl https://192.168.100.71:30932 --insecure
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
root@eck-worker03:~# curl https://192.168.100.73:30932 --insecure
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
Okay, judging from that, I think the most likely scenario is that something is cutting off network traffic on high ports (30000-32XXX). The issues you are facing with the LoadBalancer also stem from that.
Can you test the same scenario, but using the local bridge instead? (that would be 192.168.122.0/24, virbr0). If the issue is with MicroK8s, I would expect the error to persist.
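One way to test the high-ports hypothesis independently of Kubernetes is a plain netcat listener between two nodes (the port is just an example in the NodePort range; the listen syntax differs slightly between netcat flavours):
root@eck-worker01:~# nc -l 31000              # or: nc -l -p 31000 with netcat-traditional
root@eck-master:~# nc -vz 192.168.100.71 31000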
I will try to set up the cluster accordingly, but it might take some time; I do not know if the ticket can be put on hold until Monday or so.
best regards,
Greetings,
So we tried to set up a PoC environment, and so far we have failed to even deploy the LoadBalancer (it stays in Pending status).
poc-master on virbr0 with 192.168.122.193
poc-worker01 on virbr0 with 192.168.122.177
poc-worker02 on virbr0 with 192.168.122.131
We found out that one of the Calico components was failing. We are not sure whether it is linked to our issue. What's your advice?
We tried to recreate the PoC cluster from scratch, without any deployment, and calico-kube-controllers is still in CrashLoopBackOff state.
Here is what we did on poc-master:
root@poc-master:~# snap install microk8s --classic --channel=latest/stable
microk8s v1.23.5 from Canonical✓ installed
root@poc-master:~# usermod -a -G microk8s $USER
root@poc-master:~# chown -f -R $USER ~/.kube
root@poc-master:~# microk8s status --wait-ready
microk8s is running
high-availability: no
datastore master nodes: 127.0.0.1:19001
datastore standby nodes: none
addons:
enabled:
ha-cluster # Configure high availability on the current node
disabled:
ambassador # Ambassador API Gateway and Ingress
cilium # SDN, fast with full network policy
dashboard # The Kubernetes dashboard
dashboard-ingress # Ingress definition for Kubernetes dashboard
dns # CoreDNS
fluentd # Elasticsearch-Fluentd-Kibana logging and monitoring
gpu # Automatic enablement of Nvidia CUDA
helm # Helm 2 - the package manager for Kubernetes
helm3 # Helm 3 - Kubernetes package manager
host-access # Allow Pods connecting to Host services smoothly
inaccel # Simplifying FPGA management in Kubernetes
ingress # Ingress controller for external access
istio # Core Istio service mesh services
jaeger # Kubernetes Jaeger operator with its simple config
kata # Kata Containers is a secure runtime with lightweight VMS
keda # Kubernetes-based Event Driven Autoscaling
knative # The Knative framework on Kubernetes.
kubeflow # Kubeflow for easy ML deployments
linkerd # Linkerd is a service mesh for Kubernetes and other frameworks
metallb # Loadbalancer for your Kubernetes cluster
metrics-server # K8s Metrics Server for API access to service metrics
multus # Multus CNI enables attaching multiple network interfaces to pods
openebs # OpenEBS is the open-source storage solution for Kubernetes
openfaas # OpenFaaS serverless framework
portainer # Portainer UI for your Kubernetes cluster
prometheus # Prometheus operator for monitoring and logging
rbac # Role-Based Access Control for authorisation
registry # Private image registry exposed on localhost:32000
storage # Storage class; allocates storage from host directory
traefik # traefik Ingress controller for external access
root@poc-master:~# microk8s enable dns storage
Enabling DNS
Applying manifest
serviceaccount/coredns created
configmap/coredns created
deployment.apps/coredns created
service/kube-dns created
clusterrole.rbac.authorization.k8s.io/coredns created
clusterrolebinding.rbac.authorization.k8s.io/coredns created
Restarting kubelet
DNS is enabled
Enabling default storage class
deployment.apps/hostpath-provisioner created
storageclass.storage.k8s.io/microk8s-hostpath created
serviceaccount/microk8s-hostpath created
clusterrole.rbac.authorization.k8s.io/microk8s-hostpath created
clusterrolebinding.rbac.authorization.k8s.io/microk8s-hostpath created
Storage will be available soon
root@poc-master:~# vi /var/snap/microk8s/current/certs/csr.conf.template
root@poc-master:~# microk8s refresh-certs
Taking a backup of the current certificates under /var/snap/microk8s/3052/var/log/ca-backup/
Creating new certificates
Can't load /root/.rnd into RNG
140342700733888:error:2406F079:random number generator:RAND_load_file:Cannot open file:../crypto/rand/randfile.c:88:Filename=/root/.rnd
Can't load /root/.rnd into RNG
140211887805888:error:2406F079:random number generator:RAND_load_file:Cannot open file:../crypto/rand/randfile.c:88:Filename=/root/.rnd
Signature ok
subject=C = GB, ST = Canonical, L = Canonical, O = Canonical, OU = Canonical, CN = 127.0.0.1
Getting CA Private Key
Signature ok
subject=CN = front-proxy-client
Getting CA Private Key
1
Creating new kubeconfig file
Stopped.
Started.
The CA certificates have been replaced. Kubernetes will restart the pods of your workloads.
Any worker nodes you may have in your cluster need to be removed and re-joined to become aware of the new CA.
root@poc-master:~# microk8s enable metallb:192.168.122.200-192.168.122.209
Enabling MetalLB
Applying Metallb manifest
namespace/metallb-system created
secret/memberlist created
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
Warning: spec.template.spec.nodeSelector[beta.kubernetes.io/os]: deprecated since v1.14; use "kubernetes.io/os" instead
daemonset.apps/speaker created
deployment.apps/controller created
configmap/config created
MetalLB is enabled
root@poc-master:~# kubectl taint nodes poc-master node-role.kubernetes.io/master=:NoSchedule
node/poc-master tainted
root@poc-master:~# microk8s add-node
On workers, here is what we did:
root@poc-worker01:~# snap install microk8s --classic --channel=latest/stable
microk8s v1.23.5 from Canonical✓ installed
root@poc-worker01:~# microk8s join 192.168.122.193:25000/fc169cbb2003ec5642dd050b9afb11d8/127e916e28e6 --worker
Contacting cluster at 192.168.122.193
The node has joined the cluster and will appear in the nodes list in a few seconds.
Currently this worker node is configured with the following kubernetes API server endpoints:
- 192.168.122.193 and port 16443, this is the cluster node contacted during the join operation.
If the above endpoints are incorrect, incomplete or if the API servers are behind a loadbalancer please update
/var/snap/microk8s/current/args/traefik/provider.yaml
and here is the result from kubectl from Host:
root@host:~/# kubectl get all --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-64c6478b6c-p2qkx 0/1 Running 1 (5m21s ago) 6m14s
metallb-system pod/speaker-6vw7v 1/1 Running 0 3m45s
kube-system pod/calico-node-nmq6m 1/1 Running 0 93s
metallb-system pod/speaker-999mv 1/1 Running 0 93s
kube-system pod/hostpath-provisioner-7764447d7c-j7xdj 1/1 Running 0 3m45s
metallb-system pod/controller-558b7b958-z4gnh 1/1 Running 0 3m45s
kube-system pod/calico-node-ps67f 1/1 Running 0 93s
metallb-system pod/speaker-p5pch 1/1 Running 0 63s
kube-system pod/calico-node-5j4mn 1/1 Running 0 62s
kube-system pod/calico-kube-controllers-55bcdcf5c6-ht8dz 0/1 CrashLoopBackOff 8 (23s ago) 6m51s
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.152.183.1 <none> 443/TCP 6m57s
kube-system service/kube-dns ClusterIP 10.152.183.10 <none> 53/UDP,53/TCP,9153/TCP 6m14s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
metallb-system daemonset.apps/speaker 3 3 3 3 3 beta.kubernetes.io/os=linux 4m49s
kube-system daemonset.apps/calico-node 3 3 3 3 3 kubernetes.io/os=linux 6m56s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/calico-kube-controllers 0/1 1 0 6m56s
kube-system deployment.apps/coredns 0/1 1 0 6m14s
kube-system deployment.apps/hostpath-provisioner 1/1 1 1 6m4s
metallb-system deployment.apps/controller 1/1 1 1 4m49s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/calico-kube-controllers-55bcdcf5c6 1 1 0 6m52s
kube-system replicaset.apps/coredns-64c6478b6c 1 1 0 6m14s
kube-system replicaset.apps/hostpath-provisioner-7764447d7c 1 1 1 3m45s
metallb-system replicaset.apps/controller-558b7b958 1 1 1 3m45s
And when we tried deploying nginx-ingress, here was the state of the cluster:
NAMESPACE NAME READY STATUS RESTARTS AGE
<snipped>
kube-system pod/calico-kube-controllers-7c6fcdff9f-5hzsf 0/1 Error 13 (32s ago) 25h
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
eck service/cert-manager-webhook ClusterIP 10.152.183.238 <none> 443/TCP 11m
eck service/cert-manager ClusterIP 10.152.183.82 <none> 9402/TCP 11m
eck service/ingress-nginx-controller-admission ClusterIP 10.152.183.14 <none> 443/TCP 11m
eck service/ingress-nginx-controller LoadBalancer 10.152.183.68 <pending> 80:31764/TCP,443:31426/TCP 11m
~# kubectl logs calico-kube-controllers-7c6fcdff9f-5hzsf -n kube-system
2022-04-14 09:32:22.052 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0414 09:32:22.054172 1 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2022-04-14 09:32:22.055 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2022-04-14 09:32:32.055 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://10.152.183.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2022-04-14 09:32:32.055 [FATAL][1] main.go 114: Failed to initialize Calico datastore error=Get "https://10.152.183.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
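The error above is the controller timing out against the in-cluster API service address, so a quick reachability check of that ClusterIP from a node might help narrow things down (IP taken from the error message; any HTTP response proves TCP connectivity):
root@poc-master:~# curl -k --max-time 5 https://10.152.183.1:443/version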
On the original cluster from the OP, we do not have these issues with the Calico controller:
2022-04-14 09:43:49.608 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0414 09:43:49.612469 1 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2022-04-14 09:43:49.613 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2022-04-14 09:43:56.814 [INFO][1] main.go 149: Getting initial config snapshot from datastore
2022-04-14 09:43:56.853 [INFO][1] main.go 152: Got initial config snapshot
2022-04-14 09:43:56.854 [INFO][1] watchersyncer.go 89: Start called
2022-04-14 09:43:56.854 [INFO][1] main.go 169: Starting status report routine
2022-04-14 09:43:56.854 [INFO][1] main.go 402: Starting controller ControllerType="Node"
2022-04-14 09:43:56.854 [INFO][1] node_controller.go 138: Starting Node controller
2022-04-14 09:43:56.854 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2022-04-14 09:43:56.854 [INFO][1] node_syncer.go 40: Node controller syncer status updated: wait-for-ready
2022-04-14 09:43:56.854 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2022-04-14 09:43:56.854 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-04-14 09:43:56.854 [INFO][1] resources.go 349: Main client watcher loop
2022-04-14 09:43:56.868 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-04-14 09:43:56.868 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2022-04-14 09:43:56.868 [INFO][1] node_syncer.go 40: Node controller syncer status updated: resync
2022-04-14 09:43:56.868 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2022-04-14 09:43:56.868 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2022-04-14 09:43:56.868 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2022-04-14 09:43:56.868 [INFO][1] node_syncer.go 40: Node controller syncer status updated: in-sync
2022-04-14 09:43:56.882 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints
2022-04-14 09:43:56.955 [INFO][1] node_controller.go 151: Node controller is now running
2022-04-14 09:43:56.955 [INFO][1] ipam.go 45: Synchronizing IPAM data
2022-04-14 09:43:56.999 [INFO][1] ipam.go 191: Node and IPAM data is in sync
@neoaggelos I'll also add our netplan config file from the host to the discussion:
~# cat /etc/netplan/00-installer-config.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: false
      dhcp6: false
      match:
        macaddress: b0:7b:25:be:ea:da
    enp2s0f0np0:
      dhcp4: false
      dhcp6: false
      match:
        macaddress: 2c:ea:7f:a7:49:3b
  bridges:
    net100:
      interfaces: [eno1]
      mtu: 1500
      addresses: [ 192.168.100.56/24 ]
      nameservers:
        addresses:
          - 109.205.64.35
          - 208.67.222.222
          - 208.67.220.220
        search: []
      routes:
        - to: default
          via: 192.168.100.1
    net102:
      interfaces: [enp2s0f0np0]
      mtu: 1500
      addresses: [ 192.168.102.4/24 ]
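(A couple of host-side sanity checks, not from the original report, to confirm the VM tap interfaces are actually enslaved to the net100 bridge:)
root@host:~# ip -d link show master net100
root@host:~# bridge link show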
I tried to bypass the Cisco appliance by plugging an L2 switch into the NIC; the issue remains.
root@host:~# arping -I net100 192.168.100.100
ARPING 192.168.100.100 from 192.168.100.56 net100
Unicast reply from 192.168.100.100 [52:54:00:A8:16:6A] 0.937ms
Unicast reply from 192.168.100.100 [52:54:00:A8:16:6A] 1.071ms
Unicast reply from 192.168.100.100 [52:54:00:A8:16:6A] 0.983ms
Unicast reply from 192.168.100.100 [52:54:00:A8:16:6A] 1.088ms
Sent 4 probes (1 broadcast(s))
The issue remains: no route to host when trying to reach 192.168.100.100 on 443.
OK,
After further investigation, it seems the ha-cluster module was to blame. Upon disabling it and setting up the cluster again, everything seems fine.
With ha-cluster disabled, it seems there is no Calico layer deployed anymore.
We will run it for a week and will let you know how it goes.
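For anyone landing here, the rough sequence was presumably along these lines, matching the enable commands shown earlier in the thread (exact behaviour depends on the MicroK8s release; on the versions discussed here, disabling ha-cluster also removes the Calico layer):
root@eck-master:~# microk8s disable ha-cluster
root@eck-master:~# microk8s enable dns storage
root@eck-master:~# microk8s enable metallb:192.168.100.100-192.168.100.109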
Hello, we ran into a similar problem. Is there anything known about the problems with HA configuration and MetalLB?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
greetings,
My setup is as follows: on the 192.168.100.0/24 network, I have an Ubuntu 21.10 host.
This host runs 4 VMs:
All VMs are using bridged interface to net100, in order to be accessible from 192.168.100.0/24 without routing/NATing.
I have enabled the dns, storage and metallb addons on a 1.23.5 release (running on Ubuntu 21.10); the metallb range is 192.168.100.100 to 192.168.100.109.
Within the cluster, everything seems fine:
Outside of it, there is no route to host.
From the host itself, handling the VMs:
IPTables are, on host:
on VMs:
here are the different nmap results:
Using kubectl port-forward from the host works.
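(The port-forward was presumably something along these lines; note that it tunnels through the API server on 16443, which the nmap results show as reachable, so it bypasses the NodePort/LoadBalancer path entirely:)
root@host:~# kubectl -n eck port-forward svc/ingress-nginx-controller 8443:443
root@host:~# curl -k https://127.0.0.1:8443/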
Have I done something wrong with this setup? What would you suggest I dig into?
Best Regards,
inspection-report-20220406_145730.tar.gz