DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Fix networking on k8s-dev-ctrl-1 #33

Closed nickatnceas closed 2 years ago

nickatnceas commented 2 years ago

Networking is partially broken after rebooting k8s-dev-ctrl-1. This was immediately apparent when K8s failed to start back up (the kubectl command did not work) and internal hostnames failed to resolve. The names docker-dev-ucsb-1.test.dataone.org and docker-dev-ucsb-1 would not resolve, even though they were in /etc/hosts.

I got internal DNS working by allowing access to the systemd-resolved server on 127.0.0.53 via UFW rules:

sudo ufw allow on lo to any
sudo ufw allow from 127.0.0.53

The system was then able to resolve the internal hostnames, and K8s started up and appears to be working properly.

After that, external DNS resolution started failing; for example, hostnames under ubuntu.com and nceas.ucsb.edu would not resolve. After creating a UFW rule to allow access to the upstream DNS server at 128.111.1.1 (sudo ufw allow from 128.111.1.0/24), DNS appears to be fully working again.

However, outgoing network traffic to hosts that are not specifically allowed in UFW rules is still broken. For example, ping 8.8.8.8 fails on k8s-dev-ctrl-1, but works on k8s-node-ctrl-1. The default UFW rules allow all outgoing traffic, so I suspect some other network configuration is currently blocking outgoing requests. We can work around part of this issue by opening IPs and subnets in UFW, but opening the entire internet that way would effectively remove the firewall.
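To confirm the claim about the default outgoing policy rather than rely on memory, the policy can be read from the "Default:" line of `ufw status verbose`. The helper below is a minimal sketch; the function name `default_outgoing_policy` is made up for illustration, and it assumes the usual "Default: deny (incoming), allow (outgoing), ..." line format.

```shell
#!/bin/sh
# Extract the default outgoing policy from `ufw status verbose` output.
# Reads the status text on stdin and prints e.g. "allow" or "deny".
# Assumes a line of the form:
#   Default: deny (incoming), allow (outgoing), disabled (routed)
default_outgoing_policy() {
    sed -nE 's/^Default:.* ([a-z]+) \(outgoing\).*/\1/p'
}

# Hypothetical usage on the host:
#   sudo ufw status verbose | default_outgoing_policy
```

If this prints "allow" while pings to unlisted hosts still fail, the block is more likely coming from somewhere other than the UFW default policy.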

I'm comparing the iptables config on k8s-dev-ctrl-1 to k8s-dev-node-1, which was also rebooted, but works as expected.
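A comparison like this is easier to read if the `iptables-save` output from each host is normalized first, since the timestamp comments and packet/byte counters differ between any two machines. The following is a sketch only; `normalize_rules` and the /tmp paths are illustrative names, not something used in this thread.

```shell
#!/bin/sh
# Strip the parts of `iptables-save` output that always differ between
# hosts (comment lines with timestamps, and [packets:bytes] counters) so
# that `diff` shows only real rule differences.
# Reads stdin, writes stdout.
normalize_rules() {
    sed -E \
        -e '/^#/d' \
        -e 's/\[[0-9]+:[0-9]+\]/[0:0]/'
}

# Hypothetical usage: run the first line on each host, copy the files to
# one place, then diff them.
#   sudo iptables-save | normalize_rules > "/tmp/$(hostname).rules"
#   diff -u /tmp/k8s-dev-node-1.rules /tmp/k8s-dev-ctrl-1.rules
```

Rules generated by Kubernetes can still differ legitimately between a control-plane host and a worker, so the diff is a starting point rather than a verdict.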

nickatnceas commented 2 years ago

When restarting ufw I got the following error:

outin@docker-dev-ucsb-1:~$ sudo ufw enable
ERROR: problem running ufw-init
Bad argument `*nat'
Error occurred at line: 75
Try `iptables-restore -h' or 'iptables-restore --help' for more information.

Problem running '/etc/ufw/before.rules'

I edited /etc/ufw/before.rules and commented out the offending section (which does not appear in k8s-dev-node-1):

# Forward port 30443 to 443 and 30080 to 80 - pcs
#*nat
#:PREROUTING ACCEPT [0:0]
#-A PREROUTING -p tcp --dport 30443 -j REDIRECT --to-port 443
#-A PREROUTING -p tcp --dport 30080 -j REDIRECT --to-port 80

And ufw starts again.
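One possible explanation for the "Bad argument `*nat'" error, not confirmed in this thread, is that the *nat header was read while the *filter table was still open, that is, before its COMMIT line. In iptables-restore format, each table section starts with a *table line and must be closed by COMMIT before the next one begins. The sketch below checks a rules file for that structure; `check_tables` is a made-up helper name.

```shell
#!/bin/sh
# Report iptables-restore structure problems in a rules file:
#   - a table header (*filter, *nat, ...) that appears before the
#     previous table's COMMIT
#   - a table that is never closed by COMMIT
# Prints one line per problem and exits non-zero if any were found.
check_tables() {   # usage: check_tables FILE
    awk '
        /^[[:space:]]*#/ { next }          # ignore comment lines
        /^\*/ {
            if (table != "") {
                printf "line %d: %s starts before COMMIT of %s\n", NR, $1, table
                bad = 1
            }
            table = $1
            next
        }
        /^COMMIT/ { table = "" }
        END {
            if (table != "") {
                printf "end of file: %s has no COMMIT\n", table
                bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}

# Hypothetical usage on the host:
#   sudo cat /etc/ufw/before.rules > /tmp/before.rules
#   check_tables /tmp/before.rules
```

Note that the commented-out section shown above has no COMMIT line of its own, so it would also need one if it were ever restored as a separate *nat section.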

Disabling UFW allows the server to access outside hosts. We now have two ways of disabling the firewall to fix the issue, but we do want a firewall enabled.

gothub commented 2 years ago

hey @nick - apologies for that, I must have made this change a while ago when trying to enable outside access to ports 80/443. Routing for these ports is now done by k8s and maybe that caused a collision. Here is how this is done now: https://github.com/DataONEorg/k8s-cluster/issues/16#issuecomment-987047094

Sorry about the trouble, thx for fixing this.

nickatnceas commented 2 years ago

I found net.ipv6.conf.all.forwarding=1 was set on k8s-dev-ctrl-1 but not on k8s-dev-node-1. Forwarding is disabled by default, and we're not using IPv6, so I commented the line out and reloaded the settings with sysctl -p /etc/sysctl.conf.
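One caveat worth checking here: `sysctl -p` applies only the settings present in the file, so commenting a line out does not by itself change the value in the running kernel until the next reboot or an explicit `sysctl -w`. The sketch below compares what the file sets against the live value; `configured_value` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the value a sysctl config file sets for a key, ignoring lines
# that are commented out with '#' or ';'. Prints nothing if the key is
# not set in the file.
configured_value() {   # usage: configured_value FILE KEY
    awk -F= -v key="$2" '
        /^[[:space:]]*[#;]/ { next }       # skip commented-out settings
        {
            k = $1; gsub(/[[:space:]]/, "", k)
            if (k == key) {
                v = $2; gsub(/[[:space:]]/, "", v)
                print v
            }
        }
    ' "$1"
}

# Hypothetical usage on the host:
#   configured_value /etc/sysctl.conf net.ipv6.conf.all.forwarding  # from file
#   sysctl -n net.ipv6.conf.all.forwarding                          # running kernel
#   sudo sysctl -w net.ipv6.conf.all.forwarding=0                   # reset without reboot
```

If the file prints nothing but the running kernel still reports 1, the old value is simply still in effect.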

nickatnceas commented 2 years ago

After reverting /etc/ufw/before.rules and /etc/sysctl.conf (see above), the networking was still not working as expected, and the server could not talk to anything besides what was allowed in the ufw rules. After restarting ufw several times to dump the running iptables config with and without UFW running, the network started behaving properly.

I rebooted k8s-dev-ctrl-1 to confirm that the networking now survives reboots, and it does. K8s and networking came back up without issue.

I verified that the remaining K8s servers (k8s-dev-node-1, k8s-ctrl-1, k8s-node-1, k8s-node-2, k8s-node-3) did not have the same modifications to the two network files; they were only changed on k8s-dev-ctrl-1.

It looks like the network changes were made in December 2021, and k8s-dev-ctrl-1 had not been rebooted since they were made.