clemenko / rke_airgap_install

a script/method for air gapping the Rancher Stack with Hauler

How to Add Multiple Masters and Rerun the Script for Adding Workers #17

Closed · GuillaumeDorschner closed 5 months ago

GuillaumeDorschner commented 5 months ago

Hello Clemenko,

First of all, I want to tell you that this repo is amazing; I had been looking for something like this for a long time and didn't know about Hauler. I need to set up a cluster offline (so air-gapped). Is it possible to have multiple masters? We can run curl -sfL http://192.168.x.x:8080/hauler_all_the_things.sh | bash -s -- worker 192.168.x.x for a worker, but I need 3 masters, so how do we do that? I also tried adding a worker but got an error, which I think was due to an RKE2 config I forgot to remove. I removed it afterwards, but I don't think the script actually reran, because I can still see the message from the first run (or maybe it did and I just can't tell). Also, I'm getting stuck right after [info] adding yum repo:

[root@server1 ~]# curl -sfL http://192.168.x.x:8080/hauler_all_the_things.sh | bash -s -- worker 192.168.x.x
- deploy worker
[info] updating kernel settings
[info] firewalld not installed # this caught my attention because I do have firewalld installed (and this is not the first time I have run the script)
[info] installing base packages
[info] adding yum repo

So how can I rerun the script to add a worker, and how can I add multiple masters?

clemenko commented 5 months ago

Hi Guillaume, the script is designed for a single master due to the need for a load balancer or control of internal DNS. It gets MUCH more complicated with HA. Here is a video I did on it: https://youtu.be/Um_GVIL71xQ
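For anyone who wants to attempt HA by hand, this is roughly what joining additional server nodes looks like with stock RKE2 (the script does not do this for you); 192.168.x.10 stands in for the load balancer or DNS name every node will use, and the token placeholder comes from the first master:

# on each additional master: point rke2-server at the first master (or the LB/DNS name)
# using the token from /var/lib/rancher/rke2/server/node-token on the first master
mkdir -p /etc/rancher/rke2
cat <<EOF > /etc/rancher/rke2/config.yaml
server: https://192.168.x.10:9345
token: <node-token-from-first-master>
tls-san:
  - 192.168.x.10
EOF
systemctl enable --now rke2-server.service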

personally I recommend disabling firewalld.
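If you go that route:

# stop firewalld now and keep it disabled across reboots
systemctl disable --now firewalld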

When you run the worker piece, it only runs a few things: https://github.com/clemenko/rke_airgap_install/blob/main/hauler_all_the_things.sh#L410

Does the service start? What does kubectl get node show on the master?
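Something like this should show both (assuming kubectl is on the PATH and the admin kubeconfig is at the default /etc/rancher/rke2/rke2.yaml):

# on the worker
systemctl status rke2-agent.service
journalctl -u rke2-agent.service --no-pager | tail -n 50

# on the master
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get node -o wide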

GuillaumeDorschner commented 5 months ago

Thanks for the quick answer!

The master does start; I connected using the kubeconfig:

➜  ansible git:(main) ✗ kubectl get nodes
NAME   STATUS   ROLES                       AGE    VERSION
nuc1   Ready    control-plane,etcd,master   176m   v1.28.9+rke2r1

I'm looking at the logs; could you help me? I don't know where or what to look for.

May 17 14:11:02 server3 sh[1997866]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
May 17 14:11:02 server3 sh[1997867]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=warning msg="not running in CIS mode"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=info msg="Applying Pod Security Admission Configuration"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=info msg="Starting rke2 v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886)"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use t>
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=info msg="Managed etcd cluster not yet initialized"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use t>
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=fatal msg="starting kubernetes: preparing server: failed to validate server configuration: not authorized"
May 17 14:11:02 server3 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
May 17 14:11:02 server3 systemd[1]: rke2-server.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit rke2-server.service has entered the 'failed' state with result 'exit-code'.
May 17 14:11:02 server3 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
-- Subject: Unit rke2-server.service has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit rke2-server.service has failed.
-- 
-- The result is failed.

EDIT: I retried after a while and now I get this:

[root@server3 ~]# curl -sfL http://192.168.x.x:8080/./hauler_all_the_things.sh | bash -s -- worker 192.168.x.x
- deploy worker
[info] updating kernel settings
[info] firewalld not installed
[info] installing base packages
[error] iptables container-selinux iptables libnetfilter_conntrack libnfnetlink libnftnl policycoreutils-python-utils cryptsetup iscsi-initiator-utils packages didn't install
[root@server3 ~]# yum install iptables container-selinux iptables libnetfilter_conntrack libnfnetlink libnftnl policycoreutils-python-utils cryptsetup iscsi-initiator-utils -y
Rancher RKE2 Common (stable)                                                                                                                                            0.0  B/s |   0  B     00:00    
Errors during downloading metadata for repository 'rancher-rke2-common-stable':
  - Curl error (6): Couldn't resolve host name for https://rpm.rancher.io/rke2/stable/common/centos/8/noarch/repodata/repomd.xml [Could not resolve host: rpm.rancher.io]
Error: Failed to download metadata for repo 'rancher-rke2-common-stable': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

clemenko commented 5 months ago

Is server3 a worker node? Looks like you are running rke2-server on it?
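A quick way to check which role the node was set up with, assuming the standard rke2 unit names:

systemctl list-unit-files 'rke2-*'
systemctl is-enabled rke2-server.service rke2-agent.service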

GuillaumeDorschner commented 5 months ago

Yes, I ran curl -sfL http://192.168.x.x:8080/./hauler_all_the_things.sh | bash -s -- worker 192.168.x.x. Maybe I didn't fully uninstall the server that was on that node before.

clemenko commented 5 months ago

you can run rke2-uninstall.sh on that node and re-run the curl command
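In other words, roughly:

# wipe the stale rke2-server install, then re-run the worker install
/usr/bin/rke2-uninstall.sh
curl -sfL http://192.168.x.x:8080/hauler_all_the_things.sh | bash -s -- worker 192.168.x.x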

GuillaumeDorschner commented 5 months ago

It turns out the problem was due to not properly uninstalling the RKE/Rancher software I had tried to use earlier. Now it's working; it's great.

If anyone needs it, here is the full cleanup I ran:

# RKE2 cleanup
/usr/bin/rke2-killall.sh
/usr/bin/rke2-uninstall.sh
/usr/bin/rancher-system-agent-uninstall.sh

# k3s cleanup (if k3s was ever installed)
/usr/local/bin/k3s-uninstall.sh
/usr/local/bin/k3s-agent-uninstall.sh
/usr/local/bin/rancher-system-agent-uninstall.sh

# remove the Rancher system agent package and leftover state
yum remove rancher-system-agent

rm -rf /etc/rancher
rm -rf /var/lib/rancher

reboot

clemenko commented 5 months ago

awesome!