clemenko / rke_airgap_install

a script/method for air gapping the Rancher Stack with Hauler

How to Add Multiple Masters and Rerun the Script for Adding Workers #17

Closed · GuillaumeDorschner closed 5 months ago

GuillaumeDorschner commented 5 months ago

Hello Clemenko,

First of all, I want to tell you that this repo is amazing; I had been looking for something like this for a long time and didn't know about Hauler. I need to set up a cluster offline (so air-gapped). Is it possible to have multiple masters? We can run curl -sfL http://192.168.x.x:8080/hauler_all_the_things.sh | bash -s -- worker 192.168.x.x for a worker, but I need 3 masters, so how do we do that? I also tried adding a worker but got an error, which I think was due to an RKE2 config I forgot to remove. I removed it afterwards, but I don't think the script actually reran, because I can still see the message from the first run (or maybe it did and I just can't tell). Also, I'm getting stuck right after [info] adding yum repo:

[root@server1 ~]# curl -sfL http://192.168.x.x:8080/hauler_all_the_things.sh | bash -s -- worker 192.168.x.x
- deploy worker
[info] updating kernel settings
[info] firewalld not installed # this caught my attention because I do have firewalld installed (and this is not the first time I have run the script)
[info] installing base packages
[info] adding yum repo

So how can I rerun the script to add a worker, and how can I add multiple masters?

clemenko commented 5 months ago

Hi Guillaume, the script is designed for a single master due to the need for a load balancer or control of internal DNS. It gets MUCH more complicated with HA. Here is a video I did on it: https://youtu.be/Um_GVIL71xQ
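For anyone who wants to attempt HA by hand, this is roughly what joining additional server nodes looks like with stock RKE2 (the script does not do this for you); 192.168.x.10 stands in for the load balancer or DNS name every node will use, and the token placeholder comes from the first master:

# on each additional master: point rke2-server at the first master (or the LB/DNS name)
# using the token from /var/lib/rancher/rke2/server/node-token on the first master
mkdir -p /etc/rancher/rke2
cat <<EOF > /etc/rancher/rke2/config.yaml
server: https://192.168.x.10:9345
token: <node-token-from-first-master>
tls-san:
  - 192.168.x.10
EOF
systemctl enable --now rke2-server.service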

personally I recommend disabling firewalld.
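If you go that route:

# stop firewalld now and keep it disabled across reboots
systemctl disable --now firewalld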

When you run the worker piece, it only runs a few things: https://github.com/clemenko/rke_airgap_install/blob/main/hauler_all_the_things.sh#L410

Does the service start? What does kubectl get node show on the master?
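Something like this should show both (assuming kubectl is on the PATH and the admin kubeconfig is at the default /etc/rancher/rke2/rke2.yaml):

# on the worker
systemctl status rke2-agent.service
journalctl -u rke2-agent.service --no-pager | tail -n 50

# on the master
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get node -o wide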

GuillaumeDorschner commented 5 months ago

Thanks for the quick answer!

The master does start; I connected using the kubeconfig:

➜  ansible git:(main) ✗ kubectl get nodes
NAME   STATUS   ROLES                       AGE    VERSION
nuc1   Ready    control-plane,etcd,master   176m   v1.28.9+rke2r1

I'm looking at the logs; could you help me? I don't know where or what to look for.

May 17 14:11:02 server3 sh[1997866]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
May 17 14:11:02 server3 sh[1997867]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=warning msg="not running in CIS mode"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=info msg="Applying Pod Security Admission Configuration"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=info msg="Starting rke2 v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886)"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use t>
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=info msg="Managed etcd cluster not yet initialized"
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use t>
May 17 14:11:02 server3 rke2[1997873]: time="2024-05-17T14:11:02+02:00" level=fatal msg="starting kubernetes: preparing server: failed to validate server configuration: not authorized"
May 17 14:11:02 server3 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
May 17 14:11:02 server3 systemd[1]: rke2-server.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit rke2-server.service has entered the 'failed' state with result 'exit-code'.
May 17 14:11:02 server3 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
-- Subject: Unit rke2-server.service has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit rke2-server.service has failed.
-- 
-- The result is failed.

EDIT: I retried after a while and now I get this:

[root@server3 ~]# curl -sfL http://192.168.x.x:8080/./hauler_all_the_things.sh | bash -s -- worker 192.168.x.x
- deploy worker
[info] updating kernel settings
[info] firewalld not installed
[info] installing base packages
[error] iptables container-selinux iptables libnetfilter_conntrack libnfnetlink libnftnl policycoreutils-python-utils cryptsetup iscsi-initiator-utils packages didn't install
[root@server3 ~]# yum install iptables container-selinux iptables libnetfilter_conntrack libnfnetlink libnftnl policycoreutils-python-utils cryptsetup iscsi-initiator-utils -y
Rancher RKE2 Common (stable)                                                                                                                                            0.0  B/s |   0  B     00:00    
Errors during downloading metadata for repository 'rancher-rke2-common-stable':
  - Curl error (6): Couldn't resolve host name for https://rpm.rancher.io/rke2/stable/common/centos/8/noarch/repodata/repomd.xml [Could not resolve host: rpm.rancher.io]
Error: Failed to download metadata for repo 'rancher-rke2-common-stable': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

clemenko commented 5 months ago

Is server3 a worker node? Looks like you are running rke2-server on it?
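A quick way to check which role the node was set up with, assuming the standard rke2 unit names:

systemctl list-unit-files 'rke2-*'
systemctl is-enabled rke2-server.service rke2-agent.service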

GuillaumeDorschner commented 5 months ago

Yes, I ran curl -sfL http://192.168.x.x:8080/./hauler_all_the_things.sh | bash -s -- worker 192.168.x.x. Maybe I didn't fully uninstall the server that was on that node before.

clemenko commented 5 months ago

you can run rke2-uninstall.sh on that node and re-run the curl command
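In other words, roughly:

# wipe the stale rke2-server install, then re-run the worker install
/usr/bin/rke2-uninstall.sh
curl -sfL http://192.168.x.x:8080/hauler_all_the_things.sh | bash -s -- worker 192.168.x.x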

GuillaumeDorschner commented 5 months ago

It turns out the problem was due to not properly uninstalling the RKE/Rancher software I had tried to use earlier. Now it's working; it's great.

If anyone needs it, here is the full cleanup I ran:

# RKE2 cleanup
/usr/bin/rke2-killall.sh
/usr/bin/rke2-uninstall.sh
/usr/bin/rancher-system-agent-uninstall.sh

# k3s cleanup (if k3s was ever installed)
/usr/local/bin/k3s-uninstall.sh
/usr/local/bin/k3s-agent-uninstall.sh
/usr/local/bin/rancher-system-agent-uninstall.sh

# remove the Rancher system agent package and leftover state
yum remove rancher-system-agent

rm -rf /etc/rancher
rm -rf /var/lib/rancher

reboot

clemenko commented 5 months ago

awesome!