clustervision / trinityX

TrinityX is the new generation of ClusterVision's open-source HPC, A/I and cloudbursting platform. It is designed from the ground up to provide all services required in a modern HPC and A/I system, and to allow full customization of the installation.
GNU General Public License v3.0

LUNA2 issues with node ip address change and iPXE boot #420

Closed xdkreij closed 18 hours ago

xdkreij commented 1 week ago

Problem description

LUNA2 overwrites dhcpd.conf when creating a node (which is fine, I guess). However, I cannot get iPXE working when node001 uses any address other than x.x.x.1.

Working Scenario

  1. remove node record from dhcpd.conf
  2. luna -v node remove node001
  3. luna node add -g $GROUP -if BOOTIF -M $MAC $NAME -o compute
  4. reboot node

[INFO]:[2024-07-02 16:02:08,277]:[MainThread]:[boot.py:default@162] - Boot API serving the templ_boot_ipxe.cfg

Issue arises - Steps to reproduce

  1. remove node record from dhcpd.conf
  2. luna -v node remove node001
  3. luna node add -g $GROUP -if BOOTIF -M $MAC $NAME -if BOOTIF -I 100.66.5.2 -o compute
  4. reboot node

Broken: no iPXE boot and no luna2-daemon log messages at all. The same happens with .5.3, .5.4, and so on.

Expected Results

Luna accepts the IP change and serves iPXE to the node. Note that 100.66.5.1 is the default gateway, so it cannot be assigned to a node, hence the need to change the address.

I was hoping that adding the gateway to the network config would make Luna skip the .5.1 address when creating nodes, but it does not.

xdkreij commented 1 week ago

Side question:

In tmpl_nodeboot.cfg, does luna.gw={{ NETWORK_GATEWAY }} work?

It does not seem to pick up the gateway from luna network show cpu.cluster:

imgargs kernel root=luna luna.bootproto=static luna.mac=00:50:56:03:16:b9 luna.ip=100.66.5.2/24 **luna.gw=** luna.u

Edit: it seems the gateway is set in templ_dhcpd.cfg like so: option routers {{ SUBNETS[SUBNET]['gateway'] }};

aphmschonewille commented 4 days ago

The various templates are used at different stages. Luna tries to use the most suitable gateway if one is supplied or needed. For tmpl_nodeboot.cfg/tmp_ipxe_boot.cfg, it depends on whether you are on the same network as the controller(s) and whether you have specified a gateway in the network definition. This is primarily so the node can reach the controller and installation can continue. For tmpl_install.cfg, all networks/interfaces are supplied to create the interface files, which are used after installation and after the pivot. tmpl_dhcpd.cfg is used to create /etc/dhcp/dhcpd.conf and is not used for booting.

xdkreij commented 2 days ago

@aphmschonewille Thanks for the information. Can you give me pointers on how to troubleshoot luna/iPXE so that I can gather more information on why only 100.66.5.1 actually boots over iPXE, while other addresses such as 100.66.5.2, .5.3, .5.4, etc. do not?

I truly wonder why only the gateway IP 100.66.5.1 works as a node IP address. I would rather have 100.66.5.1 skipped for nodes and used only as the default route/gateway, and start node addresses at a different range, for example .5.100 - .5.200.
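
For reference, the address plan described here (nodes in .5.100 - .5.200, with the gateway never handed to a node) can be sketched with Python's standard `ipaddress` module. This is only an illustration and not part of luna: the function name `node_addresses` and the reserved list are assumptions, and each resulting address would still have to be passed to luna per node, as with `-I` in the reproduction steps.

```python
import ipaddress

def node_addresses(network, first_host, last_host, reserved=()):
    """Yield host addresses from first_host to last_host (offsets within
    the network), skipping reserved addresses such as the gateway.

    Hypothetical helper for illustration; luna does not provide it.
    """
    net = ipaddress.ip_network(network)
    reserved_set = {ipaddress.ip_address(a) for a in reserved}
    base = int(net.network_address)
    for offset in range(first_host, last_host + 1):
        addr = ipaddress.ip_address(base + offset)
        # Stay inside the subnet and never hand out a reserved address.
        if addr in net and addr not in reserved_set:
            yield addr

# Gateway and controller addresses are taken from this thread.
plan = list(node_addresses("100.66.5.0/24", 100, 200,
                           reserved=("100.66.5.1", "100.66.5.240")))
print(len(plan), plan[0], plan[-1])
```

With the values above, the plan covers 101 addresses from 100.66.5.100 through 100.66.5.200.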

xdkreij commented 18 hours ago

@aphmschonewille

I think I've found the issue (I didn't have much time to troubleshoot earlier due to upcoming holidays). Posting the solution here for future reference and for others who might come across similar issues.


- verify routing works

on `node001`
attempted to ping the gateway `ping 100.66.5.1`  - this was successful
attempted to ping the controller `ping 100.66.5.240`  - this was not successful 

on `cpu controller`
attempted to ping the gateway `ping 100.66.5.1`  - this was successful
attempted to ping local interface `ping 100.66.5.240`  - this was successful
attempted to ping the node001 `ping 100.66.5.2`  - this was successful
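
The checks above can be repeated from either host with a small script. This is a minimal sketch and assumes a Linux `ping` that accepts `-c` (count) and `-W` (timeout in seconds). The helper names `ping_once` and `check_reachability` are made up for illustration, and the addresses are the ones from this thread.

```python
import subprocess

def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to host succeeds.

    Assumes a Linux ping that supports -c and -W.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def check_reachability(targets, probe=ping_once):
    """Probe every named target and return {name: reachable}."""
    return {name: probe(addr) for name, addr in targets.items()}

# Addresses as used in this thread; adjust for your own cluster.
TARGETS = {
    "gateway": "100.66.5.1",
    "controller": "100.66.5.240",
    "node001": "100.66.5.2",
}
```

Calling `check_reachability(TARGETS)` on both the node and the controller shows which direction fails. The `probe` argument exists only so the probe can be swapped out, for example in a dry run.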

Verified the network configuration: the network interface on the controller was configured as `100.66.5.240/32` 🤦 

I'm a bit embarrassed by such a mistake (I'm a former network engineer - CCNP) 🤣 🤣 🤣 

After changing the interface to `100.66.5.240/24`, everything worked fine!!

I'm not yet entirely sure why `100.66.5.1` did work, but I suspect it has to do with the controller interface also having the gateway `100.66.5.1` configured correctly: with a /32, which covers only `.5.240` itself, the gateway was probably the only other address the controller could reach directly.
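
The effect of the prefix length on the addresses in this thread can be checked with Python's standard `ipaddress` module. This is a sketch for illustration only: `on_link` is a made-up helper name, and it says nothing about how the controller's routing table actually looked.

```python
import ipaddress

def on_link(interface_cidr, peer):
    """True if peer lies inside the subnet configured on the interface."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(peer) in iface.network

for cidr in ("100.66.5.240/32", "100.66.5.240/24"):
    net = ipaddress.ip_interface(cidr).network
    print(f"{cidr}: network {net}, {net.num_addresses} address(es)")
    for peer in ("100.66.5.1", "100.66.5.2"):
        print(f"    {peer} on-link: {on_link(cidr, peer)}")
```

With `/32` the interface's network contains a single address, so neither `100.66.5.1` nor `100.66.5.2` counts as on-link. With `/24` both do.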

Thanks for all endeavors made so far!