Closed: xdkreij closed this 5 months ago
@msteggink Appreciate your support on this :-) Thanks in advance!
Hi @xdkreij, is this a new install or an existing one?
The 'could not get install script [000]' message refers to an issue in the Luna2 client phase. Can you check which node has been selected (dhcpd DHCPOFFER or luna node show)? Can you ssh to the node in the installer phase?
Can you try to restart luna2-daemon (systemctl restart luna2-daemon)?
Also, please provide the following output (where <image> is most likely 'compute'): lchroot <image>, then rpm -qa | grep luna2, then exit.
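Something along these lines, as a sketch of those checks (with 'compute' assumed as the image name):
```
# restart the Luna2 daemon on the controller
systemctl restart luna2-daemon

# then check which luna2 client is installed inside the image
lchroot compute
rpm -qa | grep luna2
exit
```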
```
rpm -qa | grep luna2
# chroot compute/
# rpm -qa | grep luna2
luna2-client-2.0-13.noarch.x86_64
```
Yes, it's a clean install of the controller, and the image (compute)
I've removed all default nodes, and had to fix node.py to be able to add a new one (seems to be an old luna bug)
I'm currently testing with the newly created node, but here's the default output after adding it with luna node add -g $GROUP -if BOOTIF -M $MAC $NAME:
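(With the values that ended up on the node below, that command comes down to roughly:)
```
luna node add -g compute -if BOOTIF -M 00:50:56:03:13:4a node001
```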
```
# luna node show node001
+-----------------------------------------------------------------------------------------+
| Node => node001 |
+---------------------+-------------------------------------------------------------------+
| name | node001 |
| hostname | node001.some.domain.com |
| group | compute |
| osimage | compute (compute) |
| osimagetag | default (default) |
| interfaces | interface = BOOTIF |
| | ipaddress = 10.1.5.1 |
| | macaddress = 00:50:56:03:13:4a |
| | network = some.domain.com |
| | interface = BMC |
| | ipaddress = 10.148.0.1 |
| | macaddress = None |
| | network = ipmi |
| status | None |
| vendor | None |
| assettag | None |
| switch | None |
| switchport | None |
| setupbmc | False (compute) |
| bmcsetup | compute (compute) |
| unmanaged_bmc_users | None |
| netboot | True (compute) |
| localinstall | False (compute) |
| bootmenu | False (compute) |
| roles | None |
| service | False |
| prescript | <empty> (default) |
| partscript | (compute) mount -t tmpfs tmpfs /sysroot |
| postscript | (compute) echo 'tmpfs / tmpfs defaults 0 0' >> /sysroot/etc/fstab |
| provision_interface | BOOTIF (default) |
| provision_method | torrent (cluster) |
| provision_fallback | http (cluster) |
| tpm_uuid | None |
| tpm_pubkey | None |
| tpm_sha256 | None |
| comment | None |
+---------------------+-------------------------------------------------------------------+
```
Even better: now that I've added the node statically in luna with the MAC address and IP, it doesn't boot iPXE at all :face_with_spiral_eyes:
Edit: when changing the dhcp config so that the IP address in filename and next-server points to the internal node network, iPXE again boots nicely to the boot menu. However, with the defaults above and 'Ask luna for a node name', it still returns:

http://<ip>/boot/search/mac/<mac> network unreachable

I think I know why: the IP provided in the message isn't on the internal node network. Is this a luna config setting? I would have expected that an attempt to install a node only has to reach its own internal network, and not directly the other network that has been set as primary on the controller.
Seems to work fine on both the internal node network as well as the main controller network:
```
ss -tulpn | grep 7051
tcp LISTEN 0 128 0.0.0.0:7051 0.0.0.0:* users:(("nginx",pid=1760,fd=8),("nginx",pid=1759,fd=8),("nginx",pid=1758,fd=8),("nginx",pid=1757,fd=8),("nginx",pid=1755,fd=8))
```
When testing with curl..
```
curl http://10.1.5.240:7051/boot/search/mac/00:50:56:03:13:4a
#!ipxe
imgfetch -n kernel http://10.1.2.220:7051/files/compute-1707223911-vmlinuz-4.18.0-372.26.1.el8_6.x86_64
imgload kernel
imgargs kernel root=luna luna.bootproto=static luna.mac=00:50:56:03:13:4a luna.ip=10.1.5.1/24 luna.gw= luna.url=https://10.1.2.220:7050 luna.verifycert=False luna.node=node001 luna.hostname=node001 luna.service=0 net.ifnames=0 biosdevname=0 initrd=initrd.img boot=ramdisk
imgfetch --name initrd.img http://10.1.2.220:7051/files/compute-1707223911-initramfs-4.18.0-372.26.1.el8_6.x86_64
imgexec kernel
```
```
curl http://10.1.2.220:7051/boot/search/mac/00:50:56:03:13:4a
#!ipxe
imgfetch -n kernel http://10.1.2.220:7051/files/compute-1707223911-vmlinuz-4.18.0-372.26.1.el8_6.x86_64
imgload kernel
imgargs kernel root=luna luna.bootproto=static luna.mac=00:50:56:03:13:4a luna.ip=10.1.5.1/24 luna.gw= luna.url=https://10.1.2.220:7050 luna.verifycert=False luna.node=node001 luna.hostname=node001 luna.service=0 net.ifnames=0 biosdevname=0 initrd=initrd.img boot=ramdisk
imgfetch --name initrd.img http://10.1.2.220:7051/files/compute-1707223911-initramfs-4.18.0-372.26.1.el8_6.x86_64
imgexec kernel
```
Regardless, iPXE can't find it..
```
http://10.1.5.240:7051/boot/mac/search/00:50:56:03:13:4a... No such file or directory (https://ipxe.org/2d0c613b)
```
Hi there. I noticed two things:
- http://10.1.2.220:7051/
- luna.url=https://100.66.2.220:7050

In a standard setup, these do not add up (as in: normally, the IP address is equal). Could you give us the output of:
- luna cluster
- luna network list
- luna network show

thanks! -A
Ignore the 100.66. IP prefix; it's 10.1 and 10.2. I changed it, but it seems the copy/paste didn't go well ;-)
Regardless, I'll provide the output of luna cluster tomorrow, as I'm currently without VPN access to the test cluster :-)
Not sure if the problem persists, but there have been bug fixes since this issue was reported. Could you update Luna by running ansible-playbook controller.yml --tags=luna within the trinityx-combined/site directory?
If the problem is still there after the update, please let me know.
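That is, roughly the following (assuming the checkout lives where you originally ran the playbooks from):
```
cd trinityx-combined/site
ansible-playbook controller.yml --tags=luna
```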
@aphmschonewille I've been very busy due to numerous reasons (priorities regarding this project, for one), so I wasn't able to continue with this -_-"
Regardless.. I may have found a clue...
So in dhcpd.conf, within the subnet 10.1.5.0 netmask 255.255.255.0 { ... } part, next-server 10.1.2.220 is directed towards the external (WWW-facing) NIC. However, when configuring next-server to sit on the same subnet as 10.1.5.0 (the node-facing NIC), so 10.1.5.240, iPXE works, right up until the point where it tries to boot after 'Ask luna-server for a node name'. It then redirects back to 10.1.2.220, which results in pretty much the same issue that occurred before changing next-server to the node-facing NIC on the controller.

This is where I'm stuck: is next-server supposed to face the WWW-facing NIC or the node-facing (DHCP) NIC?
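For illustration, the kind of subnet block being discussed is sketched below (not my literal config; only the subnet and the two next-server candidates come from above, the range, routers and filename values are placeholders):
```
subnet 10.1.5.0 netmask 255.255.255.0 {
    option routers 10.1.5.240;       # placeholder: controller as gateway on the node network
    range 10.1.5.1 10.1.5.200;       # placeholder address pool
    next-server 10.1.5.240;          # node-facing NIC; the default was 10.1.2.220 (external NIC)
    filename "luna_undionly.kpxe";   # placeholder, image name as mentioned later in this thread
}
```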
Any clue why the original config of next-server 10.1.2.220 might not work, even though both addresses are reachable via curl?

(next-server 10.1.5.240: iPXE works, but the luna boot option 'Ask luna for node name' fails:)
```
curl http://10.1.5.240:7051/boot/search/mac/00:50:56:03:13:4a
#!ipxe
imgfetch -n kernel http://10.1.2.220:7051/files/compute-1707223911-vmlinuz-4.18.0-372.26.1.el8_6.x86_64
imgload kernel
imgargs kernel root=luna luna.bootproto=static luna.mac=00:50:56:03:13:4a luna.ip=10.1.5.20/24 luna.gw= luna.url=https://10.1.2.220:7050 luna.verifycert=False luna.node=node001 luna.hostname=node001 luna.service=0 initrd=initrd.img boot=ramdisk
imgfetch --name initrd.img http://100.66.2.220:7051/files/compute-1707223911-initramfs-4.18.0-372.26.1.el8_6.x86_64
imgexec kernel
```
(next-server 10.1.2.220; iPXE fails....)
```
curl http://10.1.2.220:7051/boot/search/mac/00:50:56:03:13:4a
#!ipxe
imgfetch -n kernel http://10.1.2.220:7051/files/compute-1707223911-vmlinuz-4.18.0-372.26.1.el8_6.x86_64
imgload kernel
imgargs kernel root=luna luna.bootproto=static luna.mac=00:50:56:03:13:4a luna.ip=10.1.5.20/24 luna.gw= luna.url=https://10.1.2.220:7050 luna.verifycert=False luna.node=node001 luna.hostname=node001 luna.service=0 initrd=initrd.img boot=ramdisk
imgfetch --name initrd.img http://10.1.2.220:7051/files/compute-1707223911-initramfs-4.18.0-372.26.1.el8_6.x86_64
imgexec kernel
```
EDIT: Most likely this is because the node itself only has one NIC, on the 10.1.5.0 subnet, so it doesn't know how to reach the 10.1.2.0 subnet. Are nodes supposed to have access to both the WWW-facing NIC as well as a controller <---> node NIC connection?
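(A quick way to confirm that from the node side, once it has an address, would be something like the commands below; the interface name is an assumption based on net.ifnames=0 in the kernel arguments:)
```
# on the node: only a 10.1.5.0/24 route exists, so 10.1.2.220 has no path
ip -4 addr show eth0
ip route
ping -c1 10.1.5.240   # controller, node-facing NIC: reachable
ping -c1 10.1.2.220   # controller, external NIC: unreachable without a gateway/route
```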
@aphmschonewille, @msteggink
An exact representation of what's going on...
The bootloader is referring back to the network that isn't reachable... Is it possible to make sure everything is reachable via the node network only? (This includes changing the boot loader to include the correct address of the node-facing NIC.)
Hi,
I'm starting to understand your problem better. I think the IP addresses, internal and external, of the controller are mixed up, or only the controller's external IP was used. If this is the case, I'm afraid the approach will not work (optimally), as you have noticed.
If I misunderstood, could you help me by explaining what your setup looks like?
```
      ___________                    ___________      ___ node001
     /           \   +------------+ /           \    /
-----| external  |---| controller |-| internal  |---+--- node002
     |    net    | ^ +------------+^|    net    |    \
     \___________/ |               |\___________/     --- node003
              x.x.5.240?       x.x.x.x?
```
-A
```
      ___________                                   ___________      ___ node001
     /           \          +------------+          /           \    /
-----| external  |--ens192--| controller |--ens256--| internal  |---+--- node002
     |    net    |    ^     +------------+     ^    |    net    |    \
     \___________/    |                        |    \___________/     --- node003
                 x.x.5.240?                x.x.x.x?
```
This is pretty much accurate. The nodes cannot reach ens192. (This is a similar set-up to one of the Cluster Vision GPU cluster projects in the past, where I'm currently hired.)
@aphmschonewille - I've contacted our hosting provider to fix their routing (and add some ports).. to be continued :-)
(Note... the below is apparently easy to accomplish; it might be an idea to add the option to ansible when creating the images luna_undionly.kpxe and luna_ipxe.efi. Example: make bin-x86_64-efi/ipxe.efi DEBUG=tcp,ipv4 --> ansible-play ........ --tags=ipxe_debugging (or something).)
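(For reference, with a stock iPXE source tree the debug-enabled builds of those two flavours would look roughly like this; which DEBUG objects are actually useful here is a guess:)
```
# inside the iPXE src/ directory
make bin/undionly.kpxe DEBUG=tcp,ipv4
make bin-x86_64-efi/ipxe.efi DEBUG=tcp,ipv4
```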
Hi @xdkreij,
Normally compute nodes do not have to contact ens192 of the controller, as all relevant services are provided on ens256. When you run ansible the first time, trix_ctrl1_hostip is used to determine which interface (IP) is the internal one, and luna is configured to use that as next-server in dhcp. No magic here, but if it is not set correctly, you may end up using the wrong interface, where you then need routing etc. to make it work. This can however be altered using luna-cli (check: luna cluster, and luna network). Maybe I misunderstood why the nodes need to reach ens192 (and further?), so please correct me if I'm wrong.
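To double-check this on your controller, something along these lines should show whether the internal IP was picked up everywhere (the dhcpd.conf path and the all.yml location are assumptions for a default install):
```
# which IP did ansible treat as the internal (node-facing) one?
grep trix_ctrl1_hostip trinityx-combined/site/group_vars/all.yml

# what does dhcp actually hand out as next-server?
grep -E 'next-server|subnet' /etc/dhcp/dhcpd.conf

# compare against the node-facing interface and luna's own view
ip -4 addr show ens256
luna cluster
luna network list
```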
Building PXE kernels with debugging is something we can add as a flag, but in normal circumstances you probably never need it. I have never really used it (maybe once, trying to make https without a cert work?) in many years of booting nodes. I'll put it on the agenda, though.
thanks! -A
can you provide the output of
-A
Seems the PXE boot issue has been resolved by moving to a single interface within the 'all.yml' file. PXE now continues to boot (up to another compute image issue I'll have to deal with later). This was indeed a routing challenge. Ticket can be closed. Thanks for helping out!
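(For anyone finding this later: a sketch of the kind of all.yml setting this refers to; the variable name is the one mentioned earlier in the thread, the path and exact value here are assumptions:)
```
# trinityx-combined/site/group_vars/all.yml (path assumed)
trix_ctrl1_hostip: 10.1.5.240   # single, node-facing controller IP used for dhcp/luna
```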
Problem description
When attempting to boot a node with iPXE, the following happens:
- Several reboots are required before iPXE fetches /boot over http; before that, it attempts to fetch the filename directly, and this seems to be unstable. However, I am attempting this from a virtual machine, so the dhcpd.conf might need some refinement to get this to a stable state.
- Once the boot menu starts, it can't fetch any default; with "Ask Luna-server for node name", if we let the timer count down, it simply reboots the VM.
- Once the boot menu starts, it will only continue booting the node when the option "Choose first available node in category or (g)group" has been selected.
- Once the boot continues, it grinds to a halt with the message:
Luna2: Could not get install script [000]. Sleeping 10 sec.
dhcpd.conf
What has been done
Expected results
Luna2: Could not get install script [000]. Sleeping 10 sec