Closed by tzumainn 1 month ago
After some investigation, this is what I saw for MOC-R4PAC08U37-S1A.
The node has this bare metal port:
+-----------------------+--------------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                              |
+-----------------------+--------------------------------------------------------------------------------------------------------------------+
| address               | e0:d8:48:e3:0c:81                                                                                                  |
| local_link_connection | {'switch_info': 'MOC-R4PAC08-SW-TORS-A', 'port_id': 'tengigabitethernet 1/27/1', 'switch_id': 'd8:9e:f3:ae:e5:a2'} |
+-----------------------+--------------------------------------------------------------------------------------------------------------------+
When I provision the node, it initially attaches to the provisioning network (623); however, I see two entries for that port in the switch's MAC address table:
MOC-R4PAC08-SW-TORS-A#show mac-address-table | grep 1/27/1
623 e0:d8:48:e3:0c:81 Dynamic Te 1/27/1 Active
623 e0:d8:48:e3:0c:83 Dynamic Te 1/27/1 Active
After provisioning is over, the node is switched over to the private network (628), and only one entry is left - which doesn't correspond to the bare metal port mac address in Ironic:
MOC-R4PAC08-SW-TORS-A#show mac-address-table | grep 1/27/1
628 e0:d8:48:e3:0c:83 Dynamic Te 1/27/1 Active
@hakasapl @naved001 I'm not sure what to make of this behavior!
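The comparison above (Ironic port address versus what the switch has learned on the port) could be scripted roughly as follows. This is only a sketch: it assumes the `show mac-address-table` output has the six whitespace-separated columns shown in this comment, and `parse_mac_table` and `unexpected_macs` are hypothetical helper names, not part of Ironic or the switch CLI.

```python
def parse_mac_table(output):
    """Parse lines like '623 e0:d8:48:e3:0c:81 Dynamic Te 1/27/1 Active'.

    Assumes the column layout pasted in this thread; other switch
    models or firmware versions may print a different format.
    """
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Expected columns: VLAN, MAC, type, interface prefix, port, state
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        vlan, mac, _kind, if_prefix, port, state = fields
        entries.append({
            "vlan": int(vlan),
            "mac": mac.lower(),
            "port": f"{if_prefix} {port}",
            "state": state,
        })
    return entries


def unexpected_macs(output, ironic_mac):
    """Return MACs learned on the port that differ from the Ironic port address."""
    seen = {entry["mac"] for entry in parse_mac_table(output)}
    return sorted(seen - {ironic_mac.lower()})


# Switch output copied from this comment.
provisioning = """\
623 e0:d8:48:e3:0c:81 Dynamic Te 1/27/1 Active
623 e0:d8:48:e3:0c:83 Dynamic Te 1/27/1 Active
"""
private = "628 e0:d8:48:e3:0c:83 Dynamic Te 1/27/1 Active\n"

print(unexpected_macs(provisioning, "e0:d8:48:e3:0c:81"))
print(unexpected_macs(private, "e0:d8:48:e3:0c:81"))
```

In both cases the only unexpected address is the one ending in 83; the difference is that on the private network the Ironic address is missing from the table entirely.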
Looks like the MAC address ending in 83 is the "Virtual FIP MAC address", which can be seen under the device properties.
FIP = FCoE Initialization Protocol, and FCoE = Fibre Channel over Ethernet. I'm not sure why this protocol needs its own MAC address; this is the first time I'm seeing it.
What exactly did you have to do to get it to successfully boot?
For what it's worth, it only tried to PXE boot with the MAC address that's stored in Ironic, so it should just work.
I did look into disabling whatever that mac is, but couldn't find any options in the NIC configuration.
I should clarify: it always successfully PXE booted for me, which makes sense because during provisioning there's an entry for the Ironic MAC address in the mac address table.
What fails is after provisioning, when the node is moved onto the private network. When that happens, the entry for the Ironic MAC address goes away, and only the virtual FIP MAC address can be found in the mac address table. Because of the mismatch, the fixed IP doesn't work and the node is unreachable.
After playing around with one of the nodes on Friday, I was able to reproduce the issues with a centos9 image but not with an ubuntu image.
My suspicion is that the centos9 image doesn't correctly bring the interface up so the switch doesn't see that mac address on the interface but only sees the virtual FIP mac (which isn't visible in the OS as far as I can tell, and looks like it's just always active).
I am going to investigate a bit more.
I tested the centos9-stream-whole image and ran into the same issue. I'm going to try the centos8 image next.
I set the root password for a centos9 machine from single-user mode (the shell prompt was on the serial console). After that I tried to bring the interface up but it failed with the following error:
[root@MOC-R4PAC08U37-S1B ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e0:d8:48:e3:0d:6f brd ff:ff:ff:ff:ff:ff
altname eno1
altname enp1s0f0
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e0:d8:48:e3:0d:72 brd ff:ff:ff:ff:ff:ff
altname eno2
altname enp1s0f1
[root@MOC-R4PAC08U37-S1B ~]# ip link set eth0 up
[ 505.548593] bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.21.0.fw failed with error -2
[ 505.548638] bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.15.0.fw failed with error -2
[ 505.548641] bnx2x: [bnx2x_func_hw_init:6004(eth0)]Error loading firmware
[ 505.548651] bnx2x: [bnx2x_nic_load:2736(eth0)]HW init failed, aborting
RTNETLINK answers: No such file or directory
This suggests that the bnx2x (QLogic NIC) firmware files are missing from this centos image. As a result, the interface never comes up, so the switch never sees this MAC address.
This image may work for machines that use a different type of NIC.
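To spot this failure without reading the console by hand, the kernel messages can be scanned for failed firmware loads. The sketch below is illustrative only; `missing_firmware` is a hypothetical helper, and the regular expression is written against the exact message wording pasted above. Error -2 is `-ENOENT`, which matches the "No such file or directory" reply from RTNETLINK.

```python
import errno
import re

# Matches kernel messages such as:
#   bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.21.0.fw failed with error -2
FW_FAIL = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def missing_firmware(log_text):
    """Return (firmware path, error code) pairs for each failed load, in log order."""
    return [(m.group(1), int(m.group(2))) for m in FW_FAIL.finditer(log_text)]


# Console output copied from this comment.
log = """\
[ 505.548593] bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.21.0.fw failed with error -2
[ 505.548638] bnx2x 0000:01:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.15.0.fw failed with error -2
[ 505.548641] bnx2x: [bnx2x_func_hw_init:6004(eth0)]Error loading firmware
[ 505.548651] bnx2x: [bnx2x_nic_load:2736(eth0)]HW init failed, aborting
"""

for path, code in missing_firmware(log):
    # The kernel reports negative errno values; look up the symbolic name.
    name = errno.errorcode.get(-code, "unknown")
    print(f"{path}: error {code} ({name})")
```

The paths it reports are relative to the firmware directory (for example /lib/firmware), so they can be fed directly into a check of the image contents.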
I did get the networks up on this image.
I downloaded the firmware files from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/bnx2x. The 2 files it's looking for are bnx2x-e2-7.13.21.0.fw and bnx2x-e2-7.13.15.0.fw.
I downloaded those and created an ISO, which I then mounted via the iDRAC. After that I created the directory /lib/firmware/bnx2x/, copied the firmware files into it, and then ran rmmod bnx2x followed by modprobe bnx2x to reload the kernel module. The NIC then came up successfully:
[root@MOC-R4PAC08U37-S1B bnx2x]# rmmod bnx2x
[root@MOC-R4PAC08U37-S1B bnx2x]# modprobe bnx2x
[ 7142.669125] bnx2x 0000:01:00.0: msix capability found
[ 7142.681353] bnx2x 0000:01:00.0: part number 0-0-0-0
[ 7142.793417] bnx2x 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[ 7142.793531] bnx2x 0000:01:00.1: msix capability found
[ 7142.805301] bnx2x 0000:01:00.1: part number 0-0-0-0
[ 7143.486643] bnx2x 0000:01:00.0 eth0: using MSI-X IRQs: sp 37 fp[0] 39 ... fp[7] 46
[ 7143.559112] bnx2x 0000:01:00.0 eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
[ 7143.640925] bnx2x 0000:01:00.1: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
After that I brought the interface up and it got an IP:
[root@MOC-R4PAC08U37-S1B bnx2x]# ip r
default via 192.168.50.1 dev eth0 proto dhcp src 192.168.50.166 metric 100
169.254.169.254 via 192.168.50.10 dev eth0 proto dhcp src 192.168.50.166 metric 100
192.168.50.0/24 dev eth0 proto kernel scope link src 192.168.50.166 metric 100
This is obviously not a feasible solution, but it's clear what the problem is. We need to build the image with whatever kernel has the firmware files.
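Before uploading a rebuilt image, it may be worth checking that the firmware files are actually in it. A minimal sketch, assuming the image's root filesystem is mounted somewhere locally; `firmware_missing` is a hypothetical helper, `/mnt/image` is a placeholder mount point, and accepting compressed `.xz`/`.zst` variants is an assumption that depends on what the image's kernel can load.

```python
from pathlib import Path

# Firmware names taken from the kernel messages earlier in this thread.
REQUIRED = [
    "bnx2x/bnx2x-e2-7.13.21.0.fw",
    "bnx2x/bnx2x-e2-7.13.15.0.fw",
]

# Assumption: some distributions ship compressed firmware, so a file with
# one of these extra suffixes also counts as present.
SUFFIXES = ["", ".xz", ".zst"]


def firmware_missing(image_root, names=REQUIRED):
    """Return the firmware names with no matching file under <image_root>/lib/firmware."""
    fw_dir = Path(image_root) / "lib" / "firmware"
    return [
        name for name in names
        if not any((fw_dir / (name + suffix)).is_file() for suffix in SUFFIXES)
    ]


# Placeholder path: point this at wherever the image is mounted.
missing = firmware_missing("/mnt/image")
if missing:
    print("missing firmware:", ", ".join(missing))
else:
    print("all bnx2x firmware files present")
```

The check only looks at file names, not at whether the driver in that image's kernel requests these exact versions, so it complements a boot test rather than replacing one.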
Useful thread: https://groups.google.com/g/linux.debian.bugs.dist/c/hDfdrz9gODI
The new centos9-stream (built with parameters suggested by derek) seems to have solved this issue.
These nodes provision, but one cannot ping them afterwards on their fixed IP (using ip netns on the controller against the relevant router). Another node worked. It looked like, after provisioning, the MAC address listed in Ironic did not correspond to what showed up in the MAC address table on the switch.