OpenFabrics / fsdp_setup

Setup scripts for use with the FSDP cluster
GNU Lesser General Public License v2.1

Final issues to declare phase I complete #69

Closed dledford closed 2 years ago

dledford commented 2 years ago

builder-00: Need to update the interface definitions to match the fabric layout so DHCP will work properly. See issue #63. Also need to update the system /etc/opa-fm/opafm.xml file with the one in ~dledford/opafm.xml. This needs a test: I'm not positive that the mcast MTU I set on partition 8022 will work. If it doesn't, it needs to be reduced to the maximum allowed value, but I think 10240 is the maximum allowed value on the fabric.
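For checking the MTU by script rather than by eye, here is a minimal sketch. It assumes the multicast MTU appears in opafm.xml as an element named `MTU` holding a plain integer; that element name, the `oversized_mtus` helper, and the sample fragment are illustrative assumptions, not taken from the actual file. The limit is passed in by the caller because the real fabric maximum is still unconfirmed.

```python
import xml.etree.ElementTree as ET


def oversized_mtus(xml_text, limit):
    """Return every integer <MTU> value in xml_text that exceeds limit.

    Assumption: MTU settings are stored as <MTU>N</MTU> elements.
    Non-numeric values are skipped rather than guessed at.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter("MTU"):
        text = (elem.text or "").strip()
        if text.isdigit() and int(text) > limit:
            found.append(int(text))
    return found


# Hypothetical fragment for illustration only, not the real opafm.xml layout.
SAMPLE = """
<Config>
  <Group><MTU>2048</MTU></Group>
  <Group><MTU>10240</MTU></Group>
</Config>
"""

if __name__ == "__main__":
    print(oversized_mtus(SAMPLE, limit=8192))
```

Against the real file, the text would come from reading /etc/opa-fm/opafm.xml, and `limit` would be whatever maximum the fabric turns out to allow.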

node-01 through node-04: OPA issues caused by the need to update the opafm.xml file on builder-00, plus DHCP issues with the same cause.

node-01: The x722 interface has its cable plugged into port2 instead of port1. We can either a) move the cable and update the machine definition by reverting the change that set up port2 as the iw port, or b) update test_nodes.md to reflect the actual port usage.

node-02: Same x722 issue as node-01

node-07: Broadcom card is not coming up. The kernel shows the card as a single port card, while the test_nodes.md document shows that it should be dual port. It may need someone to check the card's settings in the BIOS to make sure it is configured properly.

node-09: Both ports on the Marvell card are down. Port1 should be up and configured as a RoCE port on the switch. Port2 should be up and configured as an iWARP port on the switch.

node-10: Same Marvell card problem as node-09.

All InfiniBand nodes: Not all IB subnets are coming up. The switch probably needs to be checked to make sure that the subnet manager is set up properly. For node-09 and node-10, which require bifurcated IB support, the second PCI devices are not coming up at all. This needs debugging on the switch too.

lylavoie commented 2 years ago

The OPA fabric changes should be all set. These were changes in the interface configs, and the FM needed to be updated as well (done now); 8024 was missing from the XML. Still need to check the MTU.

Nodes 01 and 02, x722 card: we've swapped the ports, but now in Linux (as currently set up) I see two interfaces, x722_off (f8:f2:1e:bd:4d:be) and x722_iw (f8:f2:1e:bd:4d:bf), where x722_off now has Link-Up.
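To re-check which of the two interfaces has link after any further cable or config change, a small sketch like the following could be run on the node. It reads the standard Linux sysfs files `address` and `operstate` under /sys/class/net; the `link_report` helper and its `sysfs_root` parameter are made up for illustration (the parameter only exists so the function can be pointed at a test directory).

```python
from pathlib import Path


def link_report(ifaces, sysfs_root="/sys/class/net"):
    """Map each interface name to (mac, operstate), or None if it is absent."""
    report = {}
    for name in ifaces:
        base = Path(sysfs_root) / name
        if not base.is_dir():
            # Interface is not present under this name on this machine.
            report[name] = None
            continue
        mac = (base / "address").read_text().strip()
        state = (base / "operstate").read_text().strip()
        report[name] = (mac, state)
    return report


if __name__ == "__main__":
    # Interface names as they currently appear on node-01 and node-02.
    for name, info in link_report(["x722_off", "x722_iw"]).items():
        print(name, info)
```

An `operstate` of `up` on x722_off and `down` on x722_iw would match what is described above.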

node-07 Broadcom is being tracked in https://github.com/OpenFabrics/fsdp_docs/issues/81

node-09 / 10 Marvell card: do you mean the QLogic card? The switch is configured for 25G breakout, and the cables are all connected. Need to check the server BIOS.

node-09 / 10 IB port 2: I'm not sure on this one. What needs to change on the switch config to "enable" the second port on the other PCI device?

lylavoie commented 2 years ago

@dledford will contact Mellanox about the bi-PCI card setup. @lylavoie will create a separate issue to track the node-09/10 QLogic card.