Closed larsks closed 1 year ago
It looks like the answer here is "yes", now that we're running recent firmware on the prod hosts. @hakasapl there's an Ansible inventory to help with this available at https://github.com/OCP-on-NERC/nerc-ansible, along with some example playbooks that use racadm or redfish to interact with the hardware (we'll probably use the redfish api for this).
See e.g. rf-get-inventory.yaml
@hakasapl can you publish a gpg key at https://keys.openpgp.org/ (and include the link here)? I'll use that to get you some credentials.
@hakasapl what is the status of gathering the PCI slot information on the NERC clusters?
I have a VPN account now, but I'm still trying to sort out client issues with multiple local users; this is ongoing.
Making use of the scripts found in this repo: https://github.com/dell/iDRAC-Redfish-Scripting and using the host list found here
The following hosts did not support redfish (@larsks do the controllers support redfish?):
ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu
ctl-0-obm.nerc-ocp-prod.rc.fas.harvard.edu
ctl-1-obm.nerc-ocp-infra.rc.fas.harvard.edu
ctl-1-obm.nerc-ocp-prod.rc.fas.harvard.edu
ctl-2-obm.nerc-ocp-infra.rc.fas.harvard.edu
ctl-2-obm.nerc-ocp-prod.rc.fas.harvard.edu
The following hosts did not reply at all (these are all test nodes, so I assume this is intended):
wrk-0-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-1-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-2-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-3-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-4-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-5-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-6-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-7-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-8-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-9-obm.nerc-ocp-test.rc.fas.harvard.edu
Everything else replied correctly, and every 10G NIC is in PCIe bus-slot 1-0. So the NICs are consistent. I'm attaching the raw output I'm basing this on below:
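The consistency check described above can be sketched in a few lines of Python. This is only an illustration: the payload shape below is an assumption loosely based on the Redfish `PCIeDevices` collection, and the sample data is hypothetical, not captured from these hosts.

```python
# Sketch: verify that every 10G NIC reports the same PCIe bus-slot ID.
# The "Id"/"Name" field names are illustrative assumptions; real iDRAC
# output may differ by model and firmware.

def nic_slots(pcie_devices):
    """Map NIC name -> PCIe device ID (e.g. '1-0') for 10G NICs."""
    return {
        dev["Name"]: dev["Id"]
        for dev in pcie_devices
        if "10G" in dev.get("Name", "")
    }

def slots_consistent(per_host):
    """True if all hosts report the same set of slot IDs for their NICs."""
    slot_sets = {frozenset(nic_slots(devs).values()) for devs in per_host.values()}
    return len(slot_sets) == 1

# Hypothetical sample mirroring the result described above:
sample = {
    "wrk-0": [{"Id": "1-0", "Name": "Intel 10G X710 NIC"}],
    "wrk-1": [{"Id": "1-0", "Name": "Intel 10G X710 NIC"}],
}
print(slots_consistent(sample))  # -> True
```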
@hakasapl The controllers do support redfish:
$ https -pb --verify false --auth '...' https://ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu/redfish/v1/Systems
{
"@odata.etag": "\"27f6eb13fa1c28a2711\"",
"@odata.id": "/redfish/v1/Systems",
"@odata.type": "#ComputerSystemCollection.ComputerSystemCollection",
"Description": "A collection of ComputerSystem resource instances.",
"Members": [
{
"@odata.id": "/redfish/v1/Systems/1"
}
],
"Members@odata.count": 1,
"Name": "ComputerSystemCollection"
}
Note that the controllers are not Dell systems; you'll find the "system" resource at a different path, which is why the rf-get-inventory playbook looks like this:
- name: get system
delegate_to: localhost
uri:
url: "https://{{ bmc_addr }}/redfish/v1/Systems/{{ rf_system_id }}"
user: "{{ bmc_user }}"
password: "{{ bmc_password }}"
validate_certs: false
force_basic_auth: true
The `rf_system_id` variable comes from the inventory and is set based on the system vendor, but you can also just discover it by first requesting `/redfish/v1/Systems` and then getting the value from the `Members` array.
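For example, the system ID can be pulled out of a Systems collection response like the one shown above (abridged here) with a couple of lines of Python; this is a sketch that assumes the standard Redfish collection shape:

```python
import json

# Abridged Systems collection response, as returned by the controller above.
collection = json.loads("""
{
  "@odata.id": "/redfish/v1/Systems",
  "Members": [
    {"@odata.id": "/redfish/v1/Systems/1"}
  ],
  "Members@odata.count": 1
}
""")

# The system ID is the last path segment of the first member's @odata.id.
rf_system_id = collection["Members"][0]["@odata.id"].rstrip("/").split("/")[-1]
print(rf_system_id)  # -> 1
```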
> The following hosts did not reply at all (These are all test nodes so I assume this is intended):
Yeah, that's expected.
@larsks thanks for that info. The controllers do not have consistent PCIe slots. The ID is reported differently on these nodes since they are not Dells; they report slot numbers directly:
ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu: slot_6
ctl-0-obm.nerc-ocp-prod.rc.fas.harvard.edu: slot_5
ctl-1-obm.nerc-ocp-infra.rc.fas.harvard.edu: slot_5
ctl-1-obm.nerc-ocp-prod.rc.fas.harvard.edu: slot_6
ctl-2-obm.nerc-ocp-infra.rc.fas.harvard.edu: slot_5
ctl-2-obm.nerc-ocp-prod.rc.fas.harvard.edu: slot_5
Do we need to make these consistent?
> Do we need to make these consistent?
It would simplify things. It looks like that means moving the cards in these hosts into slot 5:
ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu
ctl-1-obm.nerc-ocp-prod.rc.fas.harvard.edu
We are using the infra cluster, so we should coordinate shutting it down.
Next week I will be at MGHPCC on Friday (the 29th) instead of Thursday. Could we schedule a downtime for that day? I'll be there the following week as well so no worries if next week doesn't work.
It looks like `ctl-0` and `ctl-1` will need to be shut down for the operation; `ctl-2` can stay running.
@larsks do we want this to be done before we start the Prod installation for ACM configuration purposes?
@hakasapl since `ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu` is on infra and `ctl-1-obm.nerc-ocp-prod.rc.fas.harvard.edu` is on prod, only one host on each cluster needs to be taken down. So yes, infra should be able to stay running on `ctl-1` and `ctl-2`. Prod is not running yet, but if it is when you do this, it should be able to stay up on `ctl-0` and `ctl-2`.
Since these are SD530s, I don't think I'll be able to swap the slot location; that may just be how the blade chassis is set up. Assuming I have key access by then, I'll take a look next Friday (the 29th).
@joachimweyl I don't think this is critical for next week.
@hakasapl what did we find out on Friday? Is there a way to move these NICs, or are the SD530s not swappable?
@joachimweyl I'm waiting on key access to R7 to be able to take a look. @msdisme any updates on that?
Scott said: I'm going to have to ask John Goodhue to create a new keyring with access for University RC and which racks those are associated with; otherwise, Hakan would have access to the entire row. Also, you could run the lspci command and see which slots have which devices (I think we already know the answer to that?)
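For completeness, the lspci route Scott mentions amounts to reading the domain:bus:device.function prefix off each Ethernet line of `lspci -D` output. A small sketch of that parsing (the sample line below is illustrative, not captured from these hosts):

```python
import re

def pci_address(lspci_line):
    """Extract (domain, bus, device, function) from an `lspci -D` output line."""
    m = re.match(r"([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])", lspci_line)
    return m.groups() if m else None

# Illustrative line in the format produced by `lspci -D`:
line = "0000:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710"
print(pci_address(line))  # -> ('0000', '01', '00', '0')
```

This only reports the PCI address as the OS sees it, of course; it doesn't tell you the physical slot label the way the Redfish/BMC inventory does.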
Quick follow-up: in this case I think we need to either find out whether the NICs can be moved, or look and see. This needs to be coordinated with shutting down infra, and we want to do it before we start depending on infra.
@jtriley - this is the ticket we are tracking against for the NICs in infra.
Blocked while waiting for R7 key access
@jtriley - any updates on this ? (comment here: https://github.com/CCI-MOC/ops-issues/issues/618#issuecomment-1204277993)
@jtriley - any updates on having teams check on this (is moving card possible?) and/or move the card.
From the NERC weekly meeting: these cannot be physically changed; looking into changing this via BMC or BIOS settings instead.
Assigning to myself to keep track with @jtriley
@jtriley Any update on the BIOS for these NICs?
@jtriley were we ever able to test the BIOS to see if that helped with the NICs?
@jtriley were we ever able to get the BIOS NIC location checked? @larsks is it still worth rebooting the machines to check the BIOS?
@larsks says to just go with the workaround and not worry about this.
We would like to verify that network cards in the NERC openshift nodes (both controllers and workers) are plugged into identical PCI slots. Can we get this information via the iDRAC so that we don't need to physically inspect the nodes? Or can we arrange to boot them into some sort of discovery image?