CCI-MOC / ops-issues

Check that NERC Controllers have NICs in the same physical slots #618

Closed: larsks closed this issue 1 year ago

larsks commented 2 years ago

We would like to verify that network cards in the NERC openshift nodes (both controllers and workers) are plugged into identical PCI slots. Can we get this information via the iDRAC so that we don't need to physically inspect the nodes? Or can we arrange to boot them into some sort of discovery image?

larsks commented 2 years ago

It looks like the answer here is "yes", now that we're running recent firmware on the prod hosts. @hakasapl there's an Ansible inventory to help with this available at https://github.com/OCP-on-NERC/nerc-ansible, along with some example playbooks that use racadm or redfish to interact with the hardware (we'll probably use the redfish api for this).
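
For context, here is a minimal sketch of the kind of Redfish query involved (the BMC address, credentials, and device id below are placeholders; the paths shown follow the Dell iDRAC layout, where the system resource is System.Embedded.1, and other vendors' firmware uses different paths):

# List the PCIe devices the BMC knows about (Dell iDRAC path shown; adjust per vendor)
curl -sk -u "$BMC_USER:$BMC_PASS" \
  "https://$BMC_ADDR/redfish/v1/Systems/System.Embedded.1/PCIeDevices" | jq '.Members'

# Drill into a single device; the "3-0" id is hypothetical, use one returned above
curl -sk -u "$BMC_USER:$BMC_PASS" \
  "https://$BMC_ADDR/redfish/v1/Systems/System.Embedded.1/PCIeDevices/3-0" \
  | jq '{Id, Name, DeviceType, Manufacturer}'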

larsks commented 2 years ago

See e.g. rf-get-inventory.yaml

larsks commented 2 years ago

@hakasapl can you publish a gpg key at https://keys.openpgp.org/ (and include the link here)? I'll use that to get you some credentials.

hakasapl commented 2 years ago

Sure, here it is: https://keys.openpgp.org/vks/v1/by-fingerprint/37D04530CD75ED9F7437263C7162C8607231A566

joachimweyl commented 2 years ago

@hakasapl what is the status of gathering the PCI slot information on the NERC clusters?

hakasapl commented 2 years ago

I have a VPN account now, but I'm still working through client issues with multiple local users - this is ongoing.

hakasapl commented 2 years ago

I'm making use of the scripts found in this repo: https://github.com/dell/iDRAC-Redfish-Scripting, along with the host list found here.

The following hosts did not support redfish (@larsks do the controllers support redfish?):

ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu
ctl-0-obm.nerc-ocp-prod.rc.fas.harvard.edu
ctl-1-obm.nerc-ocp-infra.rc.fas.harvard.edu
ctl-1-obm.nerc-ocp-prod.rc.fas.harvard.edu
ctl-2-obm.nerc-ocp-infra.rc.fas.harvard.edu
ctl-2-obm.nerc-ocp-prod.rc.fas.harvard.edu

The following hosts did not reply at all (These are all test nodes so I assume this is intended):

wrk-0-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-1-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-2-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-3-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-4-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-5-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-6-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-7-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-8-obm.nerc-ocp-test.rc.fas.harvard.edu
wrk-9-obm.nerc-ocp-test.rc.fas.harvard.edu

Everything else replied correctly, and every 10G NIC is in PCIe bus-slot 1-0. So the NICs are consistent. I'm attaching the raw output I'm basing this on below:

pcie-nerc-output.txt
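
The approach reduces to something like the sketch below (the host-list filename and credentials are illustrative; the actual run used the Dell scripts linked above). Note that the path shown is Dell-specific, which is why a query hard-coded for iDRAC paths can fail against other vendors' BMCs:

# Hypothetical bulk query: ask each OBM host for its PCIe device inventory
# (hosts.txt is an illustrative file with one OBM hostname per line)
while read -r host; do
  echo "== $host"
  curl -sk --connect-timeout 10 -u "$BMC_USER:$BMC_PASS" \
    "https://$host/redfish/v1/Systems/System.Embedded.1/PCIeDevices" \
    | jq -r '.Members[]."@odata.id"'
done < hosts.txt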

larsks commented 2 years ago

@hakasapl The controllers do support redfish:

$ https -pb --verify false --auth '...' https://ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu/redfish/v1/Systems
{
    "@odata.etag": "\"27f6eb13fa1c28a2711\"",
    "@odata.id": "/redfish/v1/Systems",
    "@odata.type": "#ComputerSystemCollection.ComputerSystemCollection",
    "Description": "A collection of ComputerSystem resource instances.",
    "Members": [
        {
            "@odata.id": "/redfish/v1/Systems/1"
        }
    ],
    "Members@odata.count": 1,
    "Name": "ComputerSystemCollection"
}

Note that the controllers are not Dell systems; you'll find the "system" resource at a different path, which is why the rf-get-inventory playbook looks like this:

        - name: get system
          delegate_to: localhost
          uri:
            url: "https://{{ bmc_addr }}/redfish/v1/Systems/{{ rf_system_id }}"
            user: "{{ bmc_user }}"
            password: "{{ bmc_password }}"
            validate_certs: false
            force_basic_auth: true

The rf_system_id variable comes from the inventory and is set based on the system vendor, but you can also just discover it by first requesting /redfish/v1/Systems and then getting the value from the Members array.
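
For example, a sketch of that discovery step done by hand with curl and jq instead of the playbook (same credentials as the httpie example above):

# Ask the BMC which ComputerSystem resources it exposes; the last path
# component of each member is the value to use for rf_system_id
curl -sk -u "$BMC_USER:$BMC_PASS" \
  "https://ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu/redfish/v1/Systems" \
  | jq -r '.Members[]."@odata.id"'
# -> /redfish/v1/Systems/1  (so rf_system_id would be "1" on the controllers)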

larsks commented 2 years ago

The following hosts did not reply at all (These are all test nodes so I assume this is intended):

Yeah, that's expected.

hakasapl commented 2 years ago

@larsks thanks for that info. The controllers do not have consistent PCIe slots. It looks like the ID is reported differently on these nodes, as they are not Dells. They have slot numbers directly:

ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu: slot_6
ctl-0-obm.nerc-ocp-prod.rc.fas.harvard.edu: slot_5
ctl-1-obm.nerc-ocp-infra.rc.fas.harvard.edu: slot_5
ctl-1-obm.nerc-ocp-prod.rc.fas.harvard.edu: slot_6
ctl-2-obm.nerc-ocp-infra.rc.fas.harvard.edu: slot_5
ctl-2-obm.nerc-ocp-prod.rc.fas.harvard.edu: slot_5

Do we need to make these consistent?

larsks commented 2 years ago

Do we need to make these consistent?

It would simplify things. It looks like that means moving the cards in these hosts into slot 5:

We are using the infra cluster, so we should coordinate shutting it down.

hakasapl commented 2 years ago

Next week I will be at MGHPCC on Friday (the 29th) instead of Thursday. Could we schedule a downtime for that day? I'll be there the following week as well so no worries if next week doesn't work.

It looks like ctl-0 and ctl-1 will need to be shut down for the operation. ctl-2 can stay running.

joachimweyl commented 2 years ago

@larsks do we want this to be done before we start the Prod installation for ACM configuration purposes?

joachimweyl commented 2 years ago

@hakasapl since ctl-0-obm.nerc-ocp-infra.rc.fas.harvard.edu is on infra and ctl-1-obm.nerc-ocp-prod.rc.fas.harvard.edu is on prod, only one host on each cluster needs to be taken down. So yes, infra should be able to stay running on ctl-1 and ctl-2. Prod is not running yet, but if it is by the time you do this, it should be able to stay up on ctl-0 and ctl-2.

hakasapl commented 2 years ago

Since these are SD530s, I don't think I'll be able to swap the slot location; that may just be how the blade chassis is set up. Assuming I have key access by then, I'll take a look next Friday (the 29th).

larsks commented 2 years ago

@joachimweyl I don't think this is critical for next week.

joachimweyl commented 2 years ago

@hakasapl what did we find out on Friday? Is there a way to move these NICs, or are the SD530s not swappable?

hakasapl commented 2 years ago

@joachimweyl I'm waiting on key access to R7 to be able to take a look. @msdisme any updates on that?

msdisme commented 2 years ago

Scott said: "I'm going to have to ask John Goodhue to create a new keyring with access for University RC and the racks those are associated with; otherwise, Hakan would have access to the entire row. Also, you could run lspci and see which slots have which devices (I think we already know the answer to that?)"
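
For reference, a sketch of what that lspci check would look like from a booted node (the PCI address is hypothetical, and not every platform exposes the physical-slot field):

# List Ethernet devices and their PCI addresses
lspci -nn | grep -i ethernet

# Verbose output for one address includes a "Physical Slot" line when the
# platform exposes it (the address below is just an example)
sudo lspci -vv -s 3b:00.0 | grep -i 'physical slot'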

msdisme commented 2 years ago

Quick follow-up: in this case I think we either need to find out whether the NICs can be moved, or look and see. This needs to be coordinated with shutting down infra, and we want to do it before we start depending on infra.

msdisme commented 2 years ago

@jtriley - this is the ticket we are tracking against for the NICs in infra.

  1. Ask about the hardware and whether the NICs can be moved.
  2. Coordinate with the Harvard team that owns the hardware to move them, in coordination with the ocp-nerc team (this is to make progress while, on a separate path, Scott arranges keyring access, which will likely take longer).

hakasapl commented 2 years ago

Blocked while waiting for R7 key access

msdisme commented 2 years ago

@jtriley - any updates on this ? (comment here: https://github.com/CCI-MOC/ops-issues/issues/618#issuecomment-1204277993)

msdisme commented 2 years ago

@jtriley - any updates on having the teams check on this (is moving the card possible?) and/or move the card?

hakasapl commented 2 years ago

From the NERC weekly meeting: these cannot be physically changed; we're looking into changing this in the BMC or BIOS settings.

joachimweyl commented 2 years ago

Assigning to myself to keep track with @jtriley

joachimweyl commented 1 year ago

@jtriley Any update on the BIOS for these NICs?

joachimweyl commented 1 year ago

@jtriley were we ever able to test the BIOS to see if that helped with the NICs?

joachimweyl commented 1 year ago

@jtriley were we ever able to get the BIOS NIC location checked? @larsks is it still worth rebooting the machines to check the BIOS?

joachimweyl commented 1 year ago

@larsks says to just go with the workaround and not worry about this.